Commit Graph

82876 Commits

Author SHA1 Message Date
Qu Wenruo
b7f9945a14 btrfs: handle tree backref walk error properly
[BUG]
Smatch reports the following errors related to commit ("btrfs: output
affected files when relocation fails"):

	fs/btrfs/inode.c:283 print_data_reloc_error()
	error: uninitialized symbol 'ref_level'.

[CAUSE]
That part of code is mostly copied from scrub, but unfortunately scrub
code from the beginning is not doing the error handling properly.

The offending code looks like this:

	do {
		ret = tree_backref_for_extent();
		btrfs_warn_rl();
	} while (ret != 1);

There are several problems involved:

- No error handling
  If that tree_backref_for_extent() failed, we would output the same
  error again and again, never really exit as it requires ret == 1 to
  exit.

- Always do one extra output
  As tree_backref_for_extent() only return > 0 if there is no more
  backref item.
  This means after the last item we hit, we would output an invalid
  error message for ret > 0 case.

[FIX]
Fix the old code by:

- Move @ref_root and @ref_level into the if branch
  And do not initialize them, so we can catch such uninitialized values
  just like what we do in the inode.c

- Explicitly check the return value of tree_backref_for_extent()
  And handle ret < 0 and ret > 0 cases properly.

- No more do {} while () loop
  Instead go while (true) {} loop since we will handle @ret manually.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Christoph Hellwig
f880fe6e0b btrfs: don't hold an extra reference for redirtied buffers
When btrfs_redirty_list_add redirties a buffer, it also acquires
an extra reference that is released on transaction commit.  But
this is not required as buffers that are dirty or under writeback
are never freed (look for calls to extent_buffer_under_io())).

Remove the extra reference and the infrastructure used to drop it
again.

History behind redirty logic:

In the first place, it used releasing_list to hold all the
to-be-released extent buffers, and decided which buffers to re-dirty at
the commit time. Then, in a later version, the behaviour got changed to
re-dirty a necessary buffer and add re-dirtied one to the list in
btrfs_free_tree_block(). In short, the list was there mostly for the
patch series' historical reason.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ add Naohiro's comment regarding history ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:26 +02:00
Christoph Hellwig
f18cc97845 btrfs: fix dirty_metadata_bytes for redirtied buffers
dirty_metadata_bytes is decremented in both places that clear the dirty
bit in a buffer, but only incremented in btrfs_mark_buffer_dirty, which
means that a buffer that is redirtied using btrfs_redirty_list_add won't
be added to dirty_metadata_bytes, but it will be subtracted when written
out, leading an inconsistency in the counter.

Move the dirty_metadata_bytes from btrfs_mark_buffer_dirty into
set_extent_buffer_dirty to also account for the redirty case, and remove
the now unused set_extent_buffer_dirty return value.

Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Johannes Thumshirn
bb5167e619 btrfs: unexport btrfs_run_discard_work and make it static
Mark btrfs_run_discard_work static and move it above its callers.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
016f9d0b74 btrfs: rename del_ptr to btrfs_del_ptr and export it
This exists internal to ctree.c, however btrfs check needs to use it for
some of its operations.  I'd rather not duplicate that code inside of
btrfs check as this is low level and I want to keep this code in one
place, so rename the function to btrfs_del_ptr and export it so that it
can be used inside of btrfs-progs safely.  Add a comment to make sure
this doesn't get removed by a future cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
b3cbfb0dd4 btrfs: add a btrfs_csum_type_size helper
This is needed in btrfs-progs for the tools that convert the checksum
types for file systems and a few other things.  We don't have it in the
kernel as we just want to get the size for the super blocks type.
However I don't want to have to manually add this every time we sync
ctree.c into btrfs-progs, so add the helper in the kernel with a note so
it doesn't get removed by a later cleanup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
a95b7f9360 btrfs: add __KERNEL__ check for btrfs_no_printk
We want to override this in btrfs-progs, so wrap this in the __KERNEL__
check so we can easily sync this to btrfs-progs and have our local
version of btrfs_no_printk do the work.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
f541833c8e btrfs: move split_flags/combine_flags helpers to inode-item.h
These are more related to the inode item flags on disk than the
in-memory btrfs_inode, move the helpers to inode-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
2cac5af165 btrfs: move btrfs_verify_level_key into tree-checker.c
This is more a buffer validation helper, move it into the tree-checker
files where it makes more sense.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
c26fa931eb btrfs: add __btrfs_check_node helper
This helper returns a btrfs_tree_block_status for the various errors,
and then btrfs_check_node() will return -EUCLEAN if it gets anything
other than BTRFS_TREE_BLOCK_CLEAN which will be used by the kernel.  In
the future btrfs-progs will use this helper instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
924452c80e btrfs: extend btrfs_leaf_check to return btrfs_tree_block_status
Instead of blanket returning -EUCLEAN for all the failures in
btrfs_check_leaf, use btrfs_tree_block_status and return the appropriate
status for each failure.  Rename the helper to __btrfs_check_leaf and
then make a wrapper of btrfs_check_leaf that will return -EUCLEAN to
non-clean error codes.  This will allow us to have the
__btrfs_check_leaf variant in btrfs-progs while keeping the behavior in
the kernel consistent.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
c8d5421563 btrfs: use btrfs_tree_block_status for leaf item errors
We have a variety of item specific errors that can occur.  For now
simply put these under the umbrella of BTRFS_TREE_BLOCK_INVALID_ITEM,
this can be fleshed out as we need in the future.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
a7b4e6c7aa btrfs: add btrfs_tree_block_status definitions to tree-checker.h
We use this in btrfs-progs to determine if we can fix different types of
corruptions.  We don't care about this in the kernel, however it would
be good to share this code between the kernel and btrfs-progs, so add
the status definitions so we can start converting the tree-checker code
over to using these status flags instead of blanket returning -EUCLEAN.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:25 +02:00
Josef Bacik
85d8a826c7 btrfs: simplify btrfs_check_leaf_* helpers into a single helper
We have two helpers for checking leaves, because we have an extra check
for debugging in btrfs_mark_buffer_dirty(), and at that stage we may
have item data that isn't consistent yet.  However we can handle this
case internally in the helper, if BTRFS_HEADER_FLAG_WRITTEN is set we
know the buffer should be internally consistent, otherwise we need to
skip checking the item data.

Simplify this helper down a single helper and handle the item data
checking logic internally to the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
4aec05fa5a btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing
in the eb already, so simply get the level from the eb inside of
btrfs_set_block_flags.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Josef Bacik
54d687c13a btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c
This is completely related to block rsv's, move it out of the free space
cache code and into block-rsv.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Qu Wenruo
94ead93e63 btrfs: scrub: use recovered data stripes as cache to avoid unnecessary read
For P/Q stripe scrub, we have quite some duplicated read IO:

- Data stripes read for verification
  This is triggered by the scrub_submit_initial_read() inside
  scrub_raid56_parity_stripe().

- Data stripes read (again) for P/Q stripe verification
  This is triggered by scrub_assemble_read_bios() from scrub_rbio().

  Although we can have hit rbio cache and avoid unnecessary read, the
  chance is very low, as scrub would easily flush the whole rbio cache.

This means, even we're just scrubbing a single P/Q stripe, we would read
the data stripes twice for the best case scenario.  If we need to
recover some data stripes, it would cause more reads on the same data
stripes, again and again.

However before we call raid56_parity_submit_scrub_rbio() we already
have all data stripes repaired and their contents ready to use.
But RAID56 cache is unaware about the scrub cache, thus RAID56 layer
itself still needs to re-read the data stripes.

To avoid such cache miss, this patch would:

- Introduce a new helper, raid56_parity_cache_data_pages()
  This function would grab the pages from an array, and copy the content
  to the rbio, marking all the involved sectors uptodate.

  The page copy is unavoidable because of the cache pages of rbio are all
  self managed, thus can not utilize outside pages without screwing up
  the lifespan.

- Use the repaired data stripes as cache inside
  scrub_raid56_parity_stripe()

By this, we ensure all the data sectors of the scrub rbio are already
uptodate, and no need to read them again from disk.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
7e5ba55994 btrfs: assert tree lock is held when removing free space entries
Removing a free space entry from an in memory space cache requires having
the corresponding btrfs_free_space_ctl's 'tree_lock' held. We have several
code paths that remove an entry, so add assertions where appropriate to
verify we are holding the lock, as the lock is acquired by some other
function up in the call chain, which makes it easy to miss in the future.

Note: for this to work we need to lock the local btrfs_free_space_ctl at
load_free_space_cache(), which was not being done because it's local,
declared on the stack, so no other task has access to it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9649bd9a29 btrfs: assert tree lock is held when linking free space
When linking a free space entry, at link_free_space(), the caller should
be holding the spinlock 'tree_lock' of the given btrfs_free_space_ctl
argument, which is necessary for manipulating the red black tree of free
space entries (done by tree_insert_offset(), which already asserts the
lock is held) and for manipulating the 'free_space', 'free_extents',
'discardable_extents' and 'discardable_bytes' counters of the given
struct btrfs_free_space_ctl.

So assert that the spinlock 'tree_lock' of the given btrfs_free_space_ctl
is held by the current task. We have multiple code paths that end up
calling link_free_space(), and all currently take the lock before calling
it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
91de9e978d btrfs: assert tree lock is held when searching for free space entries
When searching for a free space entry by offset, at tree_search_offset(),
we are supposed to have the btrfs_free_space_ctl's 'tree_lock' held, so
assert that. We have multiple callers of tree_search_offset(), and all
currently hold the necessary lock before calling it.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
13c2018fcc btrfs: assert proper locks are held at tree_insert_offset()
There are multiple code paths leading to tree_insert_offset(), and each
path takes the necessary locks before tree_insert_offset() is called,
since they do other things that require those locks to be held. This makes
it easy to miss the locking somewhere, so make tree_insert_offset() assert
that the required locks are being held by the calling task.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
0d6bac4d30 btrfs: simplify arguments to tree_insert_offset()
For the in-memory component of space caching (free space cache and free
space tree), three of the arguments passed to tree_insert_offset() can
always be taken from the new free space entry that we are about to add.

So simplify tree_insert_offset() to take the new entry instead of the
'offset', 'node' and 'bitmap' arguments. This will also allow to make
further changes simpler.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
b77433b144 btrfs: use precomputed end offsets at do_trimming()
The are two computations of end offsets at do_trimming() that are not
necessary, as they were previously computed and stored in local const
variables. So just use the variables instead, to make the source code
shorter and easier to read.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:24 +02:00
Filipe Manana
9085f42571 btrfs: avoid searching twice for previous node when merging free space entries
At try_merge_free_space(), avoid calling twice rb_prev() to find the
previous node, as that requires looping through the red black tree, so
store the result of the rb_prev() call and then use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Filipe Manana
fbb2e654d8 btrfs: avoid extra memory allocation when copying free space cache
At copy_free_space_cache(), we add a new entry to the block group's ctl
before we free the entry from the temporary ctl. Adding a new entry
requires the allocation of a new struct btrfs_free_space, so we can
avoid a temporary extra allocation by freeing the entry from the
temporary ctl before we add a new entry to the main ctl, which possibly
also reduces the chances for a memory allocation failure in case of very
high memory pressure. So just do that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Tom Rix
12df6a622e btrfs: simplify transid initialization in btrfs_ioctl_wait_sync
A small code simplification, move the default value of transid to its
initialization and remove the else-statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
b9a9a85059 btrfs: output affected files when relocation fails
[PROBLEM]
When relocation fails (mostly due to checksum mismatch), we only got
very cryptic error messages like:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

The end user has to decipher the above messages and use various tools to
locate the affected files and find a way to fix the problem (mostly
deleting the file).  This is not an easy work even for experienced
developer, not to mention the end users.

[SCRUB IS DOING BETTER]
By contrast, scrub is providing much better error messages:

  BTRFS error (device dm-4): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
  BTRFS warning (device dm-4): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
  BTRFS info (device dm-4): scrub: finished on devid 1 with status: 0

Which provides the affected files directly to the end user.

[IMPROVEMENT]
Instead of the generic data checksum error messages, which is not doing
a good job for data reloc inodes, this patch introduce a scrub like
backref walking based solution.

When a sector fails its checksum for data reloc inode, we go the
following workflow:

- Get the real logical bytenr
  For data reloc inode, the file offset is the offset inside the block
  group.
  Thus the real logical bytenr is @file_off + @block_group->start.

- Do an extent type check
  If it's tree blocks it's much easier to handle, just go through
  all the tree block backref.

- Do a backref walk and inode path resolution for data extents
  This is mostly the same as scrub.
  But unfortunately we can not reuse the same function as the output
  format is different.

Now the new output would be more user friendly:

  BTRFS info (device dm-4): relocating block group 13631488 flags data
  BTRFS warning (device dm-4): csum failed root -9 ino 257 off 0 logical 13631488 csum 0x373e1ae3 expected csum 0x98757625 mirror 1
  BTRFS warning (device dm-4): checksum error at logical 13631488 mirror 1 root 5 inode 257 offset 0 length 4096 links 1 (path: file)
  BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-4): balance: ended with status: -5

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
8bfec2e426 btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
e917ff56c8 btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
  mode
- btrfs_write_marked_extents - write marked extents, writeback with
  WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Christoph Hellwig
da02361807 btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.

E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:

Unpatched:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
adbe7e388e btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Anand Jain
29e70be261 btrfs: use SECTOR_SHIFT to convert physical offset to LBA
Use SECTOR_SHIFT while converting a physical address to an LBA, makes
it more readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
eee3b81178 btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:

- Always dump the leaf first, then the error message

- Output the slot number if possible
  Especially in __btrfs_free_extent() the leaf dump of extent tree can
  be pretty large.
  With an extra slot number it's much easier to locate the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:23 +02:00
Qu Wenruo
6c75a589cb btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block,
we can make them to accept const extent buffer pointer.

This removes a forced type convert in extent-tree, where we convert a
const extent buffer pointer to regular one, just to avoid compiler
warning.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Naohiro Aota
b5345d6cee btrfs: export bitmap_test_range_all_{set,zero}
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other
components. Move them to misc.h and use them in zoned.c. Also, as
find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned
int", convert the type to "unsigned long".

While at it, also rewrite the "if (...) return true; else return false;"
pattern and add const to the input bitmap.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
88ad95b055 btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a
sibling node/leaf, it's very unexpected to have the last key of the left
sibling greater than or equals to the first key of the right sibling, as
that means we have a (serious) corruption that breaks the key ordering
properties of a b+tree. Since this is unexpected, surround the comparison
with the unlikely macro, which helps the compiler generate better code
for the most expected case (no existing b+tree corruption). This is also
what we do for other unexpected cases of invalid key ordering (like at
btrfs_set_item_key_safe()).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f2db4d5cb4 btrfs: make btrfs_free_device() static
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Sweet Tea Dorminy
1b53e51a4a btrfs: don't commit transaction for every subvol create
Recently a Meta-internal workload encountered subvolume creation taking
up to 2s each, significantly slower than directory creation. As they
were hoping to be able to use subvolumes instead of directories, and
were looking to create hundreds, this was a significant issue. After
Josef investigated, it turned out to be due to the transaction commit
currently performed at the end of subvolume creation.

This change improves the workload by not doing transaction commit for every
subvolume creation, and merely requiring a transaction commit on fsync.
In the worst case, of doing a subvolume create and fsync in a loop, this
should require an equal amount of time to the current scheme; and in the
best case, the internal workload creating hundreds of subvolumes before
fsyncing is greatly improved.

While it would be nice to be able to use the log tree and use the normal
fsync path, log tree replay can't deal with new subvolume inodes
presently.

It's possible that there's some reason that the transaction commit is
necessary for correctness during subvolume creation; however,
git logs indicate that the commit dates back to the beginning of
subvolume creation, and there are no notes on why it would be necessary.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Filipe Manana
f469c8bd90 btrfs: unexport btrfs_prev_leaf()
btrfs_prev_leaf() is not used outside ctree.c, so there's no need to
export it at ctree.h - just make it static at ctree.c and move its
definition above btrfs_search_slot_for_read(), since that function
calls it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Christian Brauner
1784fbc2ed ovl: port to new mount api
We recently ported util-linux to the new mount api. Now the mount(8)
tool will by default use the new mount api. While trying hard to fall
back to the old mount api gracefully there are still cases where we run
into issues that are difficult to handle nicely.

Now with mount(8) and libmount supporting the new mount api I expect an
increase in the number of bug reports and issues we're going to see with
filesystems that don't yet support the new mount api. So it's time we
rectify this.

When ovl_fill_super() fails before setting sb->s_root, we need to cleanup
sb->s_fs_info.  The logic is a bit convoluted but tl;dr: If sget_fc() has
succeeded fc->s_fs_info will have been transferred to sb->s_fs_info.
So by the time ->fill_super()/ovl_fill_super() is called fc->s_fs_info
is NULL consequently fs_context->free() won't call ovl_free_fs().

If we fail before sb->s_root() is set then ->put_super() won't be called
which would call ovl_free_fs(). IOW, if we fail in ->fill_super() before
sb->s_root we have to clean it up.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
ac519625ed ovl: factor out ovl_parse_options() helper
For parsing a single mount option.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
af5f2396b6 ovl: store enum redirect_mode in config instead of a string
Do all the logic to set the mode during mount options parsing and
do not keep the option string around.

Use a constant_table to translate from enum redirect mode to string
in preperation for new mount api option parsing.

The mount option "off" is translated to either "follow" or "nofollow",
depending on the "redirect_always_follow" build/module config, so
in effect, there are only three possible redirect modes.

This results in a minor change to the string that is displayed
in show_options() - when redirect_dir is enabled by default and the user
mounts with the option "redirect_dir=off", instead of displaying the mode
"redirect_dir=off" in show_options(), the displayed mode will be either
"redirect_dir=follow" or "redirect_dir=nofollow", depending on the value
of "redirect_always_follow" build/module config.

The displayed mode reflects the effective mode, so mounting overlayfs
again with the dispalyed redirect_dir option will result with the same
effective and displayed mode.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:01 +03:00
Amir Goldstein
dcb399de1e ovl: pass ovl_fs to xino helpers
Internal ovl methods should use ovl_fs and not sb as much as
possible.

Use a constant_table to translate from enum xino mode to string
in preperation for new mount api option parsing.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
367d002d6c ovl: clarify ovl_get_root() semantics
Change the semantics to take a reference on upperdentry instead
of transferrig the reference.

This is needed for upcoming port to new mount api.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
e4599d4b1a ovl: negate the ofs->share_whiteout boolean
The default common case is that whiteout sharing is enabled.
Change to storing the negated no_shared_whiteout state, so we will not
need to initialize it.

This is the first step towards removing all config and feature
initializations out of ovl_fill_super().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Christian Brauner
f723edb8a5 ovl: check type and offset of struct vfsmount in ovl_entry
Porting overlayfs to the new amount api I started experiencing random
crashes that couldn't be explained easily. So after much debugging and
reasoning it became clear that struct ovl_entry requires the point to
struct vfsmount to be the first member and of type struct vfsmount.

During the port I added a new member at the beginning of struct
ovl_entry which broke all over the place in the form of random crashes
and cache corruptions. While there's a comment in ovl_free_fs() to the
effect of "Hack! Reuse ofs->layers as a vfsmount array before freeing
it" there's no such comment on struct ovl_entry which makes this easy to
trip over.

Add a comment and two static asserts for both the offset and the type of
pointer in struct ovl_entry.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-06-19 14:02:00 +03:00
Amir Goldstein
42dd69ae1a ovl: implement lazy lookup of lowerdata in data-only layers
Defer lookup of lowerdata in the data-only layers to first data access
or before copy up.

We perform lowerdata lookup before copy up even if copy up is metadata
only copy up.  We can further optimize this lookup later if needed.

We do best effort lazy lookup of lowerdata for d_real_inode(), because
this interface does not expect errors.  The only current in-tree caller
of d_real_inode() is trace_uprobe and this caller is likely going to be
followed reading from the file, before placing uprobes on offset within
the file, so lowerdata should be available when setting the uprobe.

Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
4166564478 ovl: prepare for lazy lookup of lowerdata inode
Make the code handle the case of numlower > 1 and missing lowerdata
dentry gracefully.

Missing lowerdata dentry is an indication for lazy lookup of lowerdata
and in that case the lowerdata_redirect path is stored in ovl_inode.

Following commits will defer lookup and perform the lazy lookup on
access.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
2b21da9208 ovl: prepare to store lowerdata redirect for lazy lowerdata lookup
Prepare to allow ovl_lookup() to leave the last entry in a non-dir
lowerstack empty to signify lazy lowerdata lookup.

In this case, ovl_lookup() stores the redirect path from metacopy to
lowerdata in ovl_inode, which is going to be used later to perform the
lazy lowerdata lookup.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:14 +03:00
Amir Goldstein
5436ab0a86 ovl: implement lookup in data-only layers
Lookup in data-only layers only for a lower metacopy with an absolute
redirect xattr.

The metacopy xattr is not checked on files found in the data-only layers
and redirect xattr are not followed in the data-only layers.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
37ebf056d6 ovl: introduce data-only lower layers
Introduce the format lowerdir=lower1:lower2::lowerdata1::lowerdata2
where the lower layers on the right of the :: separators are not merged
into the overlayfs merge dirs.

Data-only lower layers are only allowed at the bottom of the stack.

The files in those layers are only meant to be accessible via absolute
redirect from metacopy files in lower layers.  Following changes will
implement lookup in the data layers.

This feature was requested for composefs ostree use case, where the
lower data layer should only be accessiable via absolute redirects
from metacopy inodes.

The lower data layers are not required to a have a unique uuid or any
uuid at all, because they are never used to compose the overlayfs inode
st_ino/st_dev.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
9e88f90524 ovl: remove unneeded goto instructions
There is nothing in the out goto target of ovl_get_layers().

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
ab1eb5ffb7 ovl: deduplicate lowerdata and lowerstack[]
The ovl_inode contains a copy of lowerdata in lowerstack[], so the
lowerdata inode member can be removed.

Use accessors ovl_lowerdata*() to get the lowerdata whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
ac900ed4f2 ovl: deduplicate lowerpath and lowerstack[]
The ovl_inode contains a copy of lowerpath in lowerstack[0], so the
lowerpath member can be removed.

Use accessor ovl_lowerpath() to get the lowerpath whereever the member
was accessed directly.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
0af950f57f ovl: move ovl_entry into ovl_inode
The lower stacks of all the ovl inode aliases should be identical
and there is redundant information in ovl_entry and ovl_inode.

Move lowerstack into ovl_inode and keep only the OVL_E_FLAGS
per overlay dentry.

Following patches will deduplicate redundant ovl_inode fields.

Note that for pure upper and negative dentries, OVL_E(dentry) may be
NULL now, so it is imporatnt to use the ovl_numlower() accessor.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
163db0da35 ovl: factor out ovl_free_entry() and ovl_stack_*() helpers
In preparation for moving lowerstack into ovl_inode.

Note that in ovl_lookup() the temp stack dentry refs are now cloned
into the final ovl_lowerstack instead of being transferred, so cleanup
always needs to call ovl_stack_free(stack).

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
5522c9c7cb ovl: use ovl_numlower() and ovl_lowerstack() accessors
This helps fortify against dereferencing a NULL ovl_entry,
before we move the ovl_entry reference into ovl_inode.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:13 +03:00
Amir Goldstein
a6ff2bc0be ovl: use OVL_E() and OVL_E_FLAGS() accessors
Instead of open coded instances, because we are about to split
the two apart.

Reviewed-by: Alexander Larsson <alexl@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Amir Goldstein
b07d5cc93e ovl: update of dentry revalidate flags after copy up
After copy up, we may need to update d_flags if upper dentry is on a
remote fs and lower dentries are not.

Add helpers to allow incremental update of the revalidate flags.

Fixes: bccece1ead ("ovl: allow remote upper")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
f4e19e595c ovl: fix null pointer dereference in ovl_get_acl_rcu()
Following process:
         P1                     P2
 path_openat
  link_path_walk
   may_lookup
    inode_permission(rcu)
     ovl_permission
      acl_permission_check
       check_acl
        get_cached_acl_rcu
	 ovl_get_inode_acl
	  realinode = ovl_inode_real(ovl_inode)
	                      drop_cache
		               __dentry_kill(ovl_dentry)
				iput(ovl_inode)
		                 ovl_destroy_inode(ovl_inode)
		                  dput(oi->__upperdentry)
		                   dentry_kill(upperdentry)
		                    dentry_unlink_inode
				     upperdentry->d_inode = NULL
	    ovl_inode_upper
	     upperdentry = ovl_i_dentry_upper(ovl_inode)
	     d_inode(upperdentry) // returns NULL
	  IS_POSIXACL(realinode) // NULL pointer dereference
, will trigger an null pointer dereference at realinode:
  [  205.472797] BUG: kernel NULL pointer dereference, address:
                 0000000000000028
  [  205.476701] CPU: 2 PID: 2713 Comm: ls Not tainted
                 6.3.0-12064-g2edfa098e750-dirty #1216
  [  205.478754] RIP: 0010:do_ovl_get_acl+0x5d/0x300
  [  205.489584] Call Trace:
  [  205.489812]  <TASK>
  [  205.490014]  ovl_get_inode_acl+0x26/0x30
  [  205.490466]  get_cached_acl_rcu+0x61/0xa0
  [  205.490908]  generic_permission+0x1bf/0x4e0
  [  205.491447]  ovl_permission+0x79/0x1b0
  [  205.491917]  inode_permission+0x15e/0x2c0
  [  205.492425]  link_path_walk+0x115/0x550
  [  205.493311]  path_lookupat.isra.0+0xb2/0x200
  [  205.493803]  filename_lookup+0xda/0x240
  [  205.495747]  vfs_fstatat+0x7b/0xb0

Fetch a reproducer in [Link].

Use the helper ovl_i_path_realinode() to get realinode and then do
non-nullptr checking.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217404
Fixes: 332f606b32 ("ovl: enable RCU'd ->get_acl()")
Cc: <stable@vger.kernel.org> # v5.15
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
1a73f5b8f0 ovl: fix null pointer dereference in ovl_permission()
Following process:
          P1                     P2
 path_lookupat
  link_path_walk
   inode_permission
    ovl_permission
      ovl_i_path_real(inode, &realpath)
        path->dentry = ovl_i_dentry_upper(inode)
                          drop_cache
			   __dentry_kill(ovl_dentry)
		            iput(ovl_inode)
		             ovl_destroy_inode(ovl_inode)
		              dput(oi->__upperdentry)
		               dentry_kill(upperdentry)
		                dentry_unlink_inode
				 upperdentry->d_inode = NULL
      realinode = d_inode(realpath.dentry) // return NULL
      inode_permission(realinode)
       inode->i_sb  // NULL pointer dereference
, will trigger an null pointer dereference at realinode:
  [  335.664979] BUG: kernel NULL pointer dereference,
                 address: 0000000000000002
  [  335.668032] CPU: 0 PID: 2592 Comm: ls Not tainted 6.3.0
  [  335.669956] RIP: 0010:inode_permission+0x33/0x2c0
  [  335.678939] Call Trace:
  [  335.679165]  <TASK>
  [  335.679371]  ovl_permission+0xde/0x320
  [  335.679723]  inode_permission+0x15e/0x2c0
  [  335.680090]  link_path_walk+0x115/0x550
  [  335.680771]  path_lookupat.isra.0+0xb2/0x200
  [  335.681170]  filename_lookup+0xda/0x240
  [  335.681922]  vfs_statx+0xa6/0x1f0
  [  335.682233]  vfs_fstatat+0x7b/0xb0

Fetch a reproducer in [Link].

Use the helper ovl_i_path_realinode() to get realinode and then do
non-nullptr checking.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217405
Fixes: 4b7791b2e9 ("ovl: handle idmappings in ovl_permission()")
Cc: <stable@vger.kernel.org> # v5.19
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Christian Brauner <brauner@kernel.org>
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:01:12 +03:00
Zhihao Cheng
b2dd05f107 ovl: let helper ovl_i_path_real() return the realinode
Let helper ovl_i_path_real() return the realinode to prepare for
checking non-null realinode in RCU walking path.

[msz] Use d_inode_rcu() since we are depending on the consitency
between dentry and inode being non-NULL in an RCU setting.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Fixes: ffa5723c6d ("ovl: store lower path in ovl_inode")
Cc: <stable@vger.kernel.org> # v5.19
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-19 14:00:38 +03:00
Chuck Lever
5e092be741 NFSD: Distinguish per-net namespace initialization
I find the naming of nfsd_init_net() and nfsd_startup_net() to be
confusingly similar. Rename the namespace initialization and tear-
down ops and add comments to distinguish their separate purposes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:02:52 -04:00
Jeff Layton
ed9ab7346e nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net
Commit f5f9d4a314 ("nfsd: move reply cache initialization into nfsd
startup") moved the initialization of the reply cache into nfsd startup,
but didn't account for the stats counters, which can be accessed before
nfsd is ever started. The result can be a NULL pointer dereference when
someone accesses /proc/fs/nfsd/reply_cache_stats while nfsd is still
shut down.

This is a regression and a user-triggerable oops in the right situation:

- non-x86_64 arch
- /proc/fs/nfsd is mounted in the namespace
- nfsd is not started in the namespace
- unprivileged user calls "cat /proc/fs/nfsd/reply_cache_stats"

Although this is easy to trigger on some arches (like aarch64), on
x86_64, calling this_cpu_ptr(NULL) evidently returns a pointer to the
fixed_percpu_data. That struct looks just enough like a newly
initialized percpu var to allow nfsd_reply_cache_stats_show to access
it without Oopsing.

Move the initialization of the per-net+per-cpu reply-cache counters
back into nfsd_init_net, while leaving the rest of the reply cache
allocations to be done at nfsd startup time.

Kudos to Eirik who did most of the legwork to track this down.

Cc: stable@vger.kernel.org # v6.3+
Fixes: f5f9d4a314 ("nfsd: move reply cache initialization into nfsd startup")
Reported-and-tested-by: Eirik Fuller <efuller@redhat.com>
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2215429
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-18 12:02:40 -04:00
Joel Granados
2f2665c13a sysctl: replace child with an enumeration
This is part of the effort to remove the empty element at the end of
ctl_table structs. "child" was a deprecated elem in this struct and was
being used to differentiate between two types of ctl_tables: "normal"
and "permanently emtpy".

What changed?:
* Replace "child" with an enumeration that will have two values: the
  default (0) and the permanently empty (1). The latter is left at zero
  so when struct ctl_table is created with kzalloc or in a local
  context, it will have the zero value by default. We document the
  new enum with kdoc.
* Remove the "empty child" check from sysctl_check_table
* Remove count_subheaders function as there is no longer a need to
  calculate how many headers there are for every child
* Remove the recursive call to unregister_sysctl_table as there is no
  need to traverse down the child tree any longer
* Add a new SYSCTL_PERM_EMPTY_DIR binary flag
* Remove the last remanence of child from partport/procfs.c

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-18 02:32:54 -07:00
Joel Granados
94a6490518 sysctl: Remove debugging dump_stack
Remove unneeded dump_stack in __register_sysctl_table

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-06-18 02:32:54 -07:00
Jingbo Xu
f02615eb6f erofs: use separate xattr parsers for listxattr/getxattr
There's a callback styled xattr parser, i.e. xattr_foreach(), which is
shared among listxattr and getxattr.

Convert it to two separate xattr parsers to serve listxattr and getxattr
for better readability.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
4b077b5012 erofs: unify inline/shared xattr iterators for listxattr/getxattr
Make inline_{list,get}xattr() as well as inline_xattr_iter_begin()
unified as erofs_xattr_iter_inline(), and shared_{list,get}xattr()
unified as erofs_xattr_iter_shared().

After these changes, both erofs_xattr_iter_{inline,shared}() return 0 on
success, and negative error on failure.

One thing worth noting is that, the logic of returning it->buffer_ofs
when there's no shared xattrs in shared_listxattr() is moved to
erofs_listxattr() to make the unification possible.  The only difference
is that, semantically the old behavior will return ENOATTR rather than
it->buffer_ofs if ENOATTR encountered when listxattr is parsing upon a
specific shared xattr, while now the new behavior will return
it->buffer_ofs in this case.  This is not an issue, as listxattr upon a
specific xattr won't return ENOATTR.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
5a8ffb1975 erofs: make the size of read data stored in buffer_ofs
Since now xattr_iter structures have been unified, make the size of the
read data stored in buffer_ofs.  Don't bother reusing buffer_size for
this use, which may be confusing.

This is in preparation for the following further cleanup.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-4-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:54 +08:00
Jingbo Xu
8e823961de erofs: unify xattr_iter structures
Unify xattr_iter/listxattr_iter/getxattr_iter structures into
erofs_xattr_iter structure.

This is in preparation for the following further cleanup.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Jingbo Xu
eba67eb6de erofs: use absolute position in xattr iterator
Replace blkaddr/ofs with pos in 'struct erofs_xattr_iter'.

After erofs_bread() is introduced to replace raw page cache APIs for
metadata I/Os handling, xattr_iter_fixup() is no longer needed anymore.

In addition, it is also unnecessary to check if the iterated position is
span over the block boundary as absolute offset is used instead of
blkaddr + offset pairs.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613074114.120115-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Gao Xiang
001b8ccd06 erofs: fix compact 4B support for 16k block size
In compact 4B, two adjacent lclusters are packed together as a unit to
form on-disk indexes for effective random access, as below:

(amortized = 4, vcnt = 2)
       _____________________________________________
      |___@_____ encoded bits __________|_ blkaddr _|
      0        .                                    amortized * vcnt = 8
      .             .
      .                  .              amortized * vcnt - 4 = 4
      .                        .
      .____________________________.
      |_type (2 bits)_|_clusterofs_|

Therefore, encoded bits for each pack are 32 bits (4 bytes). IOWs,
since each lcluster can get 16 bits for its type and clusterofs, the
maximum supported lclustersize for compact 4B format is 16k (14 bits).

Fix this to enable compact 4B format for 16k lclusters (blocks), which
is tested on an arm64 server with 16k page size.

Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601112341.56960-1-hsiangkao@linux.alibaba.com
2023-06-18 12:10:53 +08:00
Jingbo Xu
9c39ec0cff erofs: convert erofs_read_metabuf() to erofs_bread() for xattr
buf->inode is constant once initialized during erofs_buf's lifetime.
Thus call erofs_init_metabuf() and erofs_bread() separately to avoid
the repetition of assigning buf->inode when iterating xattrs.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230601024347.108469-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-06-18 12:10:53 +08:00
Gao Xiang
43d86ec936 erofs: use poison pointer to replace the hard-coded address
It's safer and cleaner to replace such hard-coded illegal pointer
with poison pointers.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-7-hsiangkao@linux.alibaba.com
2023-06-18 12:10:53 +08:00
Gao Xiang
7674a42f35 erofs: use struct lockref to replace handcrafted approach
Let's avoid the current handcrafted lockref although `struct lockref`
inclusion usually increases extra 4 bytes with an explicit spinlock if
CONFIG_DEBUG_SPINLOCK is off.

Apart from the size difference, note that the meaning of refcount is
also changed to active users. IOWs, it doesn't take an extra refcount
for XArray tree insertion.

I don't observe any significant performance difference at least on
our cloud compute server but the new one indeed simplifies the
overall codebase a bit.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230529123727.79943-1-hsiangkao@linux.alibaba.com
2023-06-18 12:10:47 +08:00
Chuck Lever
262176798b NFSD: Add an nfsd4_encode_nfstime4() helper
Clean up: de-duplicate some common code.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-17 13:18:06 -04:00
Namjae Jeon
5005bcb421 ksmbd: validate session id and tree id in the compound request
This patch validate session id and tree id in compound request.
If first operation in the compound is SMB2 ECHO request, ksmbd bypass
session and tree validation. So work->sess and work->tcon could be NULL.
If secound request in the compound access work->sess or tcon, It cause
NULL pointer dereferecing error.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21165
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:51 -05:00
Namjae Jeon
5fe7f7b782 ksmbd: fix out-of-bound read in smb2_write
ksmbd_smb2_check_message doesn't validate hdr->NextCommand. If
->NextCommand is bigger than Offset + Length of smb2 write, It will
allow oversized smb2 write length. It will cause OOB read in smb2_write.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21164
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:36 -05:00
Namjae Jeon
40b268d384 ksmbd: add mnt_want_write to ksmbd vfs functions
ksmbd is doing write access using vfs helpers. There are the cases that
mnt_want_write() is not called in vfs helper. This patch add missing
mnt_want_write() to ksmbd vfs functions.

Cc: stable@vger.kernel.org
Cc: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:19 -05:00
Namjae Jeon
2b9b8f3b68 ksmbd: validate command payload size
->StructureSize2 indicates command payload size. ksmbd should validate
this size with rfc1002 length before accessing it.
This patch remove unneeded check and add the validation for this.

[    8.912583] BUG: KASAN: slab-out-of-bounds in ksmbd_smb2_check_message+0x12a/0xc50
[    8.913051] Read of size 2 at addr ffff88800ac7d92c by task kworker/0:0/7
...
[    8.914967] Call Trace:
[    8.915126]  <TASK>
[    8.915267]  dump_stack_lvl+0x33/0x50
[    8.915506]  print_report+0xcc/0x620
[    8.916558]  kasan_report+0xae/0xe0
[    8.917080]  kasan_check_range+0x35/0x1b0
[    8.917334]  ksmbd_smb2_check_message+0x12a/0xc50
[    8.917935]  ksmbd_verify_smb_message+0xae/0xd0
[    8.918223]  handle_ksmbd_work+0x192/0x820
[    8.918478]  process_one_work+0x419/0x760
[    8.918727]  worker_thread+0x2a2/0x6f0
[    8.919222]  kthread+0x187/0x1d0
[    8.919723]  ret_from_fork+0x1f/0x30
[    8.919954]  </TASK>

Cc: stable@vger.kernel.org
Reported-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-16 21:04:03 -05:00
David Howells
ba00b19067 afs: Fix vlserver probe RTT handling
In the same spirit as commit ca57f02295 ("afs: Fix fileserver probe
RTT handling"), don't rule out using a vlserver just because there
haven't been enough packets yet to calculate a real rtt.  Always set the
server's probe rtt from the estimate provided by rxrpc_kernel_get_srtt,
which is capped at 1 second.

This could lead to EDESTADDRREQ errors when accessing a cell for the
first time, even though the vl servers are known and have responded to a
probe.

Fixes: 1d4adfaf65 ("rxrpc: Make rxrpc_kernel_get_srtt() indicate validity")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2023-June/006746.html
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-16 14:43:41 -07:00
Linus Torvalds
4973ca2955 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSMg4YACgkQxWXV+ddt
 WDvNxg/9G45Lcn3YPYXicbzKcrrz4fpg4gqx9IX226DfJX78iZskl3LN1w+gFcj0
 gAKSC73ZZCGhIqrHOuWIbH5+BRO3FzTB9zr7tfx4H+pFWHs0BgYPqcoBjLTHZ/Pn
 2RYu+F922tGaPW7LZ2LtGlv+8Y4IDtWVe6uRyxSqv3dtF1jcgUfnJk2zJXG5z41R
 h1BSX7mcWUxUXbSJqTzAij7jyvbpnmy1BjsGDRG2G2J/AmvpUBtx1Gc3aKWhD2Up
 vNLQkl4OxbaW1t8CV9u6iGduS5mUAetOXoT2DTr3sSQMeA56Gpues/qb6qQVTbwb
 2cBnwQugZyz39yZkyvvopy6z2rasMmw6V/aPLKTLvPN/P+DYwU+bfcFuNa+LFxz4
 KJqGvZdrwDlhGc80+xjKhly4zLahAt0H+Y1yKjRK2RRx/TsXl4ufVc5hpq9rj8eK
 AoNvoZw9W3/L0juMUfZILhMbD2f7XGbUXlNhIXHCZsOZzuZBqNMNNv9d8b5ncbWE
 q6a5EJXzQzk13kiurVBZJoZokYxsUzEBsKeij4aaP1Rkw8r/62GvEt79Nu8X+67+
 cQyZ6CQ6eZ2PsPx9DtooCbAnH6huIPf9yagn5J2Li6H6VdvOlP6zIi7Tp33AhPdp
 1BMfaNq46l6Gxiu1pnclzSb8abVLb71ZxXNItEK/EkbH/uktaro=
 =NAyd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two fixes for NOCOW files, a regression fix in scrub and an assertion
  fix:

   - NOCOW fixes:
      - keep length of iomap direct io request in case of a failure
      - properly pass mode of extent reference checking, this can break
        some cases for swapfile

   - fix error value confusion when scrubbing a stripe

   - convert assertion to a proper error handling when loading global
     roots, reported by syzbot"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: fix a return value overwrite in scrub_stripe()
  btrfs: do not ASSERT() on duplicated global roots
  btrfs: can_nocow_file_extent should pass down args->strict from callers
  btrfs: fix iomap_begin length for nocow writes
2023-06-16 12:41:56 -07:00
Christoph Hellwig
2e82f6c3bf splice: simplify a conditional in copy_splice_read
Check for -EFAULT instead of wrapping the check in an ret < 0 block.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
Christoph Hellwig
0b24be4691 splice: don't call file_accessed in copy_splice_read
copy_splice_read calls into ->read_iter to read the data, which already
calls file_accessed.

Fixes: 33b3b04154 ("splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:08 -06:00
David Howells
ca2d49f77c splice, net: Fix splice_to_socket() to handle pipe bufs larger than a page
splice_to_socket() assumes that a pipe_buffer won't hold more than a single
page of data - but this assumption can be violated by skb_splice_bits()
when it splices from a socket into a pipe.

The problem is that splice_to_socket() doesn't advance the pipe_buffer
length and offset when transcribing from the pipe buf into a bio_vec, so if
the buf is >PAGE_SIZE, it keeps repeating the same initial chunk and
doesn't advance the tail index.  It then subtracts this from "remain" and
overcounts the amount of data to be sent.

The cleanup phase then tries to overclean the pipe, hits an unused pipe buf
and a NULL-pointer dereference occurs.

Fix this by not restricting the bio_vec size to PAGE_SIZE and instead
transcribing the entirety of each pipe_buffer into a single bio_vec and
advancing the tail index if remain hasn't hit zero yet.

Large bio_vecs will then be split up by iterator functions such as
iov_iter_extract_pages().

This resulted in a KASAN report looking like:

general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
...
RIP: 0010:pipe_buf_release include/linux/pipe_fs_i.h:203 [inline]
RIP: 0010:splice_to_socket+0xa91/0xe30 fs/splice.c:933

Fixes: 2dc334f1a6 ("splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()")
Reported-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/0000000000000900e905fdeb8e39@google.com/
Tested-by: syzbot+f9e28a23426ac3b24f20@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: David Ahern <dsahern@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christian Brauner <brauner@kernel.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/1428985.1686737388@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:50:02 -07:00
Jakub Kicinski
173780ff18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

include/linux/mlx5/driver.h
  617f5db1a6 ("RDMA/mlx5: Fix affinity assignment")
  dc13180824 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/

tools/testing/selftests/net/mptcp/mptcp_join.sh
  47867f0a7e ("selftests: mptcp: join: skip check if MIB counter not supported")
  425ba80312 ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
  45b1a1227a ("mptcp: introduces more address related mibs")
  0639fa230a ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-15 22:19:41 -07:00
Linus Torvalds
62d8779610 Fix two regressions in ext4, one report by syzkaller[1], and reported by
multiple users (and tracked by regzbot[2]).
 
 [1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
 [2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSLPr8ACgkQ8vlZVpUN
 gaO08wgAlk1HqZub9iJm7ZQRx4qVb4zGYGL8So4IE/971FT8Vq3suJsQlTWpPVBB
 8O6H5bsq5F54wfjmY44aUyBjc1SEJL2zfrEeMxcsxH+1tEoEe8i+FT3BXVr6+Y7k
 qdZYrqmtbYpM2dyNW7LY4UDRQBE6ug+r6Cq1z2Mo7cfwfiBcYYe7lncbl3xJe7D8
 8DZLgjQg48wxvU5EOK2yHMHw0YfJySnmaCo7J7sU5TLZj91IEnlNGwAjYIy1GUBV
 3I+/+U70QR1VtZig7v4q6TI3LDHfegRc8zIIOtM3u1qs2bzLFgyoNURtut2vhF2+
 /XyXxFabmlpG7yp3UZVetCECKVa2Kg==
 =WJ4O
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4, one report by syzkaller[1], and reported
  by multiple users (and tracked by regzbot[2])"

[1] https://syzkaller.appspot.com/bug?extid=4acc7d910e617b360859
[2] https://linux-regtracking.leemhuis.info/regzbot/regression/ZIauBR7YiV3rVAHL@glitch/

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: drop the call to ext4_error() from ext4_get_group_info()
  Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
2023-06-15 15:40:58 -07:00
Linus Torvalds
7a043feb6c eight, mostly small, smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSKVUsACgkQiiy9cAdy
 T1EwhAv/bazgxHAjxmoZ8hxw21uH/Ufvf2Cvt49yFTjxiaOCoSPLovA5AyB1FlTz
 xsw/Cf6haO+RIYpLQ1KAIg7Qt3zwUu50Io2s/yyaHmQu7d+eNUPISajrJIjYOnqB
 tAVHav0XcvTaPAzyr9luqmBZnLaAeatK5CbrK+gT4xstW/cuTu/F4JUF8fOL++ww
 /IDTvo6UKqEKuPD8dpszXZnJjuj8aIsRIh8CBGI73QrDVExTzhrdmRULWqwSj0Lr
 nHsh8hgUZ25dSe/9jIagPqgvzaizvcyzopJuO8mpQAMDww9o1kBlud9GLdgNGnfM
 bDHHbn3Y9Fu5VaxMJ/n9lozMXT+OB+3ZrrBFP8J0unSidzsEzWKfYjE0gPXGKru8
 YhvjWKdJZjIOEu9QRuD4KK+MOF20cuq3OmvKpoZiEr2niB7oX25B+JddxW2CJD5a
 oAiP1+1/wiVe79KUlLLRe+OTtujUXtMiXPePzHh15dYetg+P4qJCBqWjDPdzRmdr
 MfMhsZpq
 =peSC
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Eight, mostly small, smb3 client fixes:

   - important fix for deferred close oops (race with unmount) found
     with xfstest generic/098 to some servers

   - important reconnect fix

   - fix problem with max_credits mount option

   - two multichannel (interface related) fixes

   - one trivial removal of confusing comment

   - two small debugging improvements (to better spot crediting
     problems)"

* tag '6.4-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: add a warning when the in-flight count goes negative
  cifs: fix lease break oops in xfstest generic/098
  cifs: fix max_credits implementation
  cifs: fix sockaddr comparison in iface_cmp
  smb/client: print "Unknown" instead of bogus link speed value
  cifs: print all credit counters in DebugData
  cifs: fix status checks in cifs_tree_connect
  smb: remove obsolete comment
2023-06-15 15:24:33 -07:00
Jan Kara
c541dce86c fs: Protect reconfiguration of sb read-write from racing writes
The reconfigure / remount code takes a lot of effort to protect
filesystem's reconfiguration code from racing writes on remounting
read-only. However during remounting read-only filesystem to read-write
mode userspace writes can start immediately once we clear SB_RDONLY
flag. This is inconvenient for example for ext4 because we need to do
some writes to the filesystem (such as preparation of quota files)
before we can take userspace writes so we are clearing SB_RDONLY flag
before we are fully ready to accept userpace writes and syzbot has found
a way to exploit this [1]. Also as far as I'm reading the code
the filesystem remount code was protected from racing writes in the
legacy mount path by the mount's MNT_READONLY flag so this is relatively
new problem. It is actually fairly easy to protect remount read-write
from racing writes using sb->s_readonly_remount flag so let's just do
that instead of having to workaround these races in the filesystem code.

[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/

Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230615113848.8439-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 16:58:04 +02:00
Miquel Raynal
a91845b9a8 sysfs: Skip empty folders creation
Most sysfs attributes are statically defined, the goal with this design
being to be able to move all the filesystem description into read-only
memory. Anyway, it may be relevant in some cases to populate attributes
at run time. This leads to situation where an attribute may or may not be
present depending on conditions which are not known at compile
time, up to the point where no attribute at all gets added in a folder
which then becomes "sometimes" empty. Problem is, providing an attribute
group with a name and without .[bin_]attrs members will be loudly
refused by the core, leading in most cases to a device registration
failure.

The simple way to support such situation right now is to dynamically
allocate an empty attribute array, which is:
* a (small) waste of space
* a waste of time
* disturbing, to say the least, as an empty sysfs folder will be created
  anyway.

Another (even worse) possibility would be to dynamically overwrite a
member of the attribute_group list, hopefully the last, which is also
supposed to remain in the read-only section.

In order to avoid these hackish situations, while still giving a little
bit of flexibility, we might just check the validity of the .[bin_]attrs
list and, if empty, just skip the attribute group creation instead of
failing. This way, developers will not be tempted to workaround the
core with useless allocations or strange writes on supposedly read-only
structures.

The content of the WARN() message is kept but turned into a debug
message in order to help developers understanding why their sysfs
folders might now silently fail to be created.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Message-ID: <20230614063018.2419043-3-miquel.raynal@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-15 13:37:53 +02:00
Miquel Raynal
4981e0139f sysfs: Improve readability by following the kernel coding style
The purpose of the if/else block is to select the right sysfs directory
entry to be used for the files creation. At a first look when you have
the file in front of you, it really seems like the "create_files()"
lines right after the block are badly indented and the "else" does not
guard. In practice the code is correct but lacks curly brackets to show
where the big if/else block actually ends. Add these brackets to comply
with the current kernel coding style and to ease the understanding of
the whole logic.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Message-ID: <20230614063018.2419043-2-miquel.raynal@bootlin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-15 13:37:50 +02:00
Andreas Gruenbacher
cad1e15804 gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
Rename the SDF_FS_FROZEN flag to SDF_FREEZE_INITIATOR to indicate more
clearly that the node that has this flag set is the initiator of the
freeze.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
9e4f09565f gfs2: Reconfiguring frozen filesystem already rejected
Reconfiguring a frozen filesystem is already rejected in
reconfigure_super(), so there is no need to check for that condition
again at the filesystem level.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
e392edd5d5 gfs2: Rename gfs2_freeze_lock{ => _shared }
Rename gfs2_freeze_lock to gfs2_freeze_lock_shared to make it a bit more
obvious that this function establishes the "thawed" state of the freeze
glock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
097cca525a gfs2: Rename the {freeze,thaw}_super callbacks
Rename gfs2_freeze to gfs2_freeze_super and gfs2_unfreeze to
gfs2_thaw_super to match the names of the corresponding super
operations.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Andreas Gruenbacher
af1abe1146 gfs2: Rename remaining "transaction" glock references
The transaction glock was repurposed to serve as the new freeze glock
years ago.  Don't refer to it as the transaction glock anymore.

Also, to be more precise, call it the "freeze glock" instead of the
"freeze lock".  Ditto for the journal glock.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-15 09:57:38 +02:00
Jeff Layton
797a1d894d autofs: set ctime as well when mtime changes on a dir
When adding entries to a directory, POSIX generally requires that the
ctime also be updated alongside the mtime.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Ian Kent <raven@themaw.net>
Message-Id: <20230612104524.17058-4-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:24 +02:00
Wen Yang
33d8b5d782 eventfd: show the EFD_SEMAPHORE flag in fdinfo
The EFD_SEMAPHORE flag should be displayed in fdinfo,
as different value could affect the behavior of eventfd.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Wen Yang <wenyang.linux@foxmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dylan Yudaken <dylany@fb.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Eric Biggers <ebiggers@google.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <tencent_05B9CFEFE6B9BC2A9B3A27886A122A7D9205@qq.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:23 +02:00
Fabio M. De Francesco
5c075c5b8f fs/aio: Stop allocating aio rings from HIGHMEM
There is no need to allocate aio rings from HIGHMEM because of very
little memory needed here.

Therefore, use GFP_USER flag in find_or_create_page() and get rid of
kmap*() mappings.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Message-Id: <20230609145937.17610-1-fmdefrancesco@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-15 09:22:23 +02:00
Kemeng Shi
19a043bb1f ext4: try all groups in ext4_mb_new_blocks_simple
ext4_mb_new_blocks_simple ignores the group before goal, so it will fail
if free blocks reside in group before goal. Try all groups to avoid
unexpected failure.
Search finishes either if any free block is found or if no available
blocks are found. Simpliy check "i >= max" to distinguish the above
cases.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
95a4c3c7e0 ext4: remove ext4_block_group and ext4_block_group_offset declaration
For ext4_block_group and ext4_block_group_offset, there are only
declaration without definition. Just remove them.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
1eff590489 ext4: add EXT4_MB_HINT_GOAL_ONLY test in ext4_mb_use_preallocated
ext4_mb_use_preallocated will ignore the demand to alloc goal blocks,
although the EXT4_MB_HINT_GOAL_ONLY is requested.
For group pa, ext4_mb_group_or_file will not set EXT4_MB_HINT_GROUP_ALLOC
if EXT4_MB_HINT_GOAL_ONLY is set. So we will not alloc goal blocks from
group pa if EXT4_MB_HINT_GOAL_ONLY is set.
For inode pa, ext4_mb_pa_goal_check is added to check if free extent in
found inode pa meets goal blocks when EXT4_MB_HINT_GOAL_ONLY is set.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Suggested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
c3defd99d5 ext4: treat stripe in block unit
Stripe is misused in block unit and in cluster unit in different code
paths. User awared of stripe maybe not awared of bigalloc feature, so
treat stripe only in block unit to fix this.
Besides, it's hard to get stripe aligned blocks (start and length are both
aligned with stripe) if stripe is not aligned with cluster, just disable
stripe and alert user in this case to simpfy the code and avoid
unnecessary work to get stripe aligned blocks which likely to be failed.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
99c515e3a8 ext4: fix wrong unit use in ext4_mb_find_by_goal
We need start in block unit while fe_start is in cluster unit. Use
ext4_grp_offs_to_block helper to convert fe_start to get start in
block unit.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
497885f72d ext4: fix unit mismatch in ext4_mb_new_blocks_simple
The "i" returned from mb_find_next_zero_bit is in cluster unit and we
need offset "block" corresponding to "i" in block unit. Convert "i" to
block unit to fix the unit mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Kemeng Shi
b3916da0d9 ext4: fix wrong unit use in ext4_mb_normalize_request
NRL_CHECK_SIZE will compare input req and size, so req and size should
be in same unit. Input req "fe_len" is in cluster unit while input
size "(8<<20)>>bsbits" is in block unit. Convert "fe_len" to block
unit to fix the mismatch.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20230603150327.3596033-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Matthew Wilcox
0dea40aa31 ext4: Call fsverity_verify_folio()
Now that fsverity supports working on entire folios, call
fsverity_verify_folio() instead of fsverity_verify_page()

Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230516192713.1070469-1-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
d19500da4b ext4: Make ext4_write_inline_data_end() use folio
ext4_write_inline_data_end() is completely converted to work with folio.
Also all callers of ext4_write_inline_data_end() already works on folio
except ext4_da_write_end(). Mostly for consistency and saving few
instructions maybe, this patch just converts ext4_da_write_end() to work
with folio which makes the last caller of ext4_write_inline_data_end()
also converted to work with folio.
We then make ext4_write_inline_data_end() take folio instead of page.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/1bcea771720ff451a5a59b3f1bcd5fae51cb7ce7.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
80be8c5cc9 ext4: Make mpage_journal_page_buffers use folio
This patch converts mpage_journal_page_buffers() to use folio and also
removes the PAGE_SIZE assumption.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/ebc3ac80e6a54b53327740d010ce684a83512021.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
36c9b45040 ext4: Change remaining tracepoints to use folio
ext4_readpage() is converted to ext4_read_folio() hence change the
related tracepoint from trace_ext4_readpage(page) to
trace_ext4_read_folio(folio). Do the same for
trace_ext4_releasepage(page) to trace_ext4_release_folio(folio)

As a minor bit of optimization to avoid an extra dereferencing,
since both of the above functions already were dereferencing
folio->mapping->host, hence change the tracepoint argument to take
(inode, folio).

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/caba2b3c0147bed4ea7706767dc1d19cd0e29ab0.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Ritesh Harjani
0b956de151 ext4: kill unused function ext4_journalled_write_inline_data
Commit 3f079114bf ("ext4: Convert data=journal writeback to use ext4_writepages()")
Added support for writeback of journalled data into ext4_writepages()
and killed function __ext4_journalled_writepage() which used to call
ext4_journalled_write_inline_data() for inline data.
This function got left over by mistake. Hence kill it's definition as
no one uses it.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/122b2a8d5e0650686f23ed6da26ed9e04105562b.1684122756.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-15 00:02:10 -04:00
Fabio M. De Francesco
f451fd97dd ext4: drop the call to ext4_error() from ext4_get_group_info()
A recent patch added a call to ext4_error() which is problematic since
some callers of the ext4_get_group_info() function may be holding a
spinlock, whereas ext4_error() must never be called in atomic context.

This triggered a report from Syzbot: "BUG: sleeping function called from
invalid context in ext4_update_super" (see the link below).

Therefore, drop the call to ext4_error() from ext4_get_group_info(). In
the meantime use eight characters tabs instead of nine characters ones.

Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/
Fixes: 5354b2af34 ("ext4: allow ext4_get_group_info() to fail")
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com
2023-06-14 22:24:05 -04:00
Kemeng Shi
1948279211 Revert "ext4: remove unnecessary check in ext4_bg_num_gdb_nometa"
This reverts commit ad3f09be6c.

The reverted commit was intended to simpfy the code to get group
descriptor block number in non-meta block group by assuming
s_gdb_count is block number used for all non-meta block group descriptors.
However s_gdb_count is block number used for all meta *and* non-meta
group descriptors. So s_gdb_group will be > actual group descriptor block
number used for all non-meta block group which should be "total non-meta
block group" / "group descriptors per block", e.g. s_first_meta_bg.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230613225025.3859522-1-shikemeng@huaweicloud.com
Fixes: ad3f09be6c ("ext4: remove unnecessary check in ext4_bg_num_gdb_nometa")
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-14 22:22:27 -04:00
Jiasheng Jiang
d97038d5ec pstore/ram: Add check for kstrdup
Add check for the return value of kstrdup() and return the error
if it fails in order to avoid NULL pointer dereference.

Fixes: e163fdb3f7 ("pstore/ram: Regularize prz label allocation lifetime")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230614093733.36048-1-jiasheng@iscas.ac.cn
2023-06-14 11:52:10 -07:00
Eric Biggers
74836ecbc5 fsverity: rework fsverity_get_digest() again
Address several issues with the calling convention and documentation of
fsverity_get_digest():

- Make it provide the hash algorithm as either a FS_VERITY_HASH_ALG_*
  value or HASH_ALGO_* value, at the caller's choice, rather than only a
  HASH_ALGO_* value as it did before.  This allows callers to work with
  the fsverity native algorithm numbers if they want to.  HASH_ALGO_* is
  what IMA uses, but other users (e.g. overlayfs) should use
  FS_VERITY_HASH_ALG_* to match fsverity-utils and the fsverity UAPI.

- Make it return the digest size so that it doesn't need to be looked up
  separately.  Use the return value for this, since 0 works nicely for
  the "file doesn't have fsverity enabled" case.  This also makes it
  clear that no other errors are possible.

- Rename the 'digest' parameter to 'raw_digest' and clearly document
  that it is only useful in combination with the algorithm ID.  This
  hopefully clears up a point of confusion.

- Export it to modules, since overlayfs will need it for checking the
  fsverity digests of lowerdata files
  (https://lore.kernel.org/r/dd294a44e8f401e6b5140029d8355f88748cd8fd.1686565330.git.alexl@redhat.com).

Acked-by: Mimi Zohar <zohar@linux.ibm.com> # for the IMA piece
Link: https://lore.kernel.org/r/20230612190047.59755-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-14 10:41:07 -07:00
Qu Wenruo
b50f2d048e btrfs: scrub: fix a return value overwrite in scrub_stripe()
[RETURN VALUE OVERWRITE]
Inside scrub_stripe(), we would submit all the remaining stripes after
iterating all extents.

But since flush_scrub_stripes() can return error, we need to avoid
overwriting the existing @ret if there is any error.

However the existing check is doing the wrong check:

	ret2 = flush_scrub_stripes();
	if (!ret2)
		ret = ret2;

This would overwrite the existing @ret to 0 as long as the final flush
detects no critical errors.

[FIX]
We should check @ret other than @ret2 in that case.

Fixes: 8eb3dd17ea ("btrfs: dev-replace: error out if we have unrepaired metadata error during")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-14 18:30:30 +02:00
Alexander Aring
1696c75f18 fs: dlm: add send ack threshold and append acks to msgs
This patch changes the time when we sending an ack back to tell the
other side it can free some message because it is arrived on the
receiver node, due random reconnects e.g. TCP resets this is handled as
well on application layer to not let DLM run into a deadlock state.

The current handling has the following problems:

1. We end in situations that we only send an ack back message of 16
   bytes out and no other messages. Whereas DLM has logic to combine
   so much messages as it can in one send() socket call. This behaviour
   can be discovered by "trace-cmd start -e dlm_recv" and observing the
   ret field being 16 bytes.

2. When processing of DLM messages will never end because we receive a
   lot of messages, we will not send an ack back as it happens when
   the processing loop ends.

This patch introduces a likely and unlikely threshold case. The likely
case will send an ack back on a transmit path if the threshold is
triggered of amount of processed upper layer protocol. This will solve
issue 1 because it will be send when another normal DLM message will be
sent. It solves issue 2 because it is not part of the processing loop.

There is however a unlikely case, the unlikely case has a bigger
threshold and will be triggered when we only receive messages and do not
sent any message back. This case avoids that the sending node will keep
a lot of message for a long time as we send sometimes ack backs to tell
the sender to finally release messages.

The atomic cmpxchg() is there to provide a atomically ack send with
reset of the upper layer protocol delivery counter.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
d00725cab2 fs: dlm: handle sequence numbers as atomic
Currently seq_next is only be read on the receive side which processed
in an ordered way. The seq_send is being protected by locks. To being
able to read the seq_next value on send side as well we convert it to an
atomic_t value. The atomic_cmpxchg() is probably not necessary, however
the atomic_inc() depends on a if coniditional and this should be handled
in an atomic context.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
75a7d60134 fs: dlm: handle lkb wait count as atomic_t
Currently the lkb_wait_count is locked by the rsb lock and it should be
fine to handle lkb_wait_count as non atomic_t value. However for the
overall process of reducing locking this patch converts it to an
atomic_t value.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
07ee38674a fs: dlm: filter ourself midcomms calls
It makes no sense to call midcomms/lowcomms functionality for the local
node as socket functionality is only required for remote nodes. This
patch filters those calls in the upper layer of lockspace membership
handling instead of doing it in midcomms/lowcomms layer as they should
never be aware of local nodeid.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
70cf2fecf8 fs: dlm: warn about messages from left nodes
This patch warns about messages which are received from nodes who
already left the lockspace resource signaled by the cluster manager.
Before commit 489d8e559c ("fs: dlm: add reliable connection if
reconnect") there was a synchronization issue with the socket
lifetime and the cluster event of leaving a lockspace and other
nodes did not stop of sending messages because the cluster manager has a
pending message to leave the lockspace. The reliable session layer for
dlm use sequence numbers to ensure dlm message were never being dropped.
If this is not corrected synchronized we have a problem, this patch will
use the filter case and turn it into a WARN_ON_ONCE() so we seeing such
issue on the kernel log because it should never happen now.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
5ce9ef30f2 fs: dlm: move dlm_purge_lkb_callbacks to user module
This patch moves the dlm_purge_lkb_callbacks() function from ast to user
dlm module as it is only a function being used by dlm user
implementation. I got be hinted to hold specific locks regarding the
callback handling for dlm_purge_lkb_callbacks() but it was false
positive. It is confusing because ast dlm implementation uses a
different locking behaviour as user locks uses as DLM handles kernel and
user dlm locks differently. To avoid the confusing we move this function
to dlm user implementation.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
d41a1a3db4 fs: dlm: cleanup STOP_IO bitflag set when stop io
There should no difference between setting the CF_IO_STOP flag
before restore_callbacks() to do it before or afterwards. The
restore_callbacks() will be sure that no callback is executed anymore
when the bit wasn't set.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
f8bce79d9d fs: dlm: don't check othercon twice
This patch removes an another check if con->othercon set inside the
branch which already does that.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
cbba21169e fs: dlm: unregister memory at the very last
The dlm modules midcomms, debugfs, lockspace, uses kmem caches. We
ensure that the kmem caches getting deallocated after those modules
exited.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
f68bb23cad fs: dlm: fix missing pending to false
This patch sets the process_dlm_messages_pending boolean to false when
there was no message to process. It is a case which should not happen
but if we are prepared to recover from this situation by setting pending
boolean to false.

Cc: stable@vger.kernel.org
Fixes: dbb751ffab ("fs: dlm: parallelize lowcomms socket handling")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
7a931477bf fs: dlm: clear pending bit when queue was empty
This patch clears the DLM_IFL_CB_PENDING_BIT flag which will be set when
there is callback work queued when there was no callback to dequeue. It
is a buggy case and should never happen, that's why there is a
WARN_ON(). However if the case happens we are prepared to somehow
recover from it.

Cc: stable@vger.kernel.org
Fixes: 61bed0baa4 ("fs: dlm: use a non-static queue for callbacks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Alexander Aring
c6b6d6dcc7 fs: dlm: revert check required context while close
This patch reverts commit 2c3fa6ae4d ("dlm: check required context
while close"). The function dlm_midcomms_close(), which will call later
dlm_lowcomms_close(), is called when the cluster manager tells the node
got fenced which means on midcomms/lowcomms layer to disconnect the node
from the cluster communication. The node can rejoin the cluster later.
This patch was ensuring no new message were able to be triggered when we
are in the close() function context. This was done by checking if the
lockspace has been stopped. However there is a missing check that we
only need to check specific lockspaces where the fenced node is member
of. This is currently complicated because there is no way to easily
check if a node is part of a specific lockspace without stopping the
recovery. For now we just revert this commit as it is just a check to
finding possible leaks of stopping lockspaces before close() is called.

Cc: stable@vger.kernel.org
Fixes: 2c3fa6ae4d ("dlm: check required context while close")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-06-14 10:17:33 -05:00
Shyam Prasad N
e4645cc2f1 cifs: add a warning when the in-flight count goes negative
We've seen the in-flight count go into negative with some
internal stress testing in Microsoft.

Adding a WARN when this happens, in hope of understanding
why this happens when it happens.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:05 -05:00
Steve French
c774e6779f cifs: fix lease break oops in xfstest generic/098
umount can race with lease break so need to check if
tcon->ses->server is still valid to send the lease
break response.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Fixes: 59a556aebc ("SMB3: drop reference to cfile before sending oplock break")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-14 10:15:01 -05:00
David Howells
d44c404207 block: Fix dio_cleanup() to advance the head index
Fix dio_bio_cleanup() to advance the head index into the list of pages past
the pages it has released, as __blockdev_direct_IO() will call it twice if
do_direct_IO() fails.

The issue was causing:

        WARNING: CPU: 6 PID: 2220 at mm/gup.c:76 try_get_folio

This can be triggered by setting up a clean pair of UDF filesystems on
loopback devices and running the generic/451 xfstest with them as the
scratch and test partitions.  Something like the following:

    fallocate /mnt2/udf_scratch -l 1G
    fallocate /mnt2/udf_test -l 1G
    mknod /dev/lo0 b 7 0
    mknod /dev/lo1 b 7 1
    losetup lo0 /mnt2/udf_scratch
    losetup lo1 /mnt2/udf_test
    mkfs -t udf /dev/lo0
    mkfs -t udf /dev/lo1
    cd xfstests
    ./check generic/451

with xfstests configured by putting the following into local.config:

    export FSTYP=udf
    export DISABLE_UDF_TEST=1
    export TEST_DEV=/dev/lo1
    export TEST_DIR=/xfstest.test
    export SCRATCH_DEV=/dev/lo0
    export SCRATCH_MNT=/xfstest.scratch

Fixes: 1ccf164ec8 ("block: Use iov_iter_extract_pages() and page pinning in direct-io.c")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306120931.a9606b88-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1193485.1686693279@warthog.procyon.org.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 06:58:18 -06:00
Christoph Hellwig
8812387d05 zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for zonefs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2023-06-14 08:51:18 +09:00
Damien Le Moal
16d7fd3cfa zonefs: use iomap for synchronous direct writes
Remove the function zonefs_file_dio_append() that is used to manually
issue REQ_OP_ZONE_APPEND BIOs for processing synchronous direct writes
and use iomap instead.

To preserve the use of zone append operations for synchronous writes,
different struct iomap_dio_ops are defined. For synchronous direct
writes using zone append, zonefs_zone_append_dio_ops is introduced.
The submit_bio operation of this structure is defined as the function
zonefs_file_zone_append_dio_submit_io() which is used to change the BIO
opreation for synchronous direct IO writes to REQ_OP_ZONE_APPEND.

In order to preserve the write location check on completion of zone
append BIOs, the end_io operation is also defined using the function
zonefs_file_zone_append_dio_bio_end_io(). This check now relies on the
zonefs_zone_append_bio structure, allocated together with zone append
BIOs with a dedicated BIO set. This structure include the target inode
of a zone append BIO as well as the target append offset location for
the zone append operation. This is used to perform a check against
bio->bi_iter.bi_sector when the BIO completes, without needing to use
the zone information z_wpoffset field, thus removing the need for
taking the inode truncate mutex.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
2023-06-14 08:51:05 +09:00
Long Li
c3b880acad xfs: fix ag count overflow during growfs
I found a corruption during growfs:

 XFS (loop0): Internal error agbno >= mp->m_sb.sb_agblocks at line 3661 of
   file fs/xfs/libxfs/xfs_alloc.c.  Caller __xfs_free_extent+0x28e/0x3c0
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_corruption_error+0x134/0x150
  __xfs_free_extent+0x2c1/0x3c0
  xfs_ag_extend_space+0x291/0x3e0
  xfs_growfs_data+0xd72/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 XFS (loop0): Corruption detected. Unmount and run xfs_repair
 XFS (loop0): Internal error xfs_trans_cancel at line 1097 of file
   fs/xfs/xfs_trans.c.  Caller xfs_growfs_data+0x691/0xe90
 CPU: 0 PID: 573 Comm: xfs_growfs Not tainted 6.3.0-rc7-next-20230420-00001-gda8c95746257
 Call Trace:
  <TASK>
  dump_stack_lvl+0x50/0x70
  xfs_error_report+0x93/0xc0
  xfs_trans_cancel+0x2c0/0x350
  xfs_growfs_data+0x691/0xe90
  xfs_file_ioctl+0x5f9/0x14a0
  __x64_sys_ioctl+0x13e/0x1c0
  do_syscall_64+0x39/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd
 RIP: 0033:0x7f2d86706577

The bug can be reproduced with the following sequence:

 # truncate -s  1073741824 xfs_test.img
 # mkfs.xfs -f -b size=1024 -d agcount=4 xfs_test.img
 # truncate -s 2305843009213693952  xfs_test.img
 # mount -o loop xfs_test.img /mnt/test
 # xfs_growfs -D  1125899907891200  /mnt/test

The root cause is that during growfs, user space passed in a large value
of newblcoks to xfs_growfs_data_private(), due to current sb_agblocks is
too small, new AG count will exceed UINT_MAX. Because of AG number type
is unsigned int and it would overflow, that caused nagcount much smaller
than the actual value. During AG extent space, delta blocks in
xfs_resizefs_init_new_ags() will much larger than the actual value due to
incorrect nagcount, even exceed UINT_MAX. This will cause corruption and
be detected in __xfs_free_extent. Fix it by growing the filesystem to up
to the maximally allowed AGs and not return EINVAL when new AG count
overflow.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-13 08:49:20 -07:00
Andreas Gruenbacher
cea44032bc gfs2: retry interrupted internal reads
The iomap-based read operations done by gfs2 for its system files, such
as rindex, may sometimes be interrupted and return -EINTR.   This
confuses some users of gfs2_internal_read().  Fix that by retrying
interrupted reads.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-13 16:51:25 +02:00
Tuo Li
6fa0a72cbb gfs2: Fix possible data races in gfs2_show_options()
Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
without holding the lock gt_spin in gfs2_show_options():

  val = sdp->sd_tune.gt_logd_secs;
  if (val != 30)
    seq_printf(s, ",commit=%d", val);

And thus can cause data races when gfs2_show_options() and other functions
such as gfs2_reconfigure() are concurrently executed:

  spin_lock(&gt->gt_spin);
  gt->gt_logd_secs = newargs->ar_commit;

To fix these possible data races, the lock sdp->sd_tune.gt_spin is
acquired before accessing the fields of gfs2_tune and released after these
accesses.

Further changes by Andreas:

- Don't hold the spin lock over the seq_printf operations.

Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-13 14:34:17 +02:00
Jan Kara
404615d7f1 ext2: Drop fragment support
Ext2 has fields in superblock reserved for subblock allocation support.
However that never landed. Drop the many years dead code.

Reported-by: syzbot+af5e10f73dbff48f70af@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
2023-06-13 12:37:31 +02:00
Christoph Hellwig
b294349993 xfs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.
Do that for xfs so that noop_direct_IO can eventually be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-12 18:09:23 -07:00
Darrick J. Wong
61d7e8274c xfs: drop EXPERIMENTAL tag for large extent counts
This feature has been baking in upstream for ~10mo with no bug reports.
It seems to work fine here, let's get rid of the scary warnings?

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-06-12 18:09:04 -07:00
Darrick J. Wong
06f3ef6e17 xfs: don't deplete the reserve pool when trying to shrink the fs
Every now and then, xfs/168 fails with this logged in dmesg:

Reserve blocks depleted! Consider increasing reserve pool size.
EXPERIMENTAL online shrink feature in use. Use at your own risk!
Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
Per-AG reservation for AG 1 failed.  Filesystem may run out of space.
Error -28 reserving per-AG metadata reserve pool.
Corruption of in-memory data (0x8) detected at xfs_ag_shrink_space+0x23c/0x3b0 [xfs] (fs/xfs/libxfs/xfs_ag.c:1007).  Shutting down filesystem.

It's silly to deplete the reserved blocks pool just to shrink the
filesystem, particularly since the fs goes down after that.

Fixes: fb2fc17201 ("xfs: support shrinking unused space in the last AG")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-06-12 18:09:04 -07:00
Qu Wenruo
745806fb45 btrfs: do not ASSERT() on duplicated global roots
[BUG]
Syzbot reports a reproducible ASSERT() when using rescue=usebackuproot
mount option on a corrupted fs.

The full report can be found here:
https://syzkaller.appspot.com/bug?extid=c4614eae20a166c25bf0

  BTRFS error (device loop0: state C): failed to load root csum
  assertion failed: !tmp, in fs/btrfs/disk-io.c:1103
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3664!
  invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  CPU: 1 PID: 3608 Comm: syz-executor356 Not tainted 6.0.0-rc7-syzkaller-00029-g3800a713b607 #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
  RIP: 0010:assertfail+0x1a/0x1c fs/btrfs/ctree.h:3663
  RSP: 0018:ffffc90003aaf250 EFLAGS: 00010246
  RAX: 0000000000000032 RBX: 0000000000000000 RCX: f21c13f886638400
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffff888021c640a0 R08: ffffffff816bd38d R09: ffffed10173667f1
  R10: ffffed10173667f1 R11: 1ffff110173667f0 R12: dffffc0000000000
  R13: ffff8880229c21f7 R14: ffff888021c64060 R15: ffff8880226c0000
  FS:  0000555556a73300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000055a2637d7a00 CR3: 00000000709c4000 CR4: 00000000003506e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   btrfs_global_root_insert+0x1a7/0x1b0 fs/btrfs/disk-io.c:1103
   load_global_roots_objectid+0x482/0x8c0 fs/btrfs/disk-io.c:2467
   load_global_roots fs/btrfs/disk-io.c:2501 [inline]
   btrfs_read_roots fs/btrfs/disk-io.c:2528 [inline]
   init_tree_roots+0xccb/0x203c fs/btrfs/disk-io.c:2939
   open_ctree+0x1e53/0x33df fs/btrfs/disk-io.c:3574
   btrfs_fill_super+0x1c6/0x2d0 fs/btrfs/super.c:1456
   btrfs_mount_root+0x885/0x9a0 fs/btrfs/super.c:1824
   legacy_get_tree+0xea/0x180 fs/fs_context.c:610
   vfs_get_tree+0x88/0x270 fs/super.c:1530
   fc_mount fs/namespace.c:1043 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1073
   btrfs_mount+0x3d3/0xbb0 fs/btrfs/super.c:1884

[CAUSE]
Since the introduction of global roots, we handle
csum/extent/free-space-tree roots as global roots, even if no
extent-tree-v2 feature is enabled.

So for regular csum/extent/fst roots, we load them into
fs_info::global_root_tree rb tree.

And we should not expect any conflicts in that rb tree, thus we have an
ASSERT() inside btrfs_global_root_insert().

But rescue=usebackuproot can break the assumption, as we will try to
load those trees again and again as long as we have bad roots and have
backup roots slot remaining.

So in that case we can have conflicting roots in the rb tree, and
triggering the ASSERT() crash.

[FIX]
We can safely remove that ASSERT(), as the caller will properly put the
offending root.

To make further debugging easier, also add two explicit error messages:

- Error message for conflicting global roots
- Error message when using backup roots slot

Reported-by: syzbot+a694851c6ab28cbcfb9c@syzkaller.appspotmail.com
Fixes: abed4aaae4 ("btrfs: track the csum, extent, and free space trees in a rb tree")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 01:21:16 +02:00
Linus Torvalds
fb054096ae 19 hotfixes. 14 are cc:stable and the remainder address issues which were
introduced during this -rc cycle or which were considered inappropriate
 for a backport.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZIdw7QAKCRDdBJ7gKXxA
 jki4AQCygi1UoqVPq4N/NzJbv2GaNDXNmcJIoLvPpp3MYFhucAEAtQNzAYO9z6CT
 iLDMosnuh+1KLTaKNGL5iak3NAxnxQw=
 =mTdI
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "19 hotfixes. 14 are cc:stable and the remainder address issues which
  were introduced during this development cycle or which were considered
  inappropriate for a backport"

* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  zswap: do not shrink if cgroup may not zswap
  page cache: fix page_cache_next/prev_miss off by one
  ocfs2: check new file size on fallocate call
  mailmap: add entry for John Keeping
  mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
  epoll: ep_autoremove_wake_function should use list_del_init_careful
  mm/gup_test: fix ioctl fail for compat task
  nilfs2: reject devices with insufficient block count
  ocfs2: fix use-after-free when unmounting read-only filesystem
  lib/test_vmalloc.c: avoid garbage in page array
  nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
  riscv/purgatory: remove PGO flags
  powerpc/purgatory: remove PGO flags
  x86/purgatory: remove PGO flags
  kexec: support purgatories with .text.hot sections
  mm/uffd: allow vma to merge as much as possible
  mm/uffd: fix vma operation where start addr cuts part of vma
  radix-tree: move declarations to header
  nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12 16:14:34 -07:00
Chris Mason
deccae40e4 btrfs: can_nocow_file_extent should pass down args->strict from callers
Commit 619104ba45 ("btrfs: move common NOCOW checks against a file
extent into a helper") changed our call to btrfs_cross_ref_exist() to
always pass false for the 'strict' parameter.  We're passing this down
through the stack so that we can do a full check for cross references
during swapfile activation.

With strict always false, this test fails:

  btrfs subvol create swappy
  chattr +C swappy
  fallocate -l1G swappy/swapfile
  chmod 600 swappy/swapfile
  mkswap swappy/swapfile

  btrfs subvol snap swappy swapsnap
  btrfs subvol del -C swapsnap

  btrfs fi sync /
  sync;sync;sync

  swapon swappy/swapfile

The fix is to just use args->strict, and everyone except swapfile
activation is passing false.

Fixes: 619104ba45 ("btrfs: move common NOCOW checks against a file extent into a helper")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:08 +02:00
Christoph Hellwig
7833b86595 btrfs: fix iomap_begin length for nocow writes
can_nocow_extent can reduce the len passed in, which needs to be
propagated to btrfs_dio_iomap_begin so that iomap does not submit
more data then is mapped.

This problems exists since the btrfs_get_blocks_direct helper was added
in commit c5794e5178 ("btrfs: Factor out write portion of
btrfs_get_blocks_direct"), but the ordered_extent splitting added in
commit b73a6fd1b1 ("btrfs: split partial dio bios before submit")
added a WARN_ON that made a syzkaller test fail.

Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Fixes: c5794e5178 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
CC: stable@vger.kernel.org # 6.1+
Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-13 00:01:00 +02:00
Chao Yu
5079e1c0c8 f2fs: avoid dead loop in f2fs_issue_checkpoint()
generic/082 reports a bug as below:

__schedule+0x332/0xf60
schedule+0x6f/0xf0
schedule_timeout+0x23b/0x2a0
wait_for_completion+0x8f/0x140
f2fs_issue_checkpoint+0xfe/0x1b0
f2fs_sync_fs+0x9d/0xb0
sync_filesystem+0x87/0xb0
dquot_load_quota_sb+0x41b/0x460
dquot_load_quota_inode+0xa5/0x130
dquot_quota_on+0x4b/0x60
f2fs_quota_on+0xe3/0x1b0
do_quotactl+0x483/0x700
__x64_sys_quotactl+0x15c/0x310
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

The root casue is race case as below:

Thread A			Kworker			IRQ
- write()
: write data to quota.user file

				- writepages
				 - f2fs_submit_page_write
				  - __is_cp_guaranteed return false
				  - inc_page_count(F2FS_WB_DATA)
				 - submit_bio
- quotactl(Q_QUOTAON)
 - f2fs_quota_on
  - dquot_quota_on
   - dquot_load_quota_inode
    - vfs_setup_quota_inode
    : inode->i_flags |= S_NOQUOTA
							- f2fs_write_end_io
							 - __is_cp_guaranteed return true
							 - dec_page_count(F2FS_WB_CP_DATA)
    - dquot_load_quota_sb
     - f2fs_sync_fs
      - f2fs_issue_checkpoint
       - do_checkpoint
        - f2fs_wait_on_all_pages(F2FS_WB_CP_DATA)
        : loop due to F2FS_WB_CP_DATA count is negative

Calling filemap_fdatawrite() and filemap_fdatawait() to keep all data
clean before quota file setup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:37:36 -07:00
Wu Bo
cadfc2f9f8 f2fs: fix args passed to trace_f2fs_lookup_end
The NULL return of 'd_splice_alias' dosen't mean error. Thus the
successful case will also return NULL, which makes the tracepoint always
print 'err=-ENOENT'.

And the different cases of 'new' & 'err' are list as following:
1) dentry exists: err(0) with new(NULL) --> dentry, err=0
2) dentry exists: err(0) with new(VALID) --> new, err=0
3) dentry exists: err(0) with new(ERR) --> dentry, err=ERR
4) no dentry exists: err(-ENOENT) with new(NULL) --> dentry, err=-ENOENT
5) no dentry exists: err(-ENOENT) with new(VALID) --> new, err=-ENOENT
6) no dentry exists: err(-ENOENT) with new(ERR) --> dentry, err=ERR

Signed-off-by: Wu Bo <bo.wu@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Yangtao Li
38b57833de f2fs: flag as supporting buffered async reads
The f2fs uses generic_file_buffered_read(), which supports buffered async
reads since commit 1a0a7853b9 ("mm: support async buffered reads in
generic_file_buffered_read()").

Let's enable it to match other file-systems. The read performance has been
greatly improved under io_uring:

    167M/s -> 234M/s, Increase ratio by 40%

Test w/:
    ./fio --name=onessd --filename=/data/test/local/io_uring_test
    --size=256M --rw=randread --bs=4k --direct=0 --overwrite=0
    --numjobs=1 --iodepth=1 --time_based=0 --runtime=10
    --ioengine=io_uring --registerfiles --fixedbufs
    --gtod_reduce=1 --group_reporting --sqthread_poll=1

Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chao Yu
20872584b8 f2fs: fix to drop all dirty meta/node pages during umount()
For cp error case, there will be dirty meta/node pages remained after
f2fs_write_checkpoint() in f2fs_put_super(), drop them explicitly, and
do sanity check on reference count of dirty pages and inflight IOs.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chunhai Guo
38a4a330c8 f2fs: Detect looped node chain efficiently
find_fsync_dnodes() detect the looped node chain by comparing the loop
counter with free blocks. While it may take tens of seconds to quit when
the free blocks are large enough. We can use Floyd's cycle detection
algorithm to make the detection more efficient.

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Daejun Park
25f9080576 f2fs: add async reset zone command support
This patch enables submit reset zone command asynchornously. It helps
decrease average latency of write IOs in high utilization scenario by
faster checkpointing.

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:09 -07:00
Chao Yu
901c12d144 f2fs: flush error flags in workqueue
In IRQ context, it wakes up workqueue to record errors into on-disk
superblock fields rather than in-memory fields.

Fixes: 1aa161e431 ("f2fs: fix scheduling while atomic in decompression path")
Fixes: 95fa90c9e5 ("f2fs: support recording errors into superblock")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
458c15dfbc f2fs: don't reset unchangable mount option in f2fs_remount()
syzbot reports a bug as below:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__lock_acquire+0x69/0x2000 kernel/locking/lockdep.c:4942
Call Trace:
 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5691
 __raw_write_lock include/linux/rwlock_api_smp.h:209 [inline]
 _raw_write_lock+0x2e/0x40 kernel/locking/spinlock.c:300
 __drop_extent_tree+0x3ac/0x660 fs/f2fs/extent_cache.c:1100
 f2fs_drop_extent_tree+0x17/0x30 fs/f2fs/extent_cache.c:1116
 f2fs_insert_range+0x2d5/0x3c0 fs/f2fs/file.c:1664
 f2fs_fallocate+0x4e4/0x6d0 fs/f2fs/file.c:1838
 vfs_fallocate+0x54b/0x6b0 fs/open.c:324
 ksys_fallocate fs/open.c:347 [inline]
 __do_sys_fallocate fs/open.c:355 [inline]
 __se_sys_fallocate fs/open.c:353 [inline]
 __x64_sys_fallocate+0xbd/0x100 fs/open.c:353
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is race condition as below:
- since it tries to remount rw filesystem, so that do_remount won't
call sb_prepare_remount_readonly to block fallocate, there may be race
condition in between remount and fallocate.
- in f2fs_remount(), default_options() will reset mount option to default
one, and then update it based on result of parse_options(), so there is
a hole which race condition can happen.

Thread A			Thread B
- f2fs_fill_super
 - parse_options
  - clear_opt(READ_EXTENT_CACHE)

- f2fs_remount
 - default_options
  - set_opt(READ_EXTENT_CACHE)
				- f2fs_fallocate
				 - f2fs_insert_range
				  - f2fs_drop_extent_tree
				   - __drop_extent_tree
				    - __may_extent_tree
				     - test_opt(READ_EXTENT_CACHE) return true
				    - write_lock(&et->lock) access NULL pointer
 - parse_options
  - clear_opt(READ_EXTENT_CACHE)

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+d015b6c2fbb5c383bf08@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/20230522124203.3838360-1-chao@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
d8189834d4 f2fs: fix to avoid NULL pointer dereference f2fs_write_end_io()
butt3rflyh4ck reports a bug as below:

When a thread always calls F2FS_IOC_RESIZE_FS to resize fs, if resize fs is
failed, f2fs kernel thread would invoke callback function to update f2fs io
info, it would call  f2fs_write_end_io and may trigger null-ptr-deref in
NODE_MAPPING.

general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline]
RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370
 <TASK>
 bio_endio+0x5af/0x6c0 block/bio.c:1608
 req_bio_endio block/blk-mq.c:761 [inline]
 blk_update_request+0x5cc/0x1690 block/blk-mq.c:906
 blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023
 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101
 __do_softirq+0x1d4/0x8ef kernel/softirq.c:571
 run_ksoftirqd kernel/softirq.c:939 [inline]
 run_ksoftirqd+0x31/0x60 kernel/softirq.c:931
 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
 kthread+0x33e/0x440 kernel/kthread.c:379
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

The root cause is below race case can cause leaving dirty metadata
in f2fs after filesystem is remount as ro:

Thread A				Thread B
- f2fs_ioc_resize_fs
 - f2fs_readonly   --- return false
 - f2fs_resize_fs
					- f2fs_remount
					 - write_checkpoint
					 - set f2fs as ro
  - free_segment_range
   - update meta_inode's data

Then, if f2fs_put_super()  fails to write_checkpoint due to readonly
status, and meta_inode's dirty data will be writebacked after node_inode
is put, finally, f2fs_write_end_io will access NULL pointer on
sbi->node_inode.

Thread A				IRQ context
- f2fs_put_super
 - write_checkpoint fails
 - iput(node_inode)
 - node_inode = NULL
 - iput(meta_inode)
  - write_inode_now
   - f2fs_write_meta_page
					- f2fs_write_end_io
					 - NODE_MAPPING(sbi)
					 : access NULL pointer on node_inode

Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
bfd4766239 f2fs: clean up w/ sbi->log_sectors_per_block
Use sbi->log_sectors_per_block to clean up below calculated one:

unsigned int log_sectors_per_block = sbi->log_blocksize - SECTOR_SHIFT;

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
90b7c4b748 f2fs: fix to set noatime and immutable flag for quota file
We should set noatime bit for quota files, since no one cares about
atime of quota file, and we should set immutalbe bit as well, due to
nobody should write to the file through exported interfaces.

Meanwhile this patch use inode_lock to avoid race condition during
inode->i_flags, f2fs_inode->i_flags update.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:08 -07:00
Chao Yu
77e820ea73 f2fs: renew value of F2FS_FEATURE_*
Define F2FS_FEATURE_* macro w/ 32-bits value rather than 16-bits value.

No logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Chao Yu
478d7100f4 f2fs: renew value of F2FS_MOUNT_*
Then we can just define newly introduced mount option w/ lasted
free number rather than random free one.

Just cleanup, no logic changes.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Chao Yu
f082c6b205 f2fs: fix potential deadlock due to unpaired node_write lock use
If S_NOQUOTA is cleared from inode during data page writeback of quota
file, it may miss to unlock node_write lock, result in potential
deadlock, fix to use the lock in paired.

Kworker					Thread
- writepage
 if (IS_NOQUOTA())
   f2fs_down_read(&sbi->node_write);
					- vfs_cleanup_quota_inode
					 - inode->i_flags &= ~S_NOQUOTA;
 if (IS_NOQUOTA())
   f2fs_up_read(&sbi->node_write);

Fixes: 79963d967b ("f2fs: shrink node_write lock coverage")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Yonggil Song
36ded4c106 f2fs: Fix over-estimating free section during FG GC
There was a bug that finishing FG GC unconditionally because free sections
are over-estimated after checkpoint in FG GC.
This patch initializes sec_freed by every checkpoint in FG GC.

Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Daeho Jeong
04abeb699d f2fs: close unused open zones while mounting
Zoned UFS allows only 6 open zones at the same time, so we need to take
care of the count of open zones while mounting.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-06-12 13:04:07 -07:00
Mickaël Salaün
74ce793bcb
hostfs: Fix ephemeral inodes
hostfs creates a new inode for each opened or created file, which
created useless inode allocations and forbade identifying a host file
with a kernel inode.

Fix this uncommon filesystem behavior by tying kernel inodes to host
file's inode and device IDs.  Even if the host filesystem inodes may be
recycled, this cannot happen while a file referencing it is opened,
which is the case with hostfs.  It should be noted that hostfs inode IDs
may not be unique for the same hostfs superblock because multiple host's
(backed) superblocks may be used.

Delete inodes when dropping them to force backed host's file descriptors
closing.

This enables to entirely remove ARCH_EPHEMERAL_INODES, and then makes
Landlock fully supported by UML.  This is very useful for testing
changes.

These changes also factor out and simplify some helpers thanks to the
new hostfs_inode_update() and the hostfs_iget() revamp: read_name(),
hostfs_create(), hostfs_lookup(), hostfs_mknod(), and
hostfs_fill_sb_common().

A following commit with new Landlock tests check this new hostfs inode
consistency.

Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20230612191430.339153-2-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
2023-06-12 21:26:19 +02:00
Luís Henriques
26a6ffff7d ocfs2: check new file size on fallocate call
When changing a file size with fallocate() the new size isn't being
checked.  In particular, the FSIZE ulimit isn't being checked, which makes
fstest generic/228 fail.  Simply adding a call to inode_newsize_ok() fixes
this issue.

Link: https://lkml.kernel.org/r/20230529152645.32680-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Mark Fasheh <mark@fasheh.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Benjamin Segall
2192bba03d epoll: ep_autoremove_wake_function should use list_del_init_careful
autoremove_wake_function uses list_del_init_careful, so should epoll's
more aggressive variant.  It only doesn't because it was copied from an
older wait.c rather than the most recent.

[bsegall@google.com: add comment]
  Link: https://lkml.kernel.org/r/xm26bki0ulsr.fsf_-_@google.com
Link: https://lkml.kernel.org/r/xm26pm6hvfer.fsf@google.com
Fixes: a16ceb1396 ("epoll: autoremove wakers even more aggressively")
Signed-off-by: Ben Segall <bsegall@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:52 -07:00
Ryusuke Konishi
92c5d1b860 nilfs2: reject devices with insufficient block count
The current sanity check for nilfs2 geometry information lacks checks for
the number of segments stored in superblocks, so even for device images
that have been destructively truncated or have an unusually high number of
segments, the mount operation may succeed.

This causes out-of-bounds block I/O on file system block reads or log
writes to the segments, the latter in particular causing
"a_ops->writepages" to repeatedly fail, resulting in sync_inodes_sb() to
hang.

Fix this issue by checking the number of segments stored in the superblock
and avoiding mounting devices that can cause out-of-bounds accesses.  To
eliminate the possibility of overflow when calculating the number of
blocks required for the device from the number of segments, this also adds
a helper function to calculate the upper bound on the number of segments
and inserts a check using it.

Link: https://lkml.kernel.org/r/20230526021332.3431-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+7d50f1e54a12ba3aeae2@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=7d50f1e54a12ba3aeae2
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Luís Henriques
50d927880e ocfs2: fix use-after-free when unmounting read-only filesystem
It's trivial to trigger a use-after-free bug in the ocfs2 quotas code using
fstest generic/452.  After a read-only remount, quotas are suspended and
ocfs2_mem_dqinfo is freed through ->ocfs2_local_free_info().  When unmounting
the filesystem, an UAF access to the oinfo will eventually cause a crash.
 
BUG: KASAN: slab-use-after-free in timer_delete+0x54/0xc0
Read of size 8 at addr ffff8880389a8208 by task umount/669
...
Call Trace:
 <TASK>
 ...
 timer_delete+0x54/0xc0
 try_to_grab_pending+0x31/0x230
 __cancel_work_timer+0x6c/0x270
 ocfs2_disable_quotas.isra.0+0x3e/0xf0 [ocfs2]
 ocfs2_dismount_volume+0xdd/0x450 [ocfs2]
 generic_shutdown_super+0xaa/0x280
 kill_block_super+0x46/0x70
 deactivate_locked_super+0x4d/0xb0
 cleanup_mnt+0x135/0x1f0
 ...
 </TASK>

Allocated by task 632:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 __kasan_kmalloc+0x8b/0x90
 ocfs2_local_read_info+0xe3/0x9a0 [ocfs2]
 dquot_load_quota_sb+0x34b/0x680
 dquot_load_quota_inode+0xfe/0x1a0
 ocfs2_enable_quotas+0x190/0x2f0 [ocfs2]
 ocfs2_fill_super+0x14ef/0x2120 [ocfs2]
 mount_bdev+0x1be/0x200
 legacy_get_tree+0x6c/0xb0
 vfs_get_tree+0x3e/0x110
 path_mount+0xa90/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 650:
 kasan_save_stack+0x1c/0x40
 kasan_set_track+0x21/0x30
 kasan_save_free_info+0x2a/0x50
 __kasan_slab_free+0xf9/0x150
 __kmem_cache_free+0x89/0x180
 ocfs2_local_free_info+0x2ba/0x3f0 [ocfs2]
 dquot_disable+0x35f/0xa70
 ocfs2_susp_quotas.isra.0+0x159/0x1a0 [ocfs2]
 ocfs2_remount+0x150/0x580 [ocfs2]
 reconfigure_super+0x1a5/0x3a0
 path_mount+0xc8a/0xe10
 __x64_sys_mount+0x16f/0x1a0
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Link: https://lkml.kernel.org/r/20230522102112.9031-1-lhenriques@suse.de
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Ryusuke Konishi
fee5eaecca nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
Syzbot reports that in its stress test for resize ioctl, the log writing
function nilfs_segctor_do_construct hits a WARN_ON in
nilfs_segctor_truncate_segments().

It turned out that there is a problem with the current implementation of
the resize ioctl, which changes the writable range on the device (the
range of allocatable segments) at the end of the resize process.

This order is necessary for file system expansion to avoid corrupting the
superblock at trailing edge.  However, in the case of a file system
shrink, if log writes occur after truncating out-of-bounds trailing
segments and before the resize is complete, segments may be allocated from
the truncated space.

The userspace resize tool was fine as it limits the range of allocatable
segments before performing the resize, but it can run into this issue if
the resize ioctl is called alone.

Fix this issue by changing nilfs_sufile_resize() to update the range of
allocatable segments immediately after successful truncation of segment
space in case of file system shrink.

Link: https://lkml.kernel.org/r/20230524094348.3784-1-konishi.ryusuke@gmail.com
Fixes: 4e33f9eab0 ("nilfs2: implement resize ioctl")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+33494cd0df2ec2931851@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/0000000000005434c405fbbafdc5@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:51 -07:00
Peter Xu
5543d3c43c mm/uffd: allow vma to merge as much as possible
We used to not pass in the pgoff correctly when register/unregister uffd
regions, it caused incorrect behavior on vma merging and can cause
mergeable vmas being separate after ioctls return.

For example, when we have:

  vma1(range 0-9, with uffd), vma2(range 10-19, no uffd)

Then someone unregisters uffd on range (5-9), it should logically become:

  vma1(range 0-4, with uffd), vma2(range 5-19, no uffd)

But with current code we'll have:

  vma1(range 0-4, with uffd), vma3(range 5-9, no uffd), vma2(range 10-19, no uffd)

This patch allows such merge to happen correctly before ioctl returns.

This behavior seems to have existed since the 1st day of uffd.  Since
pgoff for vma_merge() is only used to identify the possibility of vma
merging, meanwhile here what we did was always passing in a pgoff smaller
than what we should, so there should have no other side effect besides not
merging it.  Let's still tentatively copy stable for this, even though I
don't see anything will go wrong besides vma being split (which is mostly
not user visible).

Link: https://lkml.kernel.org/r/20230517190916.3429499-3-peterx@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Peter Xu
270aa01062 mm/uffd: fix vma operation where start addr cuts part of vma
Patch series "mm/uffd: Fix vma merge/split", v2.

This series contains two patches that fix vma merge/split for userfaultfd
on two separate issues.

Patch 1 fixes a regression since 6.1+ due to something we overlooked when
converting to maple tree apis.  The plan is we use patch 1 to replace the
commit "2f628010799e (mm: userfaultfd: avoid passing an invalid range to
vma_merge())" in mm-hostfixes-unstable tree if possible, so as to bring
uffd vma operations back aligned with the rest code again.

Patch 2 fixes a long standing issue that vma can be left unmerged even if
we can for either uffd register or unregister.

Many thanks to Lorenzo on either noticing this issue from the assert
movement patch, looking at this problem, and also provided a reproducer on
the unmerged vma issue [1].

[1] https://gist.github.com/lorenzo-stoakes/a11a10f5f479e7a977fc456331266e0e


This patch (of 2):

It seems vma merging with uffd paths is broken with either
register/unregister, where right now we can feed wrong parameters to
vma_merge() and it's found by recent patch which moved asserts upwards in
vma_merge() by Lorenzo Stoakes:

https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/

It's possible that "start" is contained within vma but not clamped to its
start.  We need to convert this into either "cannot merge" case or "can
merge" case 4 which permits subdivision of prev by assigning vma to prev. 
As we loop, each subsequent VMA will be clamped to the start.

This patch will eliminate the report and make sure vma_merge() calls will
become legal again.

One thing to mention is that the "Fixes: 29417d292bd0" below is there only
to help explain where the warning can start to trigger, the real commit to
fix should be 69dbe6daf1.  Commit 29417d292b helps us to identify the
issue, but unfortunately we may want to keep it in Fixes too just to ease
kernel backporters for easier tracking.

Link: https://lkml.kernel.org/r/20230517190916.3429499-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20230517190916.3429499-2-peterx@redhat.com
Fixes: 69dbe6daf1 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Closes: https://lore.kernel.org/all/ZFunF7DmMdK05MoF@FVFF77S0Q05N.cambridge.arm.com/
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:50 -07:00
Ryusuke Konishi
2f012f2bac nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
A syzbot fault injection test reported that nilfs_btnode_create_block, a
helper function that allocates a new node block for b-trees, causes a
kernel BUG for disk images where the file system block size is smaller
than the page size.

This was due to unexpected flags on the newly allocated buffer head, and
it turned out to be because the buffer flags were not cleared by
nilfs_btnode_abort_change_key() after an error occurred during a b-tree
update operation and the buffer was later reused in that state.

Fix this issue by using nilfs_btnode_delete() to abandon the unused
preallocated buffer in nilfs_btnode_abort_change_key().

Link: https://lkml.kernel.org/r/20230513102428.10223-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+b0a35a5c1f7e846d3b09@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d1d6c205ebc4d512@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-12 11:31:49 -07:00
Bob Peterson
c8ed1b3593 gfs2: Fix duplicate should_fault_in_pages() call
In gfs2_file_buffered_write(), we currently jump from the second call of
function should_fault_in_pages() to above the first call, so
should_fault_in_pages() is getting called twice in a row, causing it to
accidentally fall back to single-page writes rather than trying the more
efficient multi-page writes first.

Fix that by moving the retry label to the correct place, behind the
first call to should_fault_in_pages().

Fixes: e1fa9ea85c ("gfs2: Stop using glock holder auto-demotion for now")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-12 20:04:35 +02:00
Linus Torvalds
ace9e12da2 for-6.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSHScAACgkQxWXV+ddt
 WDvoLA/8CxGfC9i/zO2odxbV1id8JiubGyi2Q28ANE3ygwRBI2dh7u2TBTv9aKPF
 Bzm6VsafG2OwMuwu08jO3t98+QrxU9vb6YCzCPL4t+8IDLJhwpz6zdH/Lvl3RnyV
 nz+aKHi2vfTRKt1Cf4uB5dVzPM3QVHYi3vidt15Suf2nhKnXimu0FVGXabQfd44z
 cCE4ep8IkLshcrsEOwVQj44isRXztJza3D6P7zPfu0NB5Bue7VJNBI4JoGOAT8UQ
 8c+V1U6EbMARWcdbk4Vm34IoAAxcQW6MNnHG83+ie2OpuKJ9g7oNXMTPL73gntNr
 DtC38Vr8gbpXJFmqOCwD8+9f3jP2pX6LjJT0IR6eGJbCleWd6JPlvnfJ+QHdb/vE
 LblDjH84O0Js+0iPKOSKzglfrKZPYDEnIBUwbZQICj/8+aHPU1Y4eTRcv52bVnpa
 1umdz19Sjh0HjuX4k44E/fLgGnLw+ezxhe6WQ7RdDrnr4+9tXpz0z/ZsatIgl1Pc
 wfS5Y2XBIdzKBIF8FxAEL3xCXd6byOsMMhSRu6J7W8Tgw5dnvKiQLRCK+FIpBRru
 WZ7vrNKz67marmqcIp0Hpoipd5+ib6pAdZs69GAvk4bWvVoLZ0Vuyb3lQr5fg6Vm
 Xn1iwcYoWjlAYrpVW31dlaVCfoewm96qbzNa3XqA87I/6frGFcc=
 =ABpK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A  more fixes and regression fixes:

   - in subpage mode, fix crash when repairing metadata at the end of
     a stripe

   - properly enable async discard when remounting from read-only to
     read-write

   - scrub regression fixes:
      - respect read-only scrub when attempting to do a repair
      - fix reporting of found errors, the stats don't get properly
        accounted after a stripe repair"

* tag 'for-6.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: scrub: also report errors hit during the initial read
  btrfs: scrub: respect the read-only flag during repair
  btrfs: properly enable async discard when switching from RO->RW
  btrfs: subpage: fix a crash in metadata repair path
2023-06-12 10:53:35 -07:00
Dai Ngo
58f5d89400 NFSD: add encoding of op_recall flag for write delegation
Modified nfsd4_encode_open to encode the op_recall flag properly
for OPEN result with write delegation granted.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
2023-06-12 12:16:35 -04:00
NeilBrown
665e89ab7c lockd: drop inappropriate svc_get() from locked_get()
The below-mentioned patch was intended to simplify refcounting on the
svc_serv used by locked.  The goal was to only ever have a single
reference from the single thread.  To that end we dropped a call to
lockd_start_svc() (except when creating thread) which would take a
reference, and dropped the svc_put(serv) that would drop that reference.

Unfortunately we didn't also remove the svc_get() from
lockd_create_svc() in the case where the svc_serv already existed.
So after the patch:
 - on the first call the svc_serv was allocated and the one reference
   was given to the thread, so there are no extra references
 - on subsequent calls svc_get() was called so there is now an extra
   reference.
This is clearly not consistent.

The inconsistency is also clear in the current code in lockd_get()
takes *two* references, one on nlmsvc_serv and one by incrementing
nlmsvc_users.   This clearly does not match lockd_put().

So: drop that svc_get() from lockd_get() (which used to be in
lockd_create_svc().

Reported-by: Ido Schimmel <idosch@idosch.org>
Closes: https://lore.kernel.org/linux-nfs/ZHsI%2FH16VX9kJQX1@shredder/T/#u
Fixes: b73a297204 ("lockd: move lockd_start_svc() call into lockd_create_svc()")
Signed-off-by: NeilBrown <neilb@suse.de>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-12 12:16:34 -04:00
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
81b1fb7d17 fs: remove sb->s_mode
There is no real need to store the open mode in the super_block now.
It is only used by f2fs, which can easily recalculate it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
3f0b3e785e block: add a sb_open_mode helper
Add a helper to return the open flags for blkdev_get_by* for passed in
super block flags instead of open coding the logic in many places.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2ef789288a btrfs: don't pass a holder for non-exclusive blkdev_get_by_path
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead and remove the holder argument from the
call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls.

Exclusive mode for device scanning is not used since commit 50d281fc43
("btrfs: scan device in non-exclusive mode")".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
7b7b06d55a gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag"), file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.

Remove .direct_IO from gfs2_aops and set FMODE_CAN_ODIRECT in
gfs2_open_common for regular files that do not use data journalling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-12 12:18:55 +02:00
Amir Goldstein
7b8c9d7bb4 fsnotify: move fsnotify_open() hook into do_dentry_open()
fsnotify_open() hook is called only from high level system calls
context and not called for the very many helpers to open files.

This may makes sense for many of the special file open cases, but it is
inconsistent with fsnotify_close() hook that is called for every last
fput() of on a file object with FMODE_OPENED.

As a result, it is possible to observe ACCESS, MODIFY and CLOSE events
without ever observing an OPEN event.

Fix this inconsistency by replacing all the fsnotify_open() hooks with
a single hook inside do_dentry_open().

If there are special cases that would like to opt-out of the possible
overhead of fsnotify() call in fsnotify_open(), they would probably also
want to avoid the overhead of fsnotify() call in the rest of the fsnotify
hooks, so they should be opening that file with the __FMODE_NONOTIFY flag.

However, in the majority of those cases, the s_fsnotify_connectors
optimization in fsnotify_parent() would be sufficient to avoid the
overhead of fsnotify() call anyway.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230611122429.1499617-1-amir73il@gmail.com>
2023-06-12 10:43:45 +02:00
Shyam Prasad N
5e90aa21eb cifs: fix max_credits implementation
The current implementation of max_credits on the client does
not work because the CreditRequest logic for several commands
does not take max_credits into account.

Still, we can end up asking the server for more credits, depending
on the number of credits in flight. For this, we need to
limit the credits while parsing the responses too.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Shyam Prasad N
2991b77409 cifs: fix sockaddr comparison in iface_cmp
iface_cmp used to simply do a memcmp of the two
provided struct sockaddrs. The comparison needs to do more
based on the address family. Similar logic was already
present in cifs_match_ipaddr. Doing something similar now.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Enzo Matsumiya
50e63d6db6 smb/client: print "Unknown" instead of bogus link speed value
The virtio driver for Linux guests will not set a link speed to its
paravirtualized NICs.  This will be seen as -1 in the ethernet layer, and
when some servers (e.g. samba) fetches it, it's converted to an unsigned
value (and multiplied by 1000 * 1000), so in client side we end up with:

1)      Speed: 4294967295000000 bps

in DebugData.

This patch introduces a helper that returns a speed string (in Mbps or
Gbps) if interface speed is valid (>= SPEED_10 and <= SPEED_800000), or
"Unknown" otherwise.

The reason to not change the value in iface->speed is because we don't
know the real speed of the HW backing the server NIC, so let's keep
considering these as the fastest NICs available.

Also print "Capabilities: None" when the interface doesn't support any.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:49 -05:00
Shyam Prasad N
2a44b38966 cifs: print all credit counters in DebugData
Output of /proc/fs/cifs/DebugData shows only the per-connection
counter for the number of credits of regular type. i.e. the
credits reserved for echo and oplocks are not displayed.

There have been situations recently where having this info
would have been useful. This change prints the credit counters
of all three types: regular, echo, oplocks.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
Shyam Prasad N
91f4480c41 cifs: fix status checks in cifs_tree_connect
The ordering of status checks at the beginning of
cifs_tree_connect is wrong. As a result, a tcon
which is good may stay marked as needing reconnect
infinitely.

Fixes: 2f0e4f0342 ("cifs: check only tcon status on tcon related functions")
Cc: stable@vger.kernel.org # 6.3
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
鑫华
a5998a9ec3 smb: remove obsolete comment
Because do_gettimeofday has been removed and replaced by ktime_get_real_ts64,
So just remove the comment as it's not needed now.

Signed-off-by: 鑫华 <jixianghua@xfusion.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-11 20:52:48 -05:00
Jeff Layton
518f375c15 nfsd: don't provide pre/post-op attrs if fh_getattr fails
nfsd calls fh_getattr to get the latest inode attrs for pre/post-op
info. In the event that fh_getattr fails, it resorts to scraping cached
values out of the inode directly.

Since these attributes are optional, we can just skip providing them
altogether when this happens.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Neil Brown <neilb@suse.de>
2023-06-11 16:37:46 -04:00
Chuck Lever
df56b384de NFSD: Remove nfsd_readv()
nfsd_readv()'s consumers now use nfsd_iter_read().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:46 -04:00
Chuck Lever
703d752155 NFSD: Hoist rq_vec preparation into nfsd_read() [step two]
Now that the preparation of an rq_vec has been removed from the
generic read path, nfsd_splice_read() no longer needs to reset
rq_next_page.

nfsd4_encode_read() calls nfsd_splice_read() directly. As far as I
can ascertain, resetting rq_next_page for NFSv4 splice reads is
unnecessary because rq_next_page is already set correctly.

Moreover, resetting it might even be incorrect if previous
operations in the COMPOUND have already consumed at least a page of
the send buffer. I would expect that the result would be encoding
the READ payload over previously-encoded results.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:46 -04:00
Chuck Lever
507df40ebf NFSD: Hoist rq_vec preparation into nfsd_read()
Accrue the following benefits:

a) Deduplicate this common bit of code.

b) Don't prepare rq_vec for NFSv2 and NFSv3 spliced reads, which
   don't use rq_vec. This is already the case for
   nfsd4_encode_read().

c) Eventually, converting NFSD's read path to use a bvec iterator
   will be simpler.

In the next patch, nfsd_iter_read() will replace nfsd_readv() for
all NFS versions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:45 -04:00
Chuck Lever
ed4a567a17 NFSD: Update rq_next_page between COMPOUND operations
A GETATTR with a large result can advance xdr->page_ptr without
updating rq_next_page. If a splice READ follows that GETATTR in the
COMPOUND, nfsd_splice_actor can start splicing at the wrong page.

I've also seen READLINK and READDIR leave rq_next_page in an
unmodified state.

There are potentially a myriad of combinations like this, so play it
safe: move the rq_next_page update to nfsd4_encode_operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:45 -04:00
Chuck Lever
ba21e20b30 NFSD: Use svcxdr_encode_opaque_pages() in nfsd4_encode_splice_read()
Commit 15b23ef5d3 ("nfsd4: fix corruption of NFSv4 read data")
encountered exactly the same issue: after a splice read, a
filesystem-owned page is left in rq_pages[]; the symptoms are the
same as described there.

If the computed number of pages in nfsd4_encode_splice_read() is not
exactly the same as the actual number of pages that were consumed by
nfsd_splice_actor() (say, because of a bug) then hilarity ensues.

Instead of recomputing the page offset based on the size of the
payload, use rq_next_page, which is already properly updated by
nfsd_splice_actor(), to cause svc_rqst_release_pages() to operate
correctly in every instance.

This is a defensive change since we believe that after commit
27c934dd88 ("nfsd: don't replace page in rq_pages if it's a
continuation of last page") has been applied, there are no known
opportunities for nfsd_splice_actor() to screw up. So I'm not
marking it for stable backport.

Reported-by: Andy Zlotek <andy.zlotek@oracle.com>
Suggested-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-11 16:37:40 -04:00
Linus Torvalds
65d7ca5987 five smb3 server fixes, all also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmSFH04ACgkQiiy9cAdy
 T1HXbAv5AZmcFRMjMTY4Yk/4Csm9wFn11ieMtmtBPfzdHLt5SNnVL+QK69wUJEVb
 Z0Q/mcc6xJvgzda3W9k7jg0G3ZiWty9GJ8ezqlVNLUQdRuRVqnoS5FoF9d68/UUN
 t+AVz2mLb28wIZbx13OXQ4Eb46KgbI4fQN/mZ7c6BrbKHu6CxYg84SyXG8WaBgw3
 FYGpGGnF94O26kbAwa+9ZI51zrNOpHFtjNZtdq/c/uYhpSa9aE4uNDkI841A2v2W
 //l9CxbK8gwXObOFmXlsMOrT+z/4rigL7iWJaR/6fRvcswv6x1Nqd8xvLum9hzDl
 IRvXrshKUP1w7i5eAYUavzIqc11nlGM+EGUhYpkTmED6oxpXufyf5029JH5ZziSR
 0lsQ/1ZCI1/7h5fkZIxI7khwIhOBVitrRj4rylTQzlrTyZZUdR+O2ONlsqmP5+tX
 Iw+z0NHrg/4+Mt2SSMx67QXoT2RU8E0GxtLOk5u6Hq4KJjv8wC5NquA+EVnVovsK
 he0D1ibu
 =JIvi
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Five smb3 server fixes, all also for stable:

   - Fix four slab out of bounds warnings: improve checks for protocol
     id, and for small packet length, and for create context parsing,
     and for negotiate context parsing

   - Fix for incorrect dereferencing POSIX ACLs"

* tag '6.4-rc5-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: validate smb request protocol id
  ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop
  ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR()
  ksmbd: fix out-of-bound read in parse_lease_state()
  ksmbd: fix out-of-bound read in deassemble_neg_contexts()
2023-06-11 10:07:35 -07:00
Joseph Qi
69fe5c430c ocfs2: cleanup trace events
After commit 6dc8bc0fb3 ("ocfs2: switch to iter_file_splice_write()"),
ocfs2 has switched from ocfs2_file_splice_write() to
iter_file_splice_write(), so cleanup the corresponding trace event as
well.

Link: https://lkml.kernel.org/r/20230528132033.217664-2-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:23 -07:00
Joseph Qi
d32840ad4a ocfs2: correct return value of ocfs2_local_free_info()
Now in ocfs2_local_free_info(), it returns 0 even if it actually fails. 
Though it doesn't cause any real problem since the only caller
dquot_disable() ignores the return value, we'd better return correct as it
is.

Link: https://lkml.kernel.org/r/20230528132033.217664-1-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:22 -07:00
Vincent Whitchurch
e994f5b677 squashfs: cache partial compressed blocks
Before commit 93e72b3c61 ("squashfs: migrate from ll_rw_block
usage to BIO"), compressed blocks read by squashfs were cached in the page
cache, but that is not the case after that commit.  That has lead to
squashfs having to re-read a lot of sectors from disk/flash.

For example, the first sectors of every metadata block need to be read
twice from the disk.  Once partially to read the length, and a second time
to read the block itself.  Also, in linear reads of large files, the last
sectors of one data block are re-read from disk when reading the next data
block, since the compressed blocks are of variable sizes and not aligned
to device blocks.  This extra I/O results in a degrade in read performance
of, for example, ~16% in one scenario on my ARM platform using squashfs
with dm-verity and NAND.

Since the decompressed data is cached in the page cache or squashfs'
internal metadata and fragment caches, caching _all_ compressed pages
would lead to a lot of double caching and is undesirable.  But make the
code cache any disk blocks which were only partially requested, since
these are the ones likely to include data which is needed by other file
system blocks.  This restores read performance in my test scenario.

The compressed block caching is only applied when the disk block size is
equal to the page size, to avoid having to deal with caching sub-page
reads.

[akpm@linux-foundation.org: fs/squashfs/block.c needs linux/pagemap.h]
[vincent.whitchurch@axis.com: fix page update race]
  Link: https://lkml.kernel.org/r/20230526-squashfs-cache-fixup-v1-1-d54a7fa23e7b@axis.com
[vincent.whitchurch@axis.com: fix page indices]
  Link: https://lkml.kernel.org/r/20230526-squashfs-cache-fixup-v1-2-d54a7fa23e7b@axis.com
[akpm@linux-foundation.org: fix layout, per hch]
Link: https://lkml.kernel.org/r/20230510-squashfs-cache-v4-1-3bd394e1ee71@axis.com
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:14 -07:00
Christoph Hellwig
6b81459c9c squashfs: don't include buffer_head.h
Squashfs has stopped using buffers heads in 93e72b3c61
("squashfs: migrate from ll_rw_block usage to BIO").

Link: https://lkml.kernel.org/r/20230517071622.245151-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:14 -07:00
Azeem Shaikh
9e627588bf procfs: replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.  This read may exceed the
destination size limit.  This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated [1].  In an effort
to remove strlcpy() completely [2], replace strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Link: https://lkml.kernel.org/r/20230510212457.3491385-1-azeemshaikh38@gmail.com
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 17:44:13 -07:00
Christoph Hellwig
64d1b4dd82 fuse: use direct_write_fallback
Use the generic direct_write_fallback helper instead of duplicating the
logic.

Link: https://lkml.kernel.org/r/20230601145904.1385409-13-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
596df33d67 fuse: drop redundant arguments to fuse_perform_write
pos is always equal to iocb->ki_pos, and mapping is always equal to
iocb->ki_filp->f_mapping.

Link: https://lkml.kernel.org/r/20230601145904.1385409-12-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
70e986c3b4 fuse: update ki_pos in fuse_perform_write
Both callers of fuse_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-11-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:54 -07:00
Christoph Hellwig
44fff0fa08 fs: factor out a direct_write_fallback helper
Add a helper dealing with handling the syncing of a buffered write
fallback for direct I/O.

Link: https://lkml.kernel.org/r/20230601145904.1385409-10-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
8ee93b4bb6 iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
Use the common helpers for direct I/O page invalidation instead of open
coding the logic.  This leads to a slight reordering of checks in
__iomap_dio_rw to keep the logic straight.

Link: https://lkml.kernel.org/r/20230601145904.1385409-9-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
219580eea1 iomap: update ki_pos in iomap_file_buffered_write
All callers of iomap_file_buffered_write need to updated ki_pos, move it
into common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
c402a9a943 filemap: add a kiocb_invalidate_post_direct_write helper
Add a helper to invalidate page cache after a dio write.

Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:53 -07:00
Christoph Hellwig
182c25e9c1 filemap: update ki_pos in generic_perform_write
All callers of generic_perform_write need to updated ki_pos, move it into
common code.

Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:52 -07:00
Christoph Hellwig
936e114a24 iomap: update ki_pos a little later in iomap_dio_complete
Move the ki_pos update down a bit to prepare for a better common helper
that invalidates pages based of an iocb.

Link: https://lkml.kernel.org/r/20230601145904.1385409-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:52 -07:00
Christoph Hellwig
0d625446d0 backing_dev: remove current->backing_dev_info
Patch series "cleanup the filemap / direct I/O interaction", v4.

This series cleans up some of the generic write helper calling conventions
and the page cache writeback / invalidation for direct I/O.  This is a
spinoff from the no-bufferhead kernel project, for which we'll want to an
use iomap based buffered write path in the block layer.


This patch (of 12):

The last user of current->backing_dev_info disappeared in commit
b9b1335e64 ("remove bdi_congested() and wb_congested() and related
functions").  Remove the field and all assignments to it.

Link: https://lkml.kernel.org/r/20230601145904.1385409-1-hch@lst.de
Link: https://lkml.kernel.org/r/20230601145904.1385409-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:51 -07:00
Lorenzo Stoakes
ca5e863233 mm/gup: remove vmas parameter from get_user_pages_remote()
The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-

- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
  remove it.

- __access_remote_vm() was already using vma_lookup() when the original
  lookup failed so by doing the lookup directly this also de-duplicates the
  code.

We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().

As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.

This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.

[akpm@linux-foundation.org: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64)
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390)
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:26 -07:00
Yuanchu Xie
7bab8dfb12 mm: pagemap: restrict pagewalk to the requested range
The pagewalk in pagemap_read reads one PTE past the end of the requested
range, and stops when the buffer runs out of space. While it produces
the right result, the extra read is unnecessary and less performant.

I timed the following command before and after this patch:
	dd count=100000 if=/proc/self/pagemap of=/dev/null
The results are consistently within 0.001s across 5 runs.

Before:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0763159 s, 671 MB/s

real    0m0.078s
user    0m0.012s
sys     0m0.065s

After:
100000+0 records in
100000+0 records out
51200000 bytes (51 MB) copied, 0.0487928 s, 1.0 GB/s

real    0m0.050s
user    0m0.011s
sys     0m0.039s

Link: https://lkml.kernel.org/r/20230515172608.3558391-1-yuanchu@google.com
Signed-off-by: Yuanchu Xie <yuanchu@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Zach O'Keefe <zokeefe@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:20 -07:00
Ackerley Tng
adef080382 fs: hugetlbfs: set vma policy only when needed for allocating folio
Calling hugetlb_set_vma_policy() later avoids setting the vma policy
and then dropping it on a page cache hit.

Link: https://lkml.kernel.org/r/20230502235622.3652586-1-ackerleytng@google.com
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:17 -07:00
Yosry Ahmed
2816ea2abf writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.

A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic.  This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted.  There were two remaining atomic flushing
contexts after that series.  This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.

The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()

For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
criticial section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.

For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled.  We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter.  This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.

After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
remove them and simplify the code.

[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/


This patch (of 5):

wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.

Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquring the
lock, so this refactoring looks safe.

Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:14 -07:00
Linus Torvalds
7e8c948b3f A fix for a potential data corruption in differential backup and
snapshot-based mirroring scenarios in RBD and a reference counting
 fixup to avoid use-after-free in CephFS, all marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmSDUh4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi0guB/4l7wOFnFvC+Dz5Y0KKuq2zFGQ64eZM
 hVpKEANsV/py/MTOdCzhW5cNcNj5/g8+1eozGxA8IzckzWf+25ziIn+BNWOO7DK1
 eO1U0wdiFnkXzr3nKSqNqm+hrUupAUd4Rb6644I4FwWKRu1WQydRjmvFVE+gw86O
 eeXujr3IlhhDF/VqO0sekCx9MaFPQaCaoscM3gU04meKAG84jt3oezueOlRqYFTX
 batwJ33wzVtLSh1NJIhC0iBMuBgvnuqQ9R8bHTdSNkR8Ov4V3B4DQGL4lmYnBxbv
 L3fMcz+sdZu3bDptUta4ZgdS4LkxUUUUEK07XeoBhAjZ3qPrMiD/gXay
 =S7Tu
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for a potential data corruption in differential backup and
  snapshot-based mirroring scenarios in RBD and a reference counting
  fixup to avoid use-after-free in CephFS, all marked for stable"

* tag 'ceph-for-6.4-rc6' of https://github.com/ceph/ceph-client:
  ceph: fix use-after-free bug for inodes when flushing capsnaps
  rbd: get snapshot context after exclusive lock is ensured to be held
  rbd: move RBD_OBJ_FLAG_COPYUP_ENABLED flag setting
2023-06-09 10:53:58 -07:00
Linus Torvalds
8fc1c596c2 Fix an ext4 regression which breaks remounting r/w file systems that
have the quota feature enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmSClpwACgkQ8vlZVpUN
 gaME9AgAhKZuMUc9HxcMmXgkrVVTYh18OQVNwb8yI09gVspUkNC43mgwtYLAL6XT
 fr1ifzSDgjKYSAaSXALP//zxo/xGAl1cKfjLz/XqeqU2TlVnQyvUTkO5ywe2YYHo
 UZ2HbbiYXKH1c5PsqcwaiPmcf/sZvrXjvpvP4v4iO/b58UBKG5OIIlP6X/XgArco
 KYeMuFZYOX+jsgP/9+I+cutUEa9paHnMxycmPARdrnem3eA0jZtS7YrLJijxFmbV
 c4AFcbtuaW+ZmMHPsxtYoeQxM7TbL3vA0FghS14MxtdCipLkv5MTPnstaW/jGnST
 IlHLwLADCktbusV8UVJotWfAWcKeGw==
 =lKq5
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
 "Fix an ext4 regression which breaks remounting r/w file systems that
  have the quota feature enabled"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: only check dquot_initialize_needed() when debugging
  Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
2023-06-09 08:23:01 -07:00
David Howells
219d92056b splice, net: Fix SPLICE_F_MORE signalling in splice_direct_to_actor()
splice_direct_to_actor() doesn't manage SPLICE_F_MORE correctly[1] - and,
as a result, it incorrectly signals/fails to signal MSG_MORE when splicing
to a socket.  The problem I'm seeing happens when a short splice occurs
because we got a short read due to hitting the EOF on a file: as the length
read (read_len) is less than the remaining size to be spliced (len),
SPLICE_F_MORE (and thus MSG_MORE) is set.

The issue is that, for the moment, we have no way to know *why* the short
read occurred and so can't make a good decision on whether we *should* keep
MSG_MORE set.

MSG_SENDPAGE_NOTLAST was added to work around this, but that is also set
incorrectly under some circumstances - for example if a short read fills a
single pipe_buffer, but the next read would return more (seqfile can do
this).

This was observed with the multi_chunk_sendfile tests in the tls kselftest
program.  Some of those tests would hang and time out when the last chunk
of file was less than the sendfile request size:

	build/kselftest/net/tls -r tls.12_aes_gcm.multi_chunk_sendfile

This has been observed before[2] and worked around in AF_TLS[3].

Fix this by making splice_direct_to_actor() always signal SPLICE_F_MORE if
we haven't yet hit the requested operation size.  SPLICE_F_MORE remains
signalled if the user passed it in to splice() but otherwise gets cleared
when we've read sufficient data to fulfill the request.

If, however, we get a premature EOF from ->splice_read(), have sent at
least one byte and SPLICE_F_MORE was not set by the caller, ->splice_eof()
will be invoked.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org

Link: https://lore.kernel.org/r/499791.1685485603@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1591392508-14592-1-git-send-email-pooja.trivedi@stackpath.com/ [2]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d452d48b9f8b1a7f8152d33ef52cfd7fe1735b0a [3]
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:31 -07:00
David Howells
2bfc668509 splice, net: Add a splice_eof op to file-ops and socket-ops
Add an optional method, ->splice_eof(), to allow splice to indicate the
premature termination of a splice to struct file_operations and struct
proto_ops.

This is called if sendfile() or splice() encounters all of the following
conditions inside splice_direct_to_actor():

 (1) the user did not set SPLICE_F_MORE (splice only), and

 (2) an EOF condition occurred (->splice_read() returned 0), and

 (3) we haven't read enough to fulfill the request (ie. len > 0 still), and

 (4) we have already spliced at least one byte.

A further patch will modify the behaviour of SPLICE_F_MORE to always be
passed to the actor if either the user set it or we haven't yet read
sufficient data to fulfill the request.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: David Hildenbrand <david@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: linux-mm@kvack.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:30 -07:00
David Howells
2dc334f1a6 splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
Replace generic_splice_sendpage() + splice_from_pipe + pipe_to_sendpage()
with a net-specific handler, splice_to_socket(), that calls sendmsg() with
MSG_SPLICE_PAGES set instead of calling ->sendpage().

MSG_MORE is used to indicate if the sendmsg() is expected to be followed
with more data.

This allows multiple pipe-buffer pages to be passed in a single call in a
BVEC iterator, allowing the processing to be pushed down to a loop in the
protocol driver.  This helps pave the way for passing multipage folios down
too.

Protocols that haven't been converted to handle MSG_SPLICE_PAGES yet should
just ignore it and do a normal sendmsg() for now - although that may be a
bit slower as it may copy everything.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 19:40:30 -07:00
Jakub Kicinski
449f6bc17a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

Conflicts:

net/sched/sch_taprio.c
  d636fc5dd6 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping")
  dced11ef84 ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()")

net/ipv4/sysctl_net_ipv4.c
  e209fee411 ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294")
  ccce324dab ("tcp: make the first N SYN RTO backoffs linear")
https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08 11:35:14 -07:00
Linus Torvalds
79b6fad546 xfs: fixes for 6.4-rc5
This update contains:
 - Propagate unlinked inode list corruption back up to log recovery (regression
   fix).
 - improve corruption detection for AGFL entries, AGFL indexes and XEFI extents
   (syzkaller fuzzer oops report).
 - Avoid double perag reference release (regression fix).
 - Improve extent merging detection in scrub (regression fix).
 - Fix a new undefined high bit shift (regression fix).
 - Fix for AGF vs inode cluster buffer deadlock (regression fix).
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmSBcAIUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h1gEg/+IfG2aNR8P4+rqhOJ2yF5fZYtsqS1
 HkOX/N/Q8gBNXMwh3wWoXyBJk7gBhwySwcGvlYXoMZf6+alXGHaTMl8whxmFYaT9
 +aPBWo4lXRec6YHx016ZOjnNkLiWhyxdUvh85IFf0EJm5mK9QqjoX+lmbPc7HDzh
 0nFL66jaxM8W36QhK0srdwwjD3kNgZ2ZRNonlRULOzyTPpFfh985esTrmfmn3Ulx
 xiejw57xdpti9x+Pm5WZjUsW1/gx50hMS+yn/KiIWTQqncIO/OuirZSTrOFtUWTM
 xIfMB9xlkdaSmMCUyx2r2RVWJawXP++aT7nbza2eWJa0WSn5kZmHXugzI+V9zUx7
 M0oakkOXJl2pYakVr7G8JU4djZkQNu41JkuLVf5U7O3yYRWlXzViAqljd3S2C/+i
 pSjG9ram8esd/CAmw/hE6Jvhm6QYS1/D3KQ9Gs6JaptzR8Xjc7t7GEj1T0pMyPem
 iZx80C6fi87k/94hQ+HXalrAyJER9EmcQ25yngKucjgfrO0BrzNLGDus4uY0+IzX
 Y2T6xcSF/Vhd1soaklRuHryF7Vv7ECCIWVUV2pH7GHZwv0LaXvBnC1CUdYXskXy8
 RGlIBL75lBkOfZp0zK/R11sm1qxzXPCayBkZtSglj/RdNaZiFO9uwO139eGl+xP5
 ytjMvitThOGoAfU=
 =vT74
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Dave Chinner:
 "These are a set of regression fixes discovered on recent kernels. I
  was hoping to send this to you a week and half ago, but events out of
  my control delayed finalising the changes until early this week.

  Whilst the diffstat looks large for this stage of the merge window, a
  large chunk of it comes from moving the guts of one function from one
  file to another i.e. it's the same code, it is just run in a different
  context where it is safe to hold a specific lock. Otherwise the
  individual changes are relatively small and straigtht forward.

  Summary:

   - Propagate unlinked inode list corruption back up to log recovery
     (regression fix)

   - improve corruption detection for AGFL entries, AGFL indexes and
     XEFI extents (syzkaller fuzzer oops report)

   - Avoid double perag reference release (regression fix)

   - Improve extent merging detection in scrub (regression fix)

   - Fix a new undefined high bit shift (regression fix)

   - Fix for AGF vs inode cluster buffer deadlock (regression fix)"

* tag 'xfs-6.4-rc5-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: collect errors from inodegc for unlinked inode recovery
  xfs: validate block number being freed before adding to xefi
  xfs: validity check agbnos on the AGFL
  xfs: fix agf/agfl verification on v4 filesystems
  xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()
  xfs: fix broken logic when detecting mergeable bmap records
  xfs: Fix undefined behavior of shift into sign bit
  xfs: fix AGF vs inode cluster buffer deadlock
  xfs: defered work could create precommits
  xfs: restore allocation trylock iteration
  xfs: buffer pins need to hold a buffer reference
2023-06-08 08:46:58 -07:00
Theodore Ts'o
dea9d8f764 ext4: only check dquot_initialize_needed() when debugging
ext4_xattr_block_set() relies on its caller to call dquot_initialize()
on the inode.  To assure that this has happened there are WARN_ON
checks.  Unfortunately, this is subject to false positives if there is
an antagonist thread which is flipping the file system at high rates
between r/o and rw.  So only do the check if EXT4_XATTR_DEBUG is
enabled.

Link: https://lore.kernel.org/r/20230608044056.GA1418535@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-08 10:06:40 -04:00
Theodore Ts'o
1b29243933 Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled"
This reverts commit a44be64bbe.

Link: https://lore.kernel.org/r/653b3359-2005-21b1-039d-c55ca4cffdcc@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-08 09:57:04 -04:00
Christoph Hellwig
4bb218a65a fs: unexport buffer_check_dirty_writeback
buffer_check_dirty_writeback is only used by the block device aops,
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230608122958.276954-1-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-08 14:39:57 +02:00
Qu Wenruo
79b8ee702c btrfs: scrub: also report errors hit during the initial read
[BUG]
After the recent scrub rework introduced in commit e02ee89baa ("btrfs:
scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
btrfs scrub no longer reports repaired errors any more:

  # mkfs.btrfs -f $dev -d DUP
  # mount $dev $mnt
  # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
  # umount $dev
  # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
  # mount $dev $mnt
  # btrfs scrub start -BR $mnt
  scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
  Scrub started:    Tue Jun  6 14:56:50 2023
  Status:           finished
  Duration:         0:00:00
  	data_extents_scrubbed: 2
  	tree_extents_scrubbed: 18
  	data_bytes_scrubbed: 131072
  	tree_bytes_scrubbed: 294912
  	read_errors: 0
  	csum_errors: 0 <<< No errors here
  	verify_errors: 0
         [...]
  	uncorrectable_errors: 0
  	unverified_errors: 0
  	corrected_errors: 16		<<< Only corrected errors
  	last_physical: 2723151872

This can confuse btrfs-progs, as it relies on the csum_errors to
determine if there is anything wrong.

While on v6.3.x kernels, the report is different:

 	csum_errors: 16			<<<
 	verify_errors: 0
	[...]
 	uncorrectable_errors: 0
 	unverified_errors: 0
 	corrected_errors: 16 <<<

[CAUSE]
In the reworked scrub, we update the scrub progress inside
scrub_stripe_report_errors(), using various bitmaps to update the
result.

For example for csum_errors, we use bitmap_weight() of
stripe->csum_error_bitmap.

Unfortunately at that stage, all error bitmaps (except
init_error_bitmap) are the result of the latest repair attempt, thus if
the stripe is fully repaired, those error bitmaps will all be empty,
resulting the above output mismatch.

To fix this, record the number of errors into stripe->init_nr_*_errors.
Since we don't really care about where those errors are, we only need to
record the number of errors.

Then in scrub_stripe_report_errors(), use those initial numbers to
update the progress other than using the latest error bitmaps.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 14:34:01 +02:00
Qu Wenruo
1f2030ff6e btrfs: scrub: respect the read-only flag during repair
[BUG]
With recent scrub rework, the scrub operation no longer respects the
read-only flag passed by "-r" option of "btrfs scrub start" command.

  # mkfs.btrfs -f -d raid1 $dev1 $dev2
  # mount $dev1 $mnt
  # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file
  # sync
  # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1
  # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2
  # mount $dev1 $mnt -o ro
  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
  Scrub started:    Tue Jun  6 09:59:14 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 16 <<< Still has corrupted sectors
  	last_physical: 1351614464

  # btrfs scrub start -BrRd $mnt
  Scrub device $dev1 (id 1) done
  Scrub started:    Tue Jun  6 10:00:17 2023
  Status:           finished
  Duration:         0:00:00
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

  Scrub device $dev2 (id 2) done
         [...]
  	corrected_errors: 0 <<< No more errors
  	last_physical: 1372585984

[CAUSE]
In the newly reworked scrub code, repair is always submitted no matter
if we're doing a read-only scrub.

[FIX]
Fix it by skipping the write submission if the scrub is a read-only one.

Unfortunately for the report part, even for a read-only scrub we will
still report it as corrected errors, as we know it's repairable, even we
won't really submit the write.

Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-08 13:56:38 +02:00
David Howells
f5f82cd187 Move netfs_extract_iter_to_sg() to lib/scatterlist.c
Move netfs_extract_iter_to_sg() to lib/scatterlist.c as it's going to be
used by more than just network filesystems (AF_ALG, for example).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
936dc763c5 Wrap lines at 80
Wrap a line at 80 to stop checkpatch complaining.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Simon Horman <simon.horman@corigine.com>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
3b9e9f72ba Fix a couple of spelling mistakes
Fix a couple of spelling mistakes in a comment.

Suggested-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/ZHH2mSRqeL4Gs1ft@corigine.com/
Link: https://lore.kernel.org/r/ZHH1nqZWOGzxlidT@corigine.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
David Howells
0d7aeb6870 Drop the netfs_ prefix from netfs_extract_iter_to_sg()
Rename netfs_extract_iter_to_sg() and its auxiliary functions to drop the
netfs_ prefix.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-crypto@vger.kernel.org
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: netdev@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-06-08 13:42:33 +02:00
Xiubo Li
409e873ea3 ceph: fix use-after-free bug for inodes when flushing capsnaps
There is a race between capsnaps flush and removing the inode from
'mdsc->snap_flush_list' list:

   == Thread A ==                     == Thread B ==
ceph_queue_cap_snap()
 -> allocate 'capsnapA'
 ->ihold('&ci->vfs_inode')
 ->add 'capsnapA' to 'ci->i_cap_snaps'
 ->add 'ci' to 'mdsc->snap_flush_list'
    ...
   == Thread C ==
ceph_flush_snaps()
 ->__ceph_flush_snaps()
  ->__send_flush_snap()
                                handle_cap_flushsnap_ack()
                                 ->iput('&ci->vfs_inode')
                                   this also will release 'ci'
                                    ...
				      == Thread D ==
                                ceph_handle_snap()
                                 ->flush_snaps()
                                  ->iterate 'mdsc->snap_flush_list'
                                   ->get the stale 'ci'
 ->remove 'ci' from                ->ihold(&ci->vfs_inode) this
   'mdsc->snap_flush_list'           will WARNING

To fix this we will increase the inode's i_count ref when adding 'ci'
to the 'mdsc->snap_flush_list' list.

[ idryomov: need_put int -> bool ]

Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-06-08 08:56:25 +02:00
Thomas Weißschuh
6217642027 fs: avoid empty option when generating legacy mount string
As each option string fragment is always prepended with a comma it would
happen that the whole string always starts with a comma. This could be
interpreted by filesystem drivers as an empty option and may produce
errors.

For example the NTFS driver from ntfs.ko behaves like this and fails
when mounted via the new API.

Link: https://github.com/util-linux/util-linux/issues/2298
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Fixes: 3e1aeb00e6 ("vfs: Implement a filesystem superblock creation/configuration context")
Cc: stable@vger.kernel.org
Message-Id: <20230607-fs-empty-option-v1-1-20c8dbf4671b@weissschuh.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07 21:49:55 +02:00
David Howells
a27648c742 afs: Fix setting of mtime when creating a file/dir/symlink
kafs incorrectly passes a zero mtime (ie. 1st Jan 1970) to the server when
creating a file, dir or symlink because the mtime recorded in the
afs_operation struct gets passed to the server by the marshalling routines,
but the afs_mkdir(), afs_create() and afs_symlink() functions don't set it.

This gets masked if a file or directory is subsequently modified.

Fix this by filling in op->mtime before calling the create op.

Fixes: e49c7b2f6d ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-07 09:03:12 -07:00
Miklos Szeredi
a9d1c4c6df fuse: revalidate: don't invalidate if interrupted
If the LOOKUP request triggered from fuse_dentry_revalidate() is
interrupted, then the dentry will be invalidated, possibly resulting in
submounts being unmounted.

Reported-by: Xu Rongbo <xurongbo@baidu.com>
Closes: https://lore.kernel.org/all/CAJfpegswN_CJJ6C3RZiaK6rpFmNyWmXfaEpnQUJ42KCwNF5tWw@mail.gmail.com/
Fixes: 9e6268db49 ("[PATCH] FUSE - read-write operations")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
Bernd Schubert
3066ff9347 fuse: Apply flags2 only when userspace set the FUSE_INIT_EXT
This is just a safety precaution to avoid checking flags on memory that was
initialized on the user space side.  libfuse zeroes struct fuse_init_out
outarg, but this is not guranteed to be done in all implementations.
Better is to act on flags and to only apply flags2 when FUSE_INIT_EXT is
set.

There is a risk with this change, though - it might break existing user
space libraries, which are already using flags2 without setting
FUSE_INIT_EXT.

The corresponding libfuse patch is here
https://github.com/libfuse/libfuse/pull/662

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Fixes: 53db28933e ("fuse: extend init flags")
Cc: <stable@vger.kernel.org> # v5.17
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 17:49:20 +02:00
zyfjeff
1c3610d30e fuse: remove duplicate check for nodeid
before this check, the nodeid has already been checked once, so
the check here doesn't make an sense, so remove the check for
nodeid here.

            if (err || !outarg->nodeid)
                    goto out_put_forget;

            err = -EIO;
    >>>     if (!outarg->nodeid)
                    goto out_put_forget;

Signed-off-by: zyfjeff <zyfjeff@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:34 +02:00
Miklos Szeredi
5cadfbd5a1 fuse: add feature flag for expire-only
Add an init flag idicating whether the FUSE_EXPIRE_ONLY flag of
FUSE_NOTIFY_INVAL_ENTRY is effective.

This is needed for backports of this feature, otherwise the server could
just check the protocol version.

Fixes: 4f8d37020e ("fuse: add "expire only" mode to FUSE_NOTIFY_INVAL_ENTRY")
Cc: <stable@vger.kernel.org> # v6.2
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2023-06-07 16:26:33 +02:00
Jan Kara
2454ad83b9
fs: Restrict lock_two_nondirectories() to non-directory inodes
Currently lock_two_nondirectories() is skipping any passed directories.
After vfs_rename() uses lock_two_inodes(), all the remaining four users
of this function pass only regular files to it. So drop the somewhat
unusual "skip directory" logic and instead warn if anybody passes
directory to it. This also allows us to use lock_two_inodes() in
lock_two_nondirectories() to concentrate the lock ordering logic in less
places.

Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-6-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-07 09:15:11 +02:00
Chris Mason
981a37bab5 btrfs: properly enable async discard when switching from RO->RW
The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info
to force discards off when the filesystem has aborted or we're generally
not able to run discards.  This gets flipped on when we're mounted rw,
and also when we go from ro->rw.

Commit 63a7cb1307 ("btrfs: auto enable discard=async when possible")
enabled async discard by default, and this meant
"mount -o ro /dev/xxx /yyy" had async discards turned on.

Unfortunately, this meant our check in btrfs_remount_cleanup() would see
that discards are already on:

    /* If we toggled discard async */
    if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
	btrfs_test_opt(fs_info, DISCARD_ASYNC))
	    btrfs_discard_resume(fs_info);

So, we'd never call btrfs_discard_resume() when remounting the root
filesystem from ro->rw.

drgn shows this really nicely:

import os
import sys

from drgn.helpers.linux.fs import path_lookup
from drgn import NULL, Object, Type, cast

def btrfs_sb(sb):
    return cast("struct btrfs_fs_info *", sb.s_fs_info)

if len(sys.argv) == 1:
    path = "/"
else:
    path = sys.argv[1]

fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info)

BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING']
if fs_info.flags & BTRFS_FS_DISCARD_RUNNING:
    print("discard running flag is on")
else:
    print("discard running flag is off")

[root]# mount | grep nvme
/dev/nvme0n1p3 on / type btrfs
(rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

[root]# ./discard_running.drgn
discard running flag is off

[root]# mount -o remount,discard=sync /
[root]# mount -o remount,discard=async /
[root]# ./discard_running.drgn
discard running flag is on

The fix is to call btrfs_discard_resume() when we're going from ro->rw.
It already checks to make sure the async discard flag is on, so it'll do
the right thing.

Fixes: 63a7cb1307 ("btrfs: auto enable discard=async when possible")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-06 19:44:22 +02:00
Bob Peterson
f9da18cd46 gfs2: Don't remember delete unless it's successful
This patch changes function evict_unlinked_inode so it does not call
gfs2_inode_remember_delete until it gets a good return code from
gfs2_dinode_dealloc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
17a5934653 gfs2: Update rl_unlinked before releasing rgrp lock
Function gfs2_free_di was changing the rgrp lvb count of unlinked
dinodes after the lock was released. This patch moves it inside the
lock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
dac0fc31be gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
This patch fixes a case in which function gfs2_quota_hold encounters an
assert error and exits. The lack of gfs2_qa_put causes further problems
when the inode is evicted and the get/put count is non-zero.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
9b620429ec gfs2: ignore rindex_update failure in dinode_dealloc
Before this patch, function gfs2_dinode_dealloc would abort if it got a
bad return code from gfs2_rindex_update(). The problem is that it left the
dinode in the unlinked (not free) state, which meant subsequent fsck
would clean it up and flag an error. That meant some of our QE tests
would fail.

The sole purpose of gfs2_rindex_update(), in this code path, is to read in
any newer rgrps added by gfs2_grow. But since this is a delete operation
it won't actually use any of those new rgrps. It can really only twiddle
the bits from "Unlinked" to "Free" in an existing rgrp. Therefore the
error should not prevent the transition from unlinked to free.

This patch makes gfs2_dinode_dealloc ignore the bad return code and
proceed with freeing the dinode so the QE tests will not be tripped up.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
e4f82bf21f gfs2: fix minor comment typos
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Bob Peterson
a9b0f6f4ad gfs2: simplify gdlm_put_lock with out_free label
This patch introduces a new out_free label and consolidates the three
places function gdlm_put_lock freed the glock.  No change in
functionality.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-06 18:35:06 +02:00
Kirill A. Shutemov
dcdfdd40fa mm: Add support for unaccepted memory
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond the point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize time the
    system experience latency spikes on memory allocation while keeping
    low boot time.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks anything within the range
   of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
2023-06-06 16:38:22 +02:00
Linus Torvalds
0bdd0f0bf1 gfs2 fix
- Don't get stuck writing page onto itself under direct I/O.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmR/KcEUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTr3wA//eCaUYWKOiSXbbvNeP1pdawN8zRZO
 alpK/nsujcB4bDdiDUTREYFBfcmBKEIboFz6e02DL8MPp2mc0fZ/Ox9yq/nR7o9x
 9Y1CjWmC1zoUkqw6V8+vbg0m432OlcWglppgywHjvyUiEnyUzBfnxIRH/k0lLYor
 yR7vViSZQhJ4jroSeEVNKsCeZUiY7y5tiLo+bcHYYF7lolab/ZfNxacr1/lSAuww
 WS0frjAeSBneQ5aU2JQ60lbJcJQydfdoS3n0dlyX6qJeVoFCnhQAJeiVSQaYMFpE
 HaYFs/3YGjzAkoWqX5CAzLIfxsIHepdaP4PtITg3xwyMQ1j3X5H5n1FOyJCcbXz7
 s7gq/RXjU3TV/hVPSukVGhwjavB8gMrbQtmpSYHpA99ldeNfIVwps2PSWdoDND/B
 7w+g0T5U+yd7DNz2tL9YML7Anioc0K6y1hVvacuPIgNHJyLQ5XaPYJXUUJFfl0X6
 njcZVmfK56RQRPR7jDp26F4X+Pw+GjahJJq05zCwsmFnP6+pX2gjYNx9LidXBugU
 1L8BlN2IZvc9ShvInZIuQHwTMEXAMjjSd5JtX7iZLnHxeWDfIlo64ZrroiiNbrk6
 pDqG+C6fkYV1h1fzX3bvFZIvFoKERxu9u9TunBIiBQFHQNYWoe2i5zpZNXyT7Soo
 epIqTaWmf8l/3vo=
 =9syc
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Don't get stuck writing page onto itself under direct I/O

* tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't get stuck writing page onto itself under direct I/O
2023-06-06 05:49:06 -07:00
Andy Shevchenko
7afb6d8fa8 jbd2: Avoid printing outside the boundary of the buffer
Theoretically possible that "%pg" will take all room for the j_devname
and hence the "-%lu" will go outside the boundary due to unconditional
sprintf() in use. To make this code more robust, replace two sequential
s*printf():s by a single call and then replace forbidden character.
It's possible to do this way, because '/' won't ever be in the result
of "-%lu".

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230605170553.7835-2-andriy.shevchenko@linux.intel.com
2023-06-05 15:31:12 -07:00
Qu Wenruo
917ac77846 btrfs: subpage: fix a crash in metadata repair path
[BUG]
Test case btrfs/027 would crash with subpage (64K page size, 4K
sectorsize) with the following dying messages:

  debug: map_length=16384 length=65536 type=metadata|raid6(0x104)
  assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/messages.c:259!
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Call trace:
   btrfs_assertfail+0x28/0x2c [btrfs]
   btrfs_map_repair_block+0x150/0x2b8 [btrfs]
   btrfs_repair_io_failure+0xd4/0x31c [btrfs]
   btrfs_read_extent_buffer+0x150/0x16c [btrfs]
   read_tree_block+0x38/0xbc [btrfs]
   read_tree_root_path+0xfc/0x1bc [btrfs]
   btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs]
   open_ctree+0xa30/0x172c [btrfs]
   btrfs_mount_root+0x3c4/0x4a4 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   vfs_kern_mount.part.0+0x90/0xd4
   vfs_kern_mount+0x14/0x28
   btrfs_mount+0x114/0x418 [btrfs]
   legacy_get_tree+0x30/0x60
   vfs_get_tree+0x28/0xec
   path_mount+0x3e0/0xb64
   __arm64_sys_mount+0x200/0x2d8
   invoke_syscall+0x48/0x114
   el0_svc_common.constprop.0+0x60/0x11c
   do_el0_svc+0x38/0x98
   el0_svc+0x40/0xa8
   el0t_64_sync_handler+0xf4/0x120
   el0t_64_sync+0x190/0x194
  Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000)

[CAUSE]
In btrfs/027 we test RAID6 with missing devices, in this particular
case, we're repairing a metadata at the end of a data stripe.

But at btrfs_repair_io_failure(), we always pass a full PAGE for repair,
and for subpage case this can cross stripe boundary and lead to the
above BUG_ON().

This metadata repair code is always there, since the introduction of
subpage support, but this can trigger BUG_ON() after the bio split
ability at btrfs_map_bio().

[FIX]
Instead of passing the old PAGE_SIZE, we calculate the correct length
based on the eb size and page size for both regular and subpage cases.

CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-05 19:21:57 +02:00
Christoph Hellwig
cf056a4312 init: improve the name_to_dev_t interface
name_to_dev_t has a very misleading name, that doesn't make clear
it should only be used by the early init code, and also has a bad
calling convention that doesn't allow returning different kinds of
errors.  Rename it to early_lookup_bdev to make the use case clear,
and return an errno, where -EINVAL means the string could not be
parsed, and -ENODEV means it the string was valid, but there was
no device found for it.

Also stub out the whole call for !CONFIG_BLOCK as all the non-block
root cases are always covered in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:56:46 -06:00
Christoph Hellwig
dd2e31afba ext4: wire up the ->mark_dead holder operation for log devices
Implement a set of holder_ops that shut down the file system when the
block device used as log device is removed undeneath the file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
f5db130d44 ext4: wire up sops->shutdown
Wire up the shutdown method to shut down the file system when the
underlying block device is marked dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
97524b454b ext4: split ext4_shutdown
Split ext4_shutdown into a low-level helper that will be reused for
implementing the shutdown super operation and a wrapper for the ioctl
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
8067ca1dcd xfs: wire up the ->mark_dead holder operation for log and RT devices
Implement a set of holder_ops that shut down the file system when the
block device used as log or RT device is removed undeneath the file
system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
e7caa877e5 xfs: wire up sops->shutdown
Wire up the shutdown method to shut down the file system when the
underlying block device is marked dead.  Add a new message to
clearly distinguish this shutdown reason from other shutdowns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
87efb39075 fs: add a method to shut down the file system
Add a new ->shutdown super operation that can be used to tell the file
system to shut down, and call it from newly created holder ops when the
block device under a file system shuts down.

This only covers the main block device for "simple" file systems using
get_tree_bdev / mount_bdev.  File systems their own get_tree method
or opening additional devices will need to set up their own
blk_holder_ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
0718afd47f block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Ye Bin
d6a95db3c7 quota: fix warning in dqgrab()
There's issue as follows when do fault injection:
WARNING: CPU: 1 PID: 14870 at include/linux/quotaops.h:51 dquot_disable+0x13b7/0x18c0
Modules linked in:
CPU: 1 PID: 14870 Comm: fsconfig Not tainted 6.3.0-next-20230505-00006-g5107a9c821af-dirty #541
RIP: 0010:dquot_disable+0x13b7/0x18c0
RSP: 0018:ffffc9000acc79e0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88825e41b980
RDX: 0000000000000000 RSI: ffff88825e41b980 RDI: 0000000000000002
RBP: ffff888179f68000 R08: ffffffff82087ca7 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed102f3ed026 R12: ffff888179f68130
R13: ffff888179f68110 R14: dffffc0000000000 R15: ffff888179f68118
FS:  00007f450a073740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe96f2efd8 CR3: 000000025c8ad000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 dquot_load_quota_sb+0xd53/0x1060
 dquot_resume+0x172/0x230
 ext4_reconfigure+0x1dc6/0x27b0
 reconfigure_super+0x515/0xa90
 __x64_sys_fsconfig+0xb19/0xd20
 do_syscall_64+0x39/0xb0
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Above issue may happens as follows:
ProcessA              ProcessB                    ProcessC
sys_fsconfig
  vfs_fsconfig_locked
   reconfigure_super
     ext4_remount
      dquot_suspend -> suspend all type quota

                 sys_fsconfig
                  vfs_fsconfig_locked
                    reconfigure_super
                     ext4_remount
                      dquot_resume
                       ret = dquot_load_quota_sb
                        add_dquot_ref
                                           do_open  -> open file O_RDWR
                                            vfs_open
                                             do_dentry_open
                                              get_write_access
                                               atomic_inc_unless_negative(&inode->i_writecount)
                                              ext4_file_open
                                               dquot_file_open
                                                dquot_initialize
                                                  __dquot_initialize
                                                   dqget
						    atomic_inc(&dquot->dq_count);

                          __dquot_initialize
                           __dquot_initialize
                            dqget
                             if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
                               ext4_acquire_dquot
			        -> Return error DQ_ACTIVE_B flag isn't set
                         dquot_disable
			  invalidate_dquots
			   if (atomic_read(&dquot->dq_count))
	                    dqgrab
			     WARN_ON_ONCE(!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
	                      -> Trigger warning

In the above scenario, 'dquot->dq_flags' has no DQ_ACTIVE_B is normal when
dqgrab().
To solve above issue just replace the dqgrab() use in invalidate_dquots() with
atomic_inc(&dquot->dq_count).

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230605140731.2427629-3-yebin10@huawei.com>
2023-06-05 16:50:30 +02:00
Jan Kara
6a4e336379 quota: Properly disable quotas when add_dquot_ref() fails
When add_dquot_ref() fails (usually due to IO error or ENOMEM), we want
to disable quotas we are trying to enable. However dquot_disable() call
was passed just the flags we are enabling so in case flags ==
DQUOT_USAGE_ENABLED dquot_disable() call will just fail with EINVAL
instead of properly disabling quotas. Fix the problem by always passing
DQUOT_LIMITS_ENABLED | DQUOT_USAGE_ENABLED to dquot_disable() in this
case.

Reported-and-tested-by: Ye Bin <yebin10@huawei.com>
Reported-by: syzbot+e633c79ceaecbf479854@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230605140731.2427629-2-yebin10@huawei.com>
2023-06-05 16:50:30 +02:00
Chuck Lever
82078b9895 NFSD: Ensure that xdr_write_pages updates rq_next_page
All other NFSv[23] procedures manage to keep page_ptr and
rq_next_page in lock step.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:44 -04:00
Chuck Lever
66a21db7db NFSD: Replace encode_cinfo()
De-duplicate "reserve_space; encode_cinfo".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:44 -04:00
Chuck Lever
adaa7a50d0 NFSD: Add encoders for NFSv4 clientids and verifiers
Deduplicate some common code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:44 -04:00
Chuck Lever
39d432fc76 NFSD: trace nfsctl operations
Add trace log eye-catchers that record the arguments used to
configure NFSD. This helps when troubleshooting the NFSD
administrative interfaces.

These tracepoints can capture NFSD start-up and shutdown times and
parameters, changes in lease time and thread count, and a request
to end the namespace's NFSv4 grace period, in addition to the set
of NFS versions that are enabled.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:43 -04:00
Chuck Lever
3434d7aa77 NFSD: Clean up nfsctl_transaction_write()
For easier readability, follow the common convention:

    if (error)
	handle_error;
    continue_normally;

No behavior change is expected.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Chuck Lever
442a629009 NFSD: Clean up nfsctl white-space damage
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:42 -04:00
Christian Brauner
2d8ae8c417 nfsd: use vfs setgid helper
We've aligned setgid behavior over multiple kernel releases. The details
can be found in commit cf619f8919 ("Merge tag 'fs.ovl.setgid.v6.2' of
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") and
commit 426b4ca2d6 ("Merge tag 'fs.setgid.v6.0' of
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux").
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Usually ATTR_KILL_SGID is
raised in e.g., chown_common() and is subject to the
setattr_should_drop_sgid() check to determine whether the setgid bit can
be retained. Since nfsd is raising ATTR_KILL_SGID unconditionally it
will cause notify_change() to strip it even if the caller had the
necessary privileges to retain it. Ensure that nfsd only raises
ATR_KILL_SGID if the caller lacks the necessary privileges to retain the
setgid bit.

Without this patch the setgid stripping tests in LTP will fail:

> As you can see, the problem is S_ISGID (0002000) was dropped on a
> non-group-executable file while chown was invoked by super-user, while

[...]

> fchown02.c:66: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

[...]

> chown02.c:57: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700

With this patch all tests pass.

Reported-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-06-05 09:01:41 -04:00
Fabio M. De Francesco
d0e135408e highmem: Rename put_and_unmap_page() to unmap_and_put_page()
With commit 849ad04cf5 ("new helper: put_and_unmap_page()"), Al Viro
introduced the put_and_unmap_page() to use in those many places where we
have a common pattern consisting of calls to kunmap_local() +
put_page().

Obviously, first we unmap and then we put pages. Instead, the original
name of this helper seems to imply that we first put and then unmap.

Therefore, rename the helper and change the only known upstreamed user
(i.e., fs/sysv) before this helper enters common use and might become
difficult to find all call sites and instead easy to break the builds.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Message-Id: <20230602103307.5637-1-fmdefrancesco@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-05 13:51:00 +02:00
David Howells
79aa284946 cachefiles: Allow the cache to be non-root
Set mode 0600 on files in the cache so that cachefilesd can run as an
unprivileged user rather than leaving the files all with 0.  Directories
are already set to 0700.

Userspace then needs to set the uid and gid before issuing the "bind"
command and the cache must've been chown'd to those IDs.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
cc: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-erofs@lists.ozlabs.org
cc: linux-fsdevel@vger.kernel.org
Message-Id: <1853230.1684516880@warthog.procyon.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-05 10:55:15 +02:00
Greg Kroah-Hartman
16b58423b4 Merge 6.4-rc5 into driver-core-next
We need the driver core fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-05 07:37:37 +02:00
Dave Chinner
d4d12c02bf xfs: collect errors from inodegc for unlinked inode recovery
Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.

This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.

Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.

In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.

In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Dave Chinner
7dfee17b13 xfs: validate block number being freed before adding to xefi
Bad things happen in defered extent freeing operations if it is
passed a bad block number in the xefi. This can come from a bogus
agno/agbno pair from deferred agfl freeing, or just a bad fsbno
being passed to __xfs_free_extent_later(). Either way, it's very
difficult to diagnose where a null perag oops in EFI creation
is coming from when the operation that queued the xefi has already
been completed and there's no longer any trace of it around....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Dave Chinner
3148ebf2c0 xfs: validity check agbnos on the AGFL
If the agfl or the indexing in the AGF has been corrupted, getting a
block form the AGFL could return an invalid block number. If this
happens, bad things happen. Check the agbno we pull off the AGFL
and return -EFSCORRUPTED if we find somethign bad.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Dave Chinner
e0a8de7da3 xfs: fix agf/agfl verification on v4 filesystems
When a v4 filesystem has fl_last - fl_first != fl_count, we do not
not detect the corruption and allow the AGF to be used as it if was
fully valid. On V5 filesystems, we reset the AGFL to empty in these
cases and avoid the corruption at a small cost of leaked blocks.

If we don't catch the corruption on V4 filesystems, bad things
happen later when an allocation attempts to trim the free list
and either double-frees stale entries in the AGFl or tries to free
NULLAGBNO entries.

Either way, this is bad. Prevent this from happening by using the
AGFL_NEED_RESET logic for v4 filesysetms, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Dave Chinner
1e473279f4 xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()
xfs_bmap_longest_free_extent() can return an error when accessing
the AGF fails. In this case, the behaviour of
xfs_filestream_pick_ag() is conditional on the error. We may
continue the loop, or break out of it. The error handling after the
loop cleans up the perag reference held when the break occurs. If we
continue, the next loop iteration handles cleaning up the perag
reference.

EIther way, we don't need to release the active perag reference when
xfs_bmap_longest_free_extent() fails. Doing so means we do a double
decrement on the active reference count, and this causes tha active
reference count to fall to zero. At this point, new active
references will fail.

This leads to unmount hanging because it tries to grab active
references to that perag, only for it to fail. This happens inside a
loop that retries until a inode tree radix tree tag is cleared,
which cannot happen because we can't get an active reference to the
perag.

The unmount livelocks in this path:

  xfs_reclaim_inodes+0x80/0xc0
  xfs_unmount_flush_inodes+0x5b/0x70
  xfs_unmountfs+0x5b/0x1a0
  xfs_fs_put_super+0x49/0x110
  generic_shutdown_super+0x7c/0x1a0
  kill_block_super+0x27/0x50
  deactivate_locked_super+0x30/0x90
  deactivate_super+0x3c/0x50
  cleanup_mnt+0xc2/0x160
  __cleanup_mnt+0x12/0x20
  task_work_run+0x5e/0xa0
  exit_to_user_mode_prepare+0x1bc/0x1c0
  syscall_exit_to_user_mode+0x16/0x40
  do_syscall_64+0x40/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: eb70aa2d8e ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:15 +10:00
Darrick J. Wong
6be73cecb5 xfs: fix broken logic when detecting mergeable bmap records
Commit 6bc6c99a944c was a well-intentioned effort to initiate
consolidation of adjacent bmbt mapping records by setting the PREEN
flag.  Consolidation can only happen if the length of the combined
record doesn't overflow the 21-bit blockcount field of the bmbt
recordset.  Unfortunately, the length test is inverted, leading to it
triggering on data forks like these:

 EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET               TOTAL
   0: [0..16777207]:        76110848..92888055  0 (76110848..92888055) 16777208
   1: [16777208..20639743]: 92888056..96750591  0 (92888056..96750591)  3862536

Note that record 0 has a length of 16777208 512b blocks.  This
corresponds to 2097151 4k fsblocks, which is the maximum.  Hence the two
records cannot be merged.

However, the logic is still wrong even if we change the in-loop
comparison, because the scope of our examination isn't broad enough
inside the loop to detect mappings like this:

   0: [0..9]:               76110838..76110847  0 (76110838..76110847)       10
   1: [10..16777217]:       76110848..92888055  0 (76110848..92888055) 16777208
   2: [16777218..20639753]: 92888056..96750591  0 (92888056..96750591)  3862536

These three records could be merged into two, but one cannot determine
this purely from looking at records 0-1 or 1-2 in isolation.

Hoist the mergability detection outside the loop, and base its decision
making on whether or not a merged mapping could be expressed in fewer
bmbt records.  While we're at it, fix the incorrect return type of the
iter function.

Fixes: 336642f792 ("xfs: alert the user about data/attr fork mappings that could be merged")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 14:48:12 +10:00
Geert Uytterhoeven
4320f34666 xfs: Fix undefined behavior of shift into sign bit
With gcc-5:

    In file included from ./include/trace/define_trace.h:102:0,
		     from ./fs/xfs/scrub/trace.h:988,
		     from fs/xfs/scrub/trace.c:40:
    ./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’:
    ./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant
     #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */
				^

Shifting the (signed) value 1 into the sign bit is undefined behavior.

Fix this for all definitions in the file by shifting "1U" instead of
"1".

This was exposed by the first user added in commit 466c525d6d
("xfs: minimize overhead of drain wakeups by using jump labels").

Fixes: 160b5a7845 ("xfs: hoist the already_fixed variable to the scrub context")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 04:09:27 +10:00
Dave Chinner
82842fee6e xfs: fix AGF vs inode cluster buffer deadlock
Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.

For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.

Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.

This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.

To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.

However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.

This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.

Fixes: 298f7bec50 ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 04:08:27 +10:00
Dave Chinner
cb04211748 xfs: defered work could create precommits
To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 04:07:27 +10:00
Dave Chinner
00dcd17cfa xfs: restore allocation trylock iteration
It was accidentally dropped when refactoring the allocation code,
resulting in the AG iteration always doing blocking AG iteration.
This results in a small performance regression for a specific fsmark
test that runs more user data writer threads than there are AGs.

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 2edf06a50f ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 04:06:27 +10:00
Dave Chinner
89a4bf0dc3 xfs: buffer pins need to hold a buffer reference
When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count.  If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.

This results in a use after free that can only be triggered
in active filesystem shutdown situations.

To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.

Reported-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de> 
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-06-05 04:05:27 +10:00
Eric Biggers
13e2408d02 fsverity: simplify error handling in verify_data_block()
Clean up the error handling in verify_data_block() to (a) eliminate the
'err' variable which has caused some confusion because the function
actually returns a bool, (b) reduce the compiled code size slightly, and
(c) execute one fewer branch in the success case.

Link: https://lore.kernel.org/r/20230604022312.48532-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04 05:56:11 -07:00
Eric Biggers
d1f0c5ea04 fsverity: don't use bio_first_page_all() in fsverity_verify_bio()
bio_first_page_all(bio)->mapping->host is not compatible with large
folios, since the first page of the bio is not necessarily the head page
of the folio, and therefore it might not have the mapping pointer set.

Therefore, move the dereference of ->mapping->host into
verify_data_blocks(), which works with a folio.

(Like the commit that this Fixes, this hasn't actually been tested with
large folios yet, since the filesystems that use fs/verity/ don't
support that yet.  But based on code review, I think this is needed.)

Fixes: 5d0f0e57ed ("fsverity: support verifying data from large folios")
Link: https://lore.kernel.org/r/20230604022101.48342-1-ebiggers@kernel.org
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04 05:56:11 -07:00
Eric Biggers
32ab3c5e62 fsverity: constify fsverity_hash_alg
Now that fsverity_hash_alg doesn't have an embedded mempool, it can be
'const' almost everywhere.  Add it.

Link: https://lore.kernel.org/r/20230604022348.48658-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04 05:56:11 -07:00
Eric Biggers
8fcd94add6 fsverity: use shash API instead of ahash API
The "ahash" API, like the other scatterlist-based crypto APIs such as
"skcipher", comes with some well-known limitations.  First, it can't
easily be used with vmalloc addresses.  Second, the request struct can't
be allocated on the stack.  This adds complexity and a possible failure
point that needs to be worked around, e.g. using a mempool.

The only benefit of ahash over "shash" is that ahash is needed to access
traditional memory-to-memory crypto accelerators, i.e. drivers/crypto/.
However, this style of crypto acceleration has largely fallen out of
favor and been superseded by CPU-based acceleration or inline crypto
engines.  Also, ahash needs to be used asynchronously to take full
advantage of such hardware, but fs/verity/ has never done this.

On all systems that aren't actually using one of these ahash-only crypto
accelerators, ahash just adds unnecessary overhead as it sits between
the user and the underlying shash algorithms.

Also, XFS is planned to cache fsverity Merkle tree blocks in the
existing XFS buffer cache.  As a result, it will be possible for a
single Merkle tree block to be split across discontiguous pages
(https://lore.kernel.org/r/20230405233753.GU3223426@dread.disaster.area).
This data will need to be hashed.  It is easiest to work with a vmapped
address in this case.  However, ahash is incompatible with this.

Therefore, let's convert fs/verity/ from ahash to shash.  This
simplifies the code, and it should also slightly improve performance for
everyone who wasn't actually using one of these ahash-only crypto
accelerators, i.e. almost everyone (or maybe even everyone)!

Link: https://lore.kernel.org/r/20230516052306.99600-1-ebiggers@kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-06-04 05:54:28 -07:00
Linus Torvalds
6d7d0603ca Fix an ext4 regression which landed during the 6.4 merge window.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmR6SK0ACgkQ8vlZVpUN
 gaMc2wf+JVcIh8o2Xc9lvQlr3obVmoj867rY+7fM9VDvVjCU7/0y9Hhmshmt+kwh
 14qj9H9kjRwBavlJkip5T8fnDQt6oAmlf5l/n9jjNLr/A7XN9s0C8h9K6CeL56cD
 MmeAtI9CI4iIxuK/SAlCY3sFm/cCkUWN6j0RQopfbiu5GrWQ8yMEV29ovSonlWSo
 tw+Xb3x9kw1j/sSJsf3BXhbkYEotbZFN7gaxtbh5ll9iHSDIP6Mq0BsM142tuUJl
 u7Y3Or/e5TiAUzp2coidzkj6by9JLJFzNFTgQQkMFGlhPHZNeJnMwiv+ToHpUFM7
 ltTmfGkAHWxX45WpvI4S/GdQcyollQ==
 =8bbS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
 "Fix an ext4 regression which landed during the 6.4 merge window"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
2023-06-02 17:25:22 -04:00
Linus Torvalds
e0178b546d for-6.4-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmR6NA4ACgkQxWXV+ddt
 WDuySw//TLkn3Q2UXZrxbcC9npTvVtIl8bm/UeRNY14Q4/ImC/HHNgAmIlO33J0c
 6/kqoujHLkXWhOyLME9QfqgMwhOEWz1kluU6vXpNQ0i3CE/4T9jceAphqxLcLhjr
 TtnV5SkGbgs+tsAyADfoFB/659JNo+zC4ZN1tSa/TFoZ7xbx7CkCGaAt4V8kkrQw
 BdcKMHBoN9CJE3waatAEcZPqUobEi0Wc+3W38fNOmFJoo3CQXobc5Rb5+1dEOy2G
 nEdfe/HUYVfT4PaSHS4ollQ2ajG+BXOOjd2X4ux2w7dk3iSkcIJFSu942vdtgM6Y
 ygeuhd4cZu6VCYN7lz0qbl8+t5rcRgErKMT5KiJ9fFQ7JDgRGTb6Mr+loPzxlbZ0
 bOgXvqb4mCNrPiQjzuNqUnr5AzD0X2ObTX0g9IsInJaiH7BtGRwBL/FWeX2XMxLQ
 SKBnFETJ1kqxg5/0YY1a9rCfciiDrSOZ1YgY74CEOh/JsJA+4fwx6ojV7uAdnGTg
 hjPhmwK3PjgjvoYcUEN7hIini2mSqyyw9+QynZ611HHV8dy2z4fG0xoubO2cUWsP
 e8JizBiUZWiVqj7UHXvLD7XkDFBJDXjD6iTopaZVz6ae4w4S9Dn3QroNvWshWmGC
 suukX3ZFASpeIJlftrrTzf1r8zvyfgGbS7sZ6ZwhIRx3wr1FFZw=
 =O3yC
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One regression fix.

  The rewrite of scrub code in 6.4 broke device replace in zoned mode,
  some of the writes could happen out of order so this had to be
  adjusted for all cases"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix dev-replace after the scrub rework
2023-06-02 17:16:19 -04:00
Ojaswin Mujoo
3582e74599 Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"
This reverts commit 32c0869370.

The reverted commit was intended to remove a dead check however it was observed
that this check was actually being used to exit early instead of looping
sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than
the goal extent. Due to this, a my performance tests (fsmark, parallel file
writes in a highly fragmented FS) were seeing a 2x-3x regression.

Example, the default value of the following variables is:

sbi->s_mb_max_to_scan = 200
sbi->s_mb_min_to_scan = 10

In ext4_mb_check_limits() if we find an extent smaller than goal, then we return
early and try again. This loop will go on until we have processed
sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and
just use whatever we have even if it is smaller than goal extent.

Now, the regression comes when we find an extent bigger than goal. Earlier, in
this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
the bigger extent. However with commit 32c08693 that check was removed and hence
we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
free extent to satisfy the request. The only time we would exit early would be
when the free extent is *exactly* the size of our goal, which is pretty uncommon
occurrence and so we would almost always end up looping 200 times.

Hence, revert the commit by adding the check back to fix the regression. Also
add a comment to outline this policy.

Fixes: 32c0869370 ("ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/ddcae9658e46880dfec2fb0aa61d01fb3353d202.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-06-02 14:47:29 -04:00
Linus Torvalds
a746ca666a nfsd-6.4 fixes:
- Two minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmR3f/AACgkQM2qzM29m
 f5dgCQ/+KiS7B1u8O8VrsAy3D2G309UZUf2QMr982pDimaXBR65ggKI5GWTjjwrW
 ob0P18prMetT6SGUAIJ3nlEsjDfs+CgKPq/zE2uqB0Sgv4ROmWoboygVsCk+JiTp
 xlystF9XBMSjqu23LazFE6+yTAMWcX7ddB5NrJCZr+yMffVZhN6onM38YtQNKh1F
 fdY+tY7NgEu24+60Heb8ZAmo2+mX8fSe+lVXuwV3h1HRKand1463NNSMGH0t805P
 nBdytIBuwWbOHv8mgL3FB042PSGiHsaL5Yevq42Je+7nN92olHS0ML2IBgcwzHsw
 Duc8E8MvElUIADXIfOg5SODjUJ8C5Avclyj7tbcnc9H5jdvolWshbJa8CkhTkvC2
 mdVTxDlOpWlBK4fLEtEJQzHldppGZqmnS85ppl4Ipjdu6s6WGyFs6kKXAyKGnsvc
 ydNFyhqv88BY6H3S681nLdTdwMFpxa8RAyJqXkcOBmbJd/8naDpDq7bi6GxNUvWs
 36ymLklJzWxKP/B0tayvqA+dTiKbs/480Xz37vXrcJigBeWpwyNQG2NUeGeqP2kd
 xvEiCZHYq0gmTjGKnxeyvzAeKAlzRKSo4BiI5tUYBOBhV0OHIozz7TMa+Br91sEl
 fNrjnr/d8XAxAVmJcWuSlqfggOVFS1D8FxYIhcbY1NV6zPWogu0=
 =2AMA
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Two minor bug fixes

* tag 'nfsd-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix double fget() bug in __write_ports_addfd()
  nfsd: make a copy of struct iattr before calling notify_change
2023-06-02 13:38:55 -04:00
Namjae Jeon
1c1bcf2d3e ksmbd: validate smb request protocol id
This patch add the validation for smb request protocol id.
If it is not one of the four ids(SMB1_PROTO_NUMBER, SMB2_PROTO_NUMBER,
SMB2_TRANSFORM_PROTO_NUM, SMB2_COMPRESSION_TRANSFORM_ID), don't allow
processing the request. And this will fix the following KASAN warning
also.

[   13.905265] BUG: KASAN: slab-out-of-bounds in init_smb2_rsp_hdr+0x1b9/0x1f0
[   13.905900] Read of size 16 at addr ffff888005fd2f34 by task kworker/0:2/44
...
[   13.908553] Call Trace:
[   13.908793]  <TASK>
[   13.908995]  dump_stack_lvl+0x33/0x50
[   13.909369]  print_report+0xcc/0x620
[   13.910870]  kasan_report+0xae/0xe0
[   13.911519]  kasan_check_range+0x35/0x1b0
[   13.911796]  init_smb2_rsp_hdr+0x1b9/0x1f0
[   13.912492]  handle_ksmbd_work+0xe5/0x820

Cc: stable@vger.kernel.org
Reported-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-02 12:30:57 -05:00
Namjae Jeon
368ba06881 ksmbd: check the validation of pdu_size in ksmbd_conn_handler_loop
The length field of netbios header must be greater than the SMB header
sizes(smb1 or smb2 header), otherwise the packet is an invalid SMB packet.

If `pdu_size` is 0, ksmbd allocates a 4 bytes chunk to `conn->request_buf`.
In the function `get_smb2_cmd_val` ksmbd will read cmd from
`rcv_hdr->Command`, which is `conn->request_buf + 12`, causing the KASAN
detector to print the following error message:

[    7.205018] BUG: KASAN: slab-out-of-bounds in get_smb2_cmd_val+0x45/0x60
[    7.205423] Read of size 2 at addr ffff8880062d8b50 by task ksmbd:42632/248
...
[    7.207125]  <TASK>
[    7.209191]  get_smb2_cmd_val+0x45/0x60
[    7.209426]  ksmbd_conn_enqueue_request+0x3a/0x100
[    7.209712]  ksmbd_server_process_request+0x72/0x160
[    7.210295]  ksmbd_conn_handler_loop+0x30c/0x550
[    7.212280]  kthread+0x160/0x190
[    7.212762]  ret_from_fork+0x1f/0x30
[    7.212981]  </TASK>

Cc: stable@vger.kernel.org
Reported-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-02 12:30:57 -05:00
Namjae Jeon
25933573ef ksmbd: fix posix_acls and acls dereferencing possible ERR_PTR()
Dan reported the following error message:

fs/smb/server/smbacl.c:1296 smb_check_perm_dacl()
    error: 'posix_acls' dereferencing possible ERR_PTR()
fs/smb/server/vfs.c:1323 ksmbd_vfs_make_xattr_posix_acl()
    error: 'posix_acls' dereferencing possible ERR_PTR()
fs/smb/server/vfs.c:1830 ksmbd_vfs_inherit_posix_acl()
    error: 'acls' dereferencing possible ERR_PTR()

__get_acl() returns a mix of error pointers and NULL. This change it
with IS_ERR_OR_NULL().

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-02 12:30:57 -05:00
Namjae Jeon
fc6c6a3c32 ksmbd: fix out-of-bound read in parse_lease_state()
This bug is in parse_lease_state, and it is caused by the missing check
of `struct create_context`. When the ksmbd traverses the create_contexts,
it doesn't check if the field of `NameOffset` and `Next` is valid,
The KASAN message is following:

[    6.664323] BUG: KASAN: slab-out-of-bounds in parse_lease_state+0x7d/0x280
[    6.664738] Read of size 2 at addr ffff888005c08988 by task kworker/0:3/103
...
[    6.666644] Call Trace:
[    6.666796]  <TASK>
[    6.666933]  dump_stack_lvl+0x33/0x50
[    6.667167]  print_report+0xcc/0x620
[    6.667903]  kasan_report+0xae/0xe0
[    6.668374]  kasan_check_range+0x35/0x1b0
[    6.668621]  parse_lease_state+0x7d/0x280
[    6.668868]  smb2_open+0xbe8/0x4420
[    6.675137]  handle_ksmbd_work+0x282/0x820

Use smb2_find_context_vals() to find smb2 create request lease context.
smb2_find_context_vals validate create context fields.

Cc: stable@vger.kernel.org
Reported-by: Chih-Yen Chang <cc85nod@gmail.com>
Tested-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-02 12:30:57 -05:00
Namjae Jeon
f1a411873c ksmbd: fix out-of-bound read in deassemble_neg_contexts()
The check in the beginning is
`clen + sizeof(struct smb2_neg_context) <= len_of_ctxts`,
but in the end of loop, `len_of_ctxts` will subtract
`((clen + 7) & ~0x7) + sizeof(struct smb2_neg_context)`, which causes
integer underflow when clen does the 8 alignment. We should use
`(clen + 7) & ~0x7` in the check to avoid underflow from happening.

Then there are some variables that need to be declared unsigned
instead of signed.

[   11.671070] BUG: KASAN: slab-out-of-bounds in smb2_handle_negotiate+0x799/0x1610
[   11.671533] Read of size 2 at addr ffff888005e86cf2 by task kworker/0:0/7
...
[   11.673383] Call Trace:
[   11.673541]  <TASK>
[   11.673679]  dump_stack_lvl+0x33/0x50
[   11.673913]  print_report+0xcc/0x620
[   11.674671]  kasan_report+0xae/0xe0
[   11.675171]  kasan_check_range+0x35/0x1b0
[   11.675412]  smb2_handle_negotiate+0x799/0x1610
[   11.676217]  ksmbd_smb_negotiate_common+0x526/0x770
[   11.676795]  handle_ksmbd_work+0x274/0x810
...

Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Tested-by: Chih-Yen Chang <cc85nod@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-06-02 12:30:57 -05:00
Jan Kara
28eceeda13
fs: Lock moved directories
When a directory is moved to a different directory, some filesystems
(udf, ext4, ocfs2, f2fs, and likely gfs2, reiserfs, and others) need to
update their pointer to the parent and this must not race with other
operations on the directory. Lock the directories when they are moved.
Although not all filesystems need this locking, we perform it in
vfs_rename() because getting the lock ordering right is really difficult
and we don't want to expose these locking details to filesystems.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-5-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 15:00:18 +02:00
Jan Kara
f23ce75718
fs: Establish locking order for unrelated directories
Currently the locking order of inode locks for directories that are not
in ancestor relationship is not defined because all operations that
needed to lock two directories like this were serialized by
sb->s_vfs_rename_mutex. However some filesystems need to lock two
subdirectories for RENAME_EXCHANGE operations and for this we need the
locking order established even for two tree-unrelated directories.
Provide a helper function lock_two_inodes() that establishes lock
ordering for any two inodes and use it in lock_two_directories().

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-4-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 15:00:17 +02:00
Jan Kara
cde3c9d7e2
Revert "f2fs: fix potential corruption when moving a directory"
This reverts commit d94772154e. The
locking is going to be provided by VFS.

CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-3-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 14:55:32 +02:00
Jan Kara
7517ce5dc4
Revert "udf: Protect rename against modification of moved directory"
This reverts commit f950fd0529. The
locking is going to be provided by vfs_rename() in the following
patches.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-2-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 14:55:32 +02:00
Jan Kara
3658840cd3
ext4: Remove ext4 locking of moved directory
Remove locking of moved directory in ext4_rename2(). We will take care
of it in VFS instead. This effectively reverts commit 0813299c58
("ext4: Fix possible corruption when moving a directory") and followup
fixes.

CC: Ted Tso <tytso@mit.edu>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230601105830.13168-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 14:55:32 +02:00
Kees Cook
820eb59da8 jfs: Use unsigned variable for length calculations
To avoid confusing the compiler about possible negative sizes, switch
"ssize" which can never be negative from int to u32.  Seen with GCC 13:

../fs/jfs/namei.c: In function 'jfs_symlink': ../include/linux/fortify-string.h:57:33: warning: '__builtin_memcpy' pointer overflow between offset 0 and size [-2147483648, -1]
[-Warray-bounds=]
   57 | #define __underlying_memcpy     __builtin_memcpy
      |                                 ^
...
../fs/jfs/namei.c:950:17: note: in expansion of macro 'memcpy'
  950 |                 memcpy(ip->i_link, name, ssize);
      |                 ^~~~~~

Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Jeff Xu <jeffxu@chromium.org>
Message-Id: <20230204183355.never.877-kees@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-02 10:30:09 +02:00
Mike Christie
f9010dbdce fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process.  2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
<michael.christie@oracle.com> which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d50 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Co-developed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-01 17:15:33 -04:00
Johannes Thumshirn
c2478469f2 fs: iomap: use bio_add_folio_nofail where possible
When the iomap buffered-io code can't add a folio to a bio, it allocates a
new bio and adds the folio to that one. This is done using bio_add_folio(),
but doesn't check for errors.

As adding a folio to a newly created bio can't fail, use the newly
introduced bio_add_folio_nofail() function.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/58fa893c24c67340a63323f09a179fefdca07f2a.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01 09:13:31 -06:00
Qu Wenruo
b675df0257 btrfs: zoned: fix dev-replace after the scrub rework
[BUG]
After commit e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.

Even an empty zoned btrfs cannot be replaced:

  # mkfs.btrfs -f /dev/nvme0n1
  # mount /dev/nvme0n1 /mnt/btrfs
  # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
  Resetting device zones /dev/nvme1n1 (160 zones) ...
  ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error

And we can hit kernel crash related to that:

  BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
  BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
  nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
  I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
  BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
  BUG: kernel NULL pointer dereference, address: 00000000000000a8
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
  RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
  Call Trace:
   <IRQ>
   btrfs_lookup_ordered_extent+0x31/0x190
   btrfs_record_physical_zoned+0x18/0x40
   btrfs_simple_end_io+0xaf/0xc0
   blk_update_request+0x153/0x4c0
   blk_mq_end_request+0x15/0xd0
   nvme_poll_cq+0x1d3/0x360
   nvme_irq+0x39/0x80
   __handle_irq_event_percpu+0x3b/0x190
   handle_irq_event+0x2f/0x70
   handle_edge_irq+0x7c/0x210
   __common_interrupt+0x34/0xa0
   common_interrupt+0x7d/0xa0
   </IRQ>
   <TASK>
   asm_common_interrupt+0x22/0x40

[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.

And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.

However we have several different bugs all related to zoned dev-replace:

- We are using ZONE_APPEND operation for metadata style write back
  For zoned devices, btrfs has two ways to write data:

  * ZONE_APPEND for data
    This allows higher queue depth, but will not be able to know where
    the write would land.
    Thus needs to grab the real on-disk physical location in it's endio.

  * WRITE for metadata
    This requires single queue depth (new writes can only be submitted
    after previous one finished), and all writes must be sequential.

  For scrub, we go single queue depth, but still goes with ZONE_APPEND,
  which requires btrfs_bio::inode being populated.
  This is the cause of that crash.

- No correct tracing of write_pointer
  After a write finished, we should forward sctx->write_pointer, or
  fill_writer_pointer_gap() would not work properly and cause more
  than necessary zero out, and fill the whole zone prematurely.

- Incorrect physical bytenr passed to fill_writer_pointer_gap()
  In scrub_write_sectors(), one call site passes logical address, which
  is completely wrong.

  The other call site passes physical address of current sector, but
  we should pass the physical address of the btrfs_bio we're submitting.

  This is the cause of the -EIO errors.

[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().

- Manually forward sctx->write_pointer after successful writeback

- Use the physical address of the to-be-submitted btrfs_bio for
  fill_writer_pointer_gap()

Now zoned device replace would work as expected.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-01 15:12:02 +02:00
Andreas Gruenbacher
fa58cc888d gfs2: Don't get stuck writing page onto itself under direct I/O
When a direct I/O write is performed, iomap_dio_rw() invalidates the
part of the page cache which the write is going to before carrying out
the write.  In the odd case, the direct I/O write will be reading from
the same page it is writing to.  gfs2 carries out writes with page
faults disabled, so it should have been obvious that this page
invalidation can cause iomap_dio_rw() to never make any progress.
Currently, gfs2 will end up in an endless retry loop in
gfs2_file_direct_write() instead, though.

Break this endless loop by limiting the number of retries and falling
back to buffered I/O after that.

Also simplify should_fault_in_pages() sightly and add a comment to make
the above case easier to understand.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-06-01 14:55:43 +02:00
Linus Torvalds
8828003759 eight server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmR4QoQACgkQiiy9cAdy
 T1HSBQwAqdgCLV+JAWruePcwF2ChHAegX8Ks0ogG7gK7yvjz262MjSqC//JW9YBb
 GLwShjMvVNclF4CET6NX+UCBMdwNu5YgewMZiBRKx6ewPPQteXjvS/9rDQ/SrI8A
 y9llSwZfLMHQoRyG1ziuZ7v2QL5jFhLZ5SXaS9YQcbEb9iWFcJeJNUc33UqP/qBC
 BPzT41wPxdF33tA/DEku33QJHfQWpR/J0W10kzOOFLMJHuDMicTLbKZpyCmIv6W1
 nYn+82VOgjB7+u+RkeOuWmc5+mdfNZbcmw7bjApZayZXsSYt0DEqJVwkYt45IzkM
 Qilri1FOFNkmo835Svz7LHcvXeRSB3nNSyaYBQroXPsgiZxc9M+o3M+noPvHHVuI
 v78v65jHwQFEjjI2nnDWf4zSkfxhKxRiqKi6a6+2SGBrIgDygROQe8Ky9mgR7MXi
 yTpe8vJHQgywbs8ynIARJZ0jdvokUhu6TaJgUY4ITWKhtntc+Hd2YieVa7K2Hn8O
 n+CO9xdk
 =dtcd
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc4-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Eight server fixes (most also for stable):

   - Two fixes for uninitialized pointer reads (rename and link)

   - Fix potential UAF in oplock break

   - Two fixes for potential out of bound reads in negotiate

   - Fix crediting bug

   - Two fixes for xfstests (allocation size fix for test 694 and lookup
     issue shown by test 464)"

* tag '6.4-rc4-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: call putname after using the last component
  ksmbd: fix incorrect AllocationSize set in smb2_get_info
  ksmbd: fix UAF issue from opinfo->conn
  ksmbd: fix multiple out-of-bounds read during context decoding
  ksmbd: fix slab-out-of-bounds read in smb2_handle_negotiate
  ksmbd: fix credit count leakage
  ksmbd: fix uninitialized pointer read in smb2_create_link()
  ksmbd: fix uninitialized pointer read in ksmbd_vfs_rename()
2023-06-01 08:27:34 -04:00
Prince Kumar Maurya
ea2b62f305 fs/sysv: Null check to prevent null-ptr-deref bug
sb_getblk(inode->i_sb, parent) return a null ptr and taking lock on
that leads to the null-ptr-deref bug.

Reported-by: syzbot+aad58150cbc64ba41bdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=aad58150cbc64ba41bdc
Signed-off-by: Prince Kumar Maurya <princekumarmaurya06@gmail.com>
Message-Id: <20230531013141.19487-1-princekumarmaurya06@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-01 09:51:06 +02:00
Linus Torvalds
929ed21dfd four small smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmR3nmkACgkQiiy9cAdy
 T1ELQwv/c8V2oHalMYKQAdrQ9CpsXnEmdBoiv5QGfWpczkNOLy2OO6zRyauVd9gW
 yxT3YQvemkQbrIwo4IBTsmE/2o0Z3rJYZfgFffNev9yg8QgCMFrHbKQ74Cm7zQcZ
 La8UQTjC0AdcvMcTKCw3Y8ZMvu/V+6caK679kAnplhQ92uw3cUrOkc+y9nGeVsyp
 CLLtgwJWMD/h30kBwSzeQR8AL7bNKCgxUAnOW2Q/8pOLVcbujExgYOxXSzEM8KxA
 XWbIbB5H+9FJRTR4nq2yjHVFJiG61Dt07FaFQaXQuDb8EbyrG8SHJTocYnjRSOUj
 PirKqkF8GxwSqfMNRhQnIT4VMvfyGRk25E/Snwic96NrSyDL7a4jmktDrG/5Fo6a
 GfEdxpxFvjOMg3kf6MuMkd8SIH0o+drRRXqW23YyqY30mg/fNYJhHg9opizbJYip
 IR4MMLCFIiH1IYGugzGOm6t9Wb6EEHuw6GkJixGoWuM/bnSxdkEVAKHT4F8A5CFU
 m4FFrWtX
 =UVOh
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Four small smb3 client fixes:

   - two small fixes suggested by kernel test robot

   - small cleanup fix

   - update Paulo's email address in the maintainer file"

* tag '6.4-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: address unused variable warning
  smb: delete an unnecessary statement
  smb3: missing null check in SMB2_change_notify
  smb3: update a reviewer email in MAINTAINERS file
2023-05-31 19:24:01 -04:00
Linus Torvalds
fd2186d1c7 Fix two regressions in ext4 and a number of issues reported by syzbot.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmR3asIACgkQ8vlZVpUN
 gaObtQgAoPD1ytW5rfDHIK3s7SXImETH/viqXshlPLF9VcJBO4c5Cj1smfgG1o5v
 7iitA3KqT/WZ1sp1ileP0hErYYhAaQRwTzSvEwB7DYJp9KdQu6Gqa8TYL1WoXbih
 tNJAeDJaZtPPaeaMQsBdR8cmKeGVVo7ErrcFU6759Ew2g6+UbO2eKjRlhx5xA35q
 fszI3A5aqz5TYrbap6tnlzovTDbuJo8O8dNXUp07iPwA2Lllf+v54Anl/iN85rR9
 +TbrTuvs4rA1Gbd42coro902Hsg3HmslTI6ajnz5oDPYF9/M3KYRYEzP/ufVzh6p
 dmQ3rwYTPSPc1rrLGmXh3YeYIpuJAA==
 =71Qb
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix two regressions in ext4 and a number of issues reported by syzbot"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable the lazy init thread when remounting read/write
  ext4: fix fsync for non-directories
  ext4: add lockdep annotations for i_data_sem for ea_inode's
  ext4: disallow ea_inodes with extended attributes
  ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()
  ext4: add EA_INODE checking to ext4_iget()
2023-05-31 14:06:01 -04:00
Muchun Song
30480b988f kernfs: fix missing kernfs_idr_lock to remove an ID from the IDR
The root->ino_idr is supposed to be protected by kernfs_idr_lock, fix
it.

Fixes: 488dee96bb ("kernfs: allow creating kernfs objects with arbitrary uid/gid")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230523024017.24851-1-songmuchun@bytedance.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-31 19:03:49 +01:00
Ivan Orlov
266bff7345 debugfs: Correct the 'debugfs_create_str' docs
The documentation of the 'debugfs_create_str' says that the function
returns a pointer to a dentry created, or an ERR_PTR in case of error.
Actually, this is not true: this function doesn't return anything at all.
Correct the documentation correspondingly.

Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Link: https://lore.kernel.org/r/20230514172353.52878-1-ivan.orlov0322@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-31 19:02:14 +01:00
Johannes Thumshirn
0fa5b08cf6 zonefs: use __bio_add_page for adding single page to bio
The zonefs superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/04c9978ccaa0fc9871cd4248356638d98daccf0c.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-31 09:50:02 -06:00
Johannes Thumshirn
effa7ddeeb gfs2: use __bio_add_page for adding single page to bio
The GFS2 superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/087c67d4e4973f949d3519c1e4822784ce583c5a.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-31 09:50:02 -06:00
Johannes Thumshirn
2896db174c jfs: logmgr: use __bio_add_page to add single page to bio
The JFS IO code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/9fb5ed86d19f6e0b6f64dfc4109a48ff8ff24497.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-31 09:50:02 -06:00
Johannes Thumshirn
741af75d40 fs: buffer: use __bio_add_page to add single page to bio
The buffer_head submission code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/84ff2dcbe81b258a73ad900adb5266e208b61a4d.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-31 09:50:02 -06:00
David Howells
1ccf164ec8 block: Use iov_iter_extract_pages() and page pinning in direct-io.c
Change the old block-based direct-I/O code to use iov_iter_extract_pages()
to pin user pages or leave kernel pages unpinned rather than taking refs
when submitting bios.

This makes use of the preceding patches to not take pins on the zero page
(thereby allowing insertion of zero pages in with pinned pages) and to get
additional pins on pages, allowing an extracted page to be used in multiple
bios without having to re-extract it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230526214142.958751-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-31 09:48:15 -06:00
Arnd Bergmann
3b1ddbb62e This is an attempt to harden the typing on virt_to_pfn()
and pfn_to_virt().
 
 Making virt_to_pfn() a static inline taking a strongly typed
 (const void *) makes the contract of a passing a pointer of that
 type to the function explicit and exposes any misuse of the
 macro virt_to_pfn() acting polymorphic and accepting many types
 such as (void *), (unitptr_t) or (unsigned long) as arguments
 without warnings.
 
 For symmetry, we do the same with pfn_to_virt().
 
 The problem with this inconsistent typing was pointed out by
 Russell King:
 https://lore.kernel.org/linux-arm-kernel/YoJDKJXc0MJ2QZTb@shell.armlinux.org.uk/
 
 And confirmed by Andrew Morton:
 https://lore.kernel.org/linux-mm/20220701160004.2ffff4e5ab59a55499f4c736@linux-foundation.org/
 
 So the recognition of the problem is widespread.
 
 These platforms have been chosen as initial conversion targets:
 
 - ARM
 - ARM64/Aarch64
 - asm-generic (including for example x86)
 - m68k
 
 The idea is that if this goes in, it will block further misuse
 of the function signatures due to the large compile coverage,
 and then I can go in and fix the remaining architectures on a
 one-by-one basis.
 
 Some of the patches have been circulated before but were not
 picked up by subsystem maintainers, so now the arch tree is
 target for this series.
 
 It has passed zeroday builds after a lot of iterations in my
 personal tree, but there could be some randconfig outliers.
 New added or deeply hidden problems appear all the time so
 some minor fallout can be expected.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmR0jVQACgkQQRCzN7AZ
 XXMujhAAmkp8mWsDpPqLw11HyTYOZgMOr6pil3uo2/X+x5fFgnJBL+aLN1zvriDK
 j2ywiN7dmAU+veIz3bYJ89e1uiMHSOCrzagSEF1fNDA4tsmp7zfOY9I1LhRK8JVd
 Wf+RplXDxcn3vwGXW5ycv5pwnDEOeQ7AESg18++SW+alRXItqybZijmpHgZrBAEN
 Ms2C6ZeW3zmL6gUpFkL1z8ghVgH46cCysbLKcDvC/fk4M26hzuc1X1Rl0s9BTyxV
 nZZDZsXwKMVmFtIDHMJFysFGRfsP6QbdHtAFuVVqJa8P01Oe3PgbcFEMQEVVJQLf
 u0Y2o1jVk7mCPw5NpI7S2g9293imqU4LgzTmJKnEqmznc8hLD3hWLzz5p07tw9dA
 KaZqzv7zzXWxQmaCW2k0yQRkyjapPo+UwbgSNfj4nyH54Tp3dBfActm3cYdB6WBl
 8ED48c4kFEJkRZ5YW6QDwcQDrZEVp1ZzcDf30t2pMUDFQh5F7/KCKALbkL/dVxuc
 INZDQMHqpQ5P1EmkX+HVyt9syxg5TnQuU+J0vVQBbR8RsdJ+Pap/OBb4EjcgbP6+
 DXS+ApbFwEw6m+jQZ4bnNJcdJfDmax46tmHfADIuUEXxF4I5s33hEz5ca+5xN4Ko
 C4gukN94T6GobcaSYZf035O4QURoDF53oEVfpqsoqsiREyBddFI=
 =7T6L
 -----END PGP SIGNATURE-----

Merge tag 'virt-to-pfn-for-arch-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator into asm-generic

This is an attempt to harden the typing on virt_to_pfn()
and pfn_to_virt().

Making virt_to_pfn() a static inline taking a strongly typed
(const void *) makes the contract of a passing a pointer of that
type to the function explicit and exposes any misuse of the
macro virt_to_pfn() acting polymorphic and accepting many types
such as (void *), (unitptr_t) or (unsigned long) as arguments
without warnings.

For symmetry, we do the same with pfn_to_virt().

The problem with this inconsistent typing was pointed out by
Russell King:
https://lore.kernel.org/linux-arm-kernel/YoJDKJXc0MJ2QZTb@shell.armlinux.org.uk/

And confirmed by Andrew Morton:
https://lore.kernel.org/linux-mm/20220701160004.2ffff4e5ab59a55499f4c736@linux-foundation.org/

So the recognition of the problem is widespread.

These platforms have been chosen as initial conversion targets:

- ARM
- ARM64/Aarch64
- asm-generic (including for example x86)
- m68k

The idea is that if this goes in, it will block further misuse
of the function signatures due to the large compile coverage,
and then I can go in and fix the remaining architectures on a
one-by-one basis.

Some of the patches have been circulated before but were not
picked up by subsystem maintainers, so now the arch tree is
target for this series.

It has passed zeroday builds after a lot of iterations in my
personal tree, but there could be some randconfig outliers.
New added or deeply hidden problems appear all the time so
some minor fallout can be expected.

* tag 'virt-to-pfn-for-arch-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator:
  m68k/mm: Make pfn accessors static inlines
  arm64: memory: Make virt_to_pfn() a static inline
  ARM: mm: Make virt_to_pfn() a static inline
  asm-generic/page.h: Make pfn accessors static inlines
  xen/netback: Pass (void *) to virt_to_page()
  netfs: Pass a pointer to virt_to_page()
  cifs: Pass a pointer to virt_to_page() in cifsglob
  cifs: Pass a pointer to virt_to_page()
  riscv: mm: init: Pass a pointer to virt_to_page()
  ARC: init: Pass a pointer to virt_to_pfn() in init
  m68k: Pass a pointer to virt_to_pfn() virt_to_page()
  fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
2023-05-31 16:33:56 +02:00
Dan Carpenter
c034203b6a nfsd: fix double fget() bug in __write_ports_addfd()
The bug here is that you cannot rely on getting the same socket
from multiple calls to fget() because userspace can influence
that.  This is a kind of double fetch bug.

The fix is to delete the svc_alien_sock() function and instead do
the checking inside the svc_addsock() function.

Fixes: 3064639423 ("nfsd: check passed socket's net matches NFSd superblock's one")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-31 09:57:14 -04:00
Azeem Shaikh
7391928025 befs: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated.
In an effort to remove strlcpy() completely, replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230509014136.2095900-1-azeemshaikh38@gmail.com
2023-05-30 16:42:00 -07:00
Christophe JAILLET
36650a357e binfmt: Slightly simplify elf_fdpic_map_file()
There is no point in initializing 'load_addr' and 'seg' here, they are both
re-written just before being used below.

Doing so, 'load_addr' can be moved in the #ifdef CONFIG_MMU section.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/4f5e4096ad7f17716e924b5bd080e5709fc0b84b.1685290790.git.christophe.jaillet@wanadoo.fr
2023-05-30 15:49:46 -07:00
Christophe JAILLET
e6302d5a28 binfmt: Use struct_size()
Use struct_size() instead of hand-writing it. It is less verbose, more
robust and more informative.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/53150beae5dc04dac513dba391a2e4ae8696a7f3.1685290790.git.christophe.jaillet@wanadoo.fr
2023-05-30 15:49:46 -07:00
Linus Torvalds
48b1320a67 for-6.4-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmR2TDwACgkQxWXV+ddt
 WDsMvQ/+KgUXW+Liu5BaOyD5UzPL4BgHWiPTmJyRpsWTkGm8LE/yRCRoxqp1XbU+
 nOjQpjkxI+ziRgKpDTAGFK/w51TV9ECM5wyZiXx93TO6iaTOuYCtSnSsWylzEC1H
 q9I3znLJSWrnBPTktwTZ29rvKvXj1k3th8ypyI9ho7N+3H0Uzt2VIPxrH2oVXZNz
 f2vkjSX9pKGN5zxM2ahd3Nde4Ma6yAlJLD+pnlYK20zH/30cAXdJsUCsUqQLXDL1
 sUR++Br7qym3Wqn9Qa5R71IPJ1FieW2NaHgAz4dBBFfqe5PR7YCGL/Md6G+CFJ1E
 qLLFOWpELpqkeQdvivBnMZWqgpw+54Pdfuqxg7VylEmUc1y6CK4ab5XctpXIf75h
 6bK0RPZ7D9jZl6JukkWftoS4XnW2cseyEfHneDMZDty4v1bxwR6g7i4ZTym413Gx
 Td1Z+G6BN5O5ih0Pc0CgSS3QnndWTUl3LAHiuxRErrK4dxpeuQlDTGWWY7YVyRPJ
 O9yC24GbHyWYBYHtNACEn6/GlXQjtswhjlHxqONmQfnstZL7Fz8si9EQEOWwssJE
 PIlb022a1mvR42yHr64TE0SzpDZbMY8mnULAsSrWgPXh3IAt1ztUuJajcFs84MZr
 qWewi4F/3wDAB0m1lUbAOmeBbpAw5gSGHhwBrjdK3EWJr2kxQ50=
 =viyP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "One bug fix and two build warning fixes:

   - call proper end bio callback for metadata RAID0 in a rare case of
     an unaligned block

   - fix uninitialized variable (reported by gcc 10.2)

   - fix warning about potential access beyond array bounds on mips64
     with 64k pages (runtime check would not allow that)"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
  btrfs: fix an uninitialized variable warning in btrfs_log_inode
  btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
2023-05-30 17:23:50 -04:00
Theodore Ts'o
eb1f822c76 ext4: enable the lazy init thread when remounting read/write
In commit a44be64bbe ("ext4: don't clear SB_RDONLY when remounting
r/w until quota is re-enabled") we defer clearing tyhe SB_RDONLY flag
in struct super.  However, we didn't defer when we checked sb_rdonly()
to determine the lazy itable init thread should be enabled, with the
next result that the lazy inode table initialization would not be
properly started.  This can cause generic/231 to fail in ext4's
nojournal mode.

Fix this by moving when we decide to start or stop the lazy itable
init thread to after we clear the SB_RDONLY flag when we are
remounting the file system read/write.

Fixes a44be64bbe ("ext4: don't clear SB_RDONLY when remounting r/w until...")

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230527035729.1001605-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-30 15:33:57 -04:00
Jan Kara
1077b2d53e ext4: fix fsync for non-directories
Commit e360c6ed72 ("ext4: Drop special handling of journalled data
from ext4_sync_file()") simplified ext4_sync_file() by dropping special
handling of journalled data mode as it was not needed anymore. However
that branch was also used for directories and symlinks and since the
fastcommit code does not track metadata changes to non-regular files, the
change has caused e.g. fsync(2) on directories to not commit transaction
as it should. Fix the problem by adding handling for non-regular files.

Fixes: e360c6ed72 ("ext4: Drop special handling of journalled data from ext4_sync_file()")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/all/ZFqO3xVnmhL7zv1x@debian-BULLSEYE-live-builder-AMD64
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20230524104453.8734-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-30 15:33:57 -04:00
Theodore Ts'o
aff3bea953 ext4: add lockdep annotations for i_data_sem for ea_inode's
Treat i_data_sem for ea_inodes as being in their own lockdep class to
avoid lockdep complaints about ext4_setattr's use of inode_lock() on
normal inodes potentially causing lock ordering with i_data_sem on
ea_inodes in ext4_xattr_inode_write().  However, ea_inodes will be
operated on by ext4_setattr(), so this isn't a problem.

Cc: stable@kernel.org
Link: https://syzkaller.appspot.com/bug?extid=298c5d8fb4a128bc27b0
Reported-by: syzbot+298c5d8fb4a128bc27b0@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-5-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-30 15:33:57 -04:00
Theodore Ts'o
2bc7e7c1a3 ext4: disallow ea_inodes with extended attributes
An ea_inode stores the value of an extended attribute; it can not have
extended attributes itself, or this will cause recursive nightmares.
Add a check in ext4_iget() to make sure this is the case.

Cc: stable@kernel.org
Reported-by: syzbot+e44749b6ba4d0434cd47@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-4-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-30 15:33:57 -04:00
Theodore Ts'o
b928dfdcb2 ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()
If the ea_inode has been pushed out of the inode cache while there is
still a reference in the mb_cache, the lockdep subclass will not be
set on the inode, which can lead to some lockdep false positives.

Fixes: 33d201e027 ("ext4: fix lockdep warning about recursive inode locking")
Cc: stable@kernel.org
Reported-by: syzbot+d4b971e744b1f5439336@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-3-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-30 15:33:50 -04:00
Bagas Sanjaya
aac2fa2013 fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier
Replace license boilerplate in udftime.c with SPDX identifier for
LGPL-2.0.

Cc: Paul Eggert <eggert@twinsun.com>
Cc: Richard Fontana <rfontana@redhat.com>
Cc: Pali Rohár <pali@kernel.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230522005434.22133-3-bagasdotme@gmail.com>
2023-05-30 15:40:20 +02:00
Bagas Sanjaya
5ce345541e fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier
The notice refers to full GPL 2.0 text on now defunct MIT FTP site [1].
Replace it with appropriate SPDX license identifier.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Pali Rohár <pali@kernel.org>
Link: https://web.archive.org/web/20020809115410/ftp://prep.ai.mit.edu/pub/gnu/GPL [1]
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230522005434.22133-2-bagasdotme@gmail.com>
2023-05-30 15:39:13 +02:00
Jan Kara
576215cffd fs: Drop wait_unfrozen wait queue
wait_unfrozen waitqueue is used only in quota code to wait for
filesystem to become unfrozen. In that place we can just use
sb_start_write() - sb_end_write() pair to achieve the same. So just
remove the waitqueue.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230525141710.7595-1-jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-30 15:36:40 +02:00
Gao Xiang
7b4e372c36 erofs: adapt managed inode operations into folios
This patch gets rid of erofs_try_to_free_cached_page() and fold it
into .release_folio().

It also moves managed inode operations into zdata.c, which simplifies
the code a bit.  No logic changes.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-5-hsiangkao@linux.alibaba.com
2023-05-29 23:06:03 +08:00
Gao Xiang
967c28b23f erofs: kill hooked chains to avoid loops on deduplicated compressed images
After heavily stressing EROFS with several images which include a
hand-crafted image of repeated patterns for more than 46 days, I found
two chains could be linked with each other almost simultaneously and
form a loop so that the entire loop won't be submitted.  As a
consequence, the corresponding file pages will remain locked forever.

It can be _only_ observed on data-deduplicated compressed images.
For example, consider two chains with five pclusters in total:
	Chain 1:  2->3->4->5    -- The tail pcluster is 5;
        Chain 2:  5->1->2       -- The tail pcluster is 2.

Chain 2 could link to Chain 1 with pcluster 5; and Chain 1 could link
to Chain 2 at the same time with pcluster 2.

Since hooked chains are all linked locklessly now, I have no idea how
to simply avoid the race.  Instead, let's avoid hooked chains completely
until I could work out a proper way to fix this and end users finally
tell us that it's needed to add it back.

Actually, this optimization can be found with multi-threaded workloads
(especially even more often on deduplicated compressed images), yet I'm
not sure about the overall system impacts of not having this compared
with implementation complexity.

Fixes: 267f2492c8 ("erofs: introduce multi-reference pclusters (fully-referenced)")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-4-hsiangkao@linux.alibaba.com
2023-05-29 23:05:43 +08:00
Gao Xiang
6ab5eed600 erofs: avoid on-stack pagepool directly passed by arguments
On-stack pagepool is used so that short-lived temporary pages could be
shared within a single I/O request (e.g. among multiple pclusters).

Moving the remaining frontend-related uses into
z_erofs_decompress_frontend to avoid too many arguments.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-3-hsiangkao@linux.alibaba.com
2023-05-29 23:05:25 +08:00
Gao Xiang
05b63d2beb erofs: allocate extra bvec pages directly instead of retrying
If non-bootstrap bvecs cannot be kept in place (very rarely), an extra
short-lived page is allocated.

Let's just allocate it immediately rather than do unnecessary -EAGAIN
return first and retry as a cleanup.  Also it's unnecessary to use
__GFP_NOFAIL here since we could gracefully fail out this case instead.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Link: https://lore.kernel.org/r/20230526201459.128169-2-hsiangkao@linux.alibaba.com
2023-05-29 23:04:45 +08:00
Linus Walleij
ee5971613d netfs: Pass a pointer to virt_to_page()
Like the other calls in this function virt_to_page() expects
a pointer, not an integer.

However since many architectures implement virt_to_pfn() as
a macro, this function becomes polymorphic and accepts both a
(unsigned long) and a (void *).

Fix this up with an explicit cast.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-29 11:27:08 +02:00
Linus Walleij
605a97e839 cifs: Pass a pointer to virt_to_page() in cifsglob
Like the other calls in this function virt_to_page() expects
a pointer, not an integer.

However since many architectures implement virt_to_pfn() as
a macro, this function becomes polymorphic and accepts both a
(unsigned long) and a (void *).

Fix this up with an explicit cast.

Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-29 11:27:08 +02:00
Linus Walleij
1db3af8ed8 cifs: Pass a pointer to virt_to_page()
Like the other calls in this function virt_to_page() expects
a pointer, not an integer.

However since many architectures implement virt_to_pfn() as
a macro, this function becomes polymorphic and accepts both a
(unsigned long) and a (void *).

Fix this up with an explicit cast.

Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-29 11:27:08 +02:00
Linus Walleij
9b2d38b4e4 fs/proc/kcore.c: Pass a pointer to virt_addr_valid()
The virt_addr_valid() should be passed a pointer, the current
code passing a long unsigned int is just exploiting the
unintentional polymorphism of these calls being implemented
as preprocessor macros.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-29 11:27:07 +02:00
Al Viro
b8b9e8b35d ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore
... and that's how it should've been done in the first place

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:31 +02:00
Al Viro
dae42837ba ext2_{set_link,delete_entry}(): don't bother with page_addr
ext2_set_link() simply doesn't use it anymore and ext2_delete_entry()
can easily obtain it from the directory entry pointer...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:28 +02:00
Al Viro
91f646fb97 ext2_put_page(): accept any pointer within the page
eliminates the need to keep the pointer to the first byte within
the page if we are guaranteed to have pointers to some byte
in the same page at hand.

Don't backport without commit 88d7b12068 ("highmem: round down the
address passed to kunmap_flush_on_unmap()").

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:25 +02:00
Al Viro
46022375ab ext2_get_page(): saner type
We need to pass to caller both the page reference and pointer to the
first byte in the now-mapped page.  The former always has the same type,
the latter varies from caller to caller.  So make it
	void *ext2_get_page(...., struct page **page)
rather than
	struct page *ext2_get_page(..., void **page_addr)
and avoid the casts...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:22 +02:00
Al Viro
8600839269 ext2: use offset_in_page() instead of open-coding it as subtraction
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:19 +02:00
Al Viro
8f1dca19b1 ext2_rename(): set_link and delete_entry may fail
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Tested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-29 11:03:06 +02:00
Yue Hu
796e9149a2 erofs: clean up z_erofs_pcluster_readmore()
`end` parameter is no needed since it's pointless for !backmost, we can
handle it with backmost internally.  And we only expand the trailing
edge, so the newstart can be replaced with ->headoffset.

Also, remove linux/prefetch.h inclusion since that is not used anymore
after commit 386292919c ("erofs: introduce readmore decompression
strategy").

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230525072605.17857-1-zbestahu@gmail.com
[ Gao Xiang: update commit description. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-29 15:54:51 +08:00
Yue Hu
ef4b4b46c6 erofs: remove the member readahead from struct z_erofs_decompress_frontend
The struct member is only used to add REQ_RAHEAD during I/O submission.
So it is cleaner to pass it as a parameter than keep it in the struct.

Also, rename function z_erofs_get_sync_decompress_policy() to
z_erofs_is_sync_decompress() for better clarity and conciseness.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230524063944.1655-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-29 15:54:50 +08:00
Yue Hu
597e2953ae erofs: fold in z_erofs_decompress()
No need this helper since it's just a simple wrapper for decompress
method and only one caller.  So, let's fold in directly instead.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230426084449.12781-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-29 15:54:49 +08:00
Theodore Ts'o
b3e6bcb945 ext4: add EA_INODE checking to ext4_iget()
Add a new flag, EXT4_IGET_EA_INODE which indicates whether the inode
is expected to have the EA_INODE flag or not.  If the flag is not
set/clear as expected, then fail the iget() operation and mark the
file system as corrupted.

This commit also makes the ext4_iget() always perform the
is_bad_inode() check even when the inode is already inode cache.  This
allows us to remove the is_bad_inode() check from the callers of
ext4_iget() in the ea_inode code.

Reported-by: syzbot+cbb68193bdb95af4340a@syzkaller.appspotmail.com
Reported-by: syzbot+62120febbd1ee3c3c860@syzkaller.appspotmail.com
Reported-by: syzbot+edce54daffee36421b4c@syzkaller.appspotmail.com
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20230524034951.779531-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-28 14:18:03 -04:00
Steve French
fdd7d1fff4 cifs: address unused variable warning
Fix trivial unused variable warning (when SMB1 support disabled)

"ioctl.c:324:17: warning: variable 'caps' set but not used [-Wunused-but-set-variable]"

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202305250056.oZhsJmdD-lkp@intel.com/
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-27 03:33:23 -05:00
Dan Carpenter
396ac4c982 smb: delete an unnecessary statement
We don't need to set the list iterators to NULL before a
list_for_each_entry() loop because they are assigned inside the
macro.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 21:07:16 -05:00
Namjae Jeon
6fe55c2799 ksmbd: call putname after using the last component
last component point filename struct. Currently putname is called after
vfs_path_parent_lookup(). And then last component is used for
lookup_one_qstr_excl(). name in last component is freed by previous
calling putname(). And It cause file lookup failure when testing
generic/464 test of xfstest.

Fixes: 74d7970feb ("ksmbd: fix racy issue from using ->d_parent and ->d_name")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Namjae Jeon
6cc2268f56 ksmbd: fix incorrect AllocationSize set in smb2_get_info
If filesystem support sparse file, ksmbd should return allocated size
using ->i_blocks instead of stat->size. This fix generic/694 xfstests.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Namjae Jeon
36322523dd ksmbd: fix UAF issue from opinfo->conn
If opinfo->conn is another connection and while ksmbd send oplock break
request to cient on current connection, The connection for opinfo->conn
can be disconnect and conn could be freed. When sending oplock break
request, this ksmbd_conn can be used and cause user-after-free issue.
When getting opinfo from the list, ksmbd check connection is being
released. If it is not released, Increase ->r_count to wait that connection
is freed.

Cc: stable@vger.kernel.org
Reported-by: Per Forlin <per.forlin@axis.com>
Tested-by: Per Forlin <per.forlin@axis.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Kuan-Ting Chen
0512a5f89e ksmbd: fix multiple out-of-bounds read during context decoding
Check the remaining data length before accessing the context structure
to ensure that the entire structure is contained within the packet.
Additionally, since the context data length `ctxt_len` has already been
checked against the total packet length `len_of_ctxts`, update the
comparison to use `ctxt_len`.

Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Kuan-Ting Chen
d738950f11 ksmbd: fix slab-out-of-bounds read in smb2_handle_negotiate
Check request_buf length first to avoid out-of-bounds read by
req->DialectCount.

[ 3350.990282] BUG: KASAN: slab-out-of-bounds in smb2_handle_negotiate+0x35d7/0x3e60
[ 3350.990282] Read of size 2 at addr ffff88810ad61346 by task kworker/5:0/276
[ 3351.000406] Workqueue: ksmbd-io handle_ksmbd_work
[ 3351.003499] Call Trace:
[ 3351.006473]  <TASK>
[ 3351.006473]  dump_stack_lvl+0x8d/0xe0
[ 3351.006473]  print_report+0xcc/0x620
[ 3351.006473]  kasan_report+0x92/0xc0
[ 3351.006473]  smb2_handle_negotiate+0x35d7/0x3e60
[ 3351.014760]  ksmbd_smb_negotiate_common+0x7a7/0xf00
[ 3351.014760]  handle_ksmbd_work+0x3f7/0x12d0
[ 3351.014760]  process_one_work+0xa85/0x1780

Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Namjae Jeon
84c5aa4792 ksmbd: fix credit count leakage
This patch fix the failure from smb2.credits.single_req_credits_granted
test. When client send 8192 credit request, ksmbd return 8191 credit
granted. ksmbd should give maximum possible credits that must be granted
within the range of not exceeding the max credit to client.

Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Namjae Jeon
df14afeed2 ksmbd: fix uninitialized pointer read in smb2_create_link()
There is a case that file_present is true and path is uninitialized.
This patch change file_present is set to false by default and set to
true when patch is initialized.

Fixes: 74d7970feb ("ksmbd: fix racy issue from using ->d_parent and ->d_name")
Reported-by: Coverity Scan <scan-admin@coverity.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
Namjae Jeon
48b47f0caa ksmbd: fix uninitialized pointer read in ksmbd_vfs_rename()
Uninitialized rd.delegated_inode can be used in vfs_rename().
Fix this by setting rd.delegated_inode to NULL to avoid the uninitialized
read.

Fixes: 74d7970feb ("ksmbd: fix racy issue from using ->d_parent and ->d_name")
Reported-by: Coverity Scan <scan-admin@coverity.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-26 20:27:46 -05:00
pengfuyuan
5ad9b4719f btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds
When compiling on a MIPS 64-bit machine we get these warnings:

    In file included from ./arch/mips/include/asm/cacheflush.h:13,
	             from ./include/linux/cacheflush.h:5,
	             from ./include/linux/highmem.h:8,
		     from ./include/linux/bvec.h:10,
		     from ./include/linux/blk_types.h:10,
                     from ./include/linux/blkdev.h:9,
	             from fs/btrfs/disk-io.c:7:
    fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
    fs/btrfs/disk-io.c💯34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds]
      100 |   kaddr = page_address(buf->pages[i]);
          |                        ~~~~~~~~~~^~~
    ./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’
     2135 | #define page_address(page) lowmem_page_address(page)
          |                                                ^~~~
    cc1: all warnings being treated as errors

We can check if i overflows to solve the problem. However, this doesn't make
much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop.
In addition, i < num_pages can also ensure that buf->pages[i] will not cross
the boundary. Unfortunately, this doesn't help with the problem observed here:
gcc still complains.

To fix this add a compile-time condition for the extent buffer page
array size limit, which would eventually lead to eliminating the whole
for loop.

CC: stable@vger.kernel.org # 5.10+
Signed-off-by: pengfuyuan <pengfuyuan@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:55 +02:00
Shida Zhang
8fd9f4232d btrfs: fix an uninitialized variable warning in btrfs_log_inode
This fixes the following warning reported by gcc 10.2.1 under x86_64:

../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’:
../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 6211 |   ret = insert_dir_log_key(trans, log, path, key.objectid,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 6212 |       first_dir_index, last_dir_index);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here
 6161 |  u64 last_range_start;
      |      ^~~~~~~~~~~~~~~~

This might be a false positive fixed in later compiler versions but we
want to have it fixed.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:04 +02:00
Christoph Hellwig
45c2f36871 btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
When I implemented the storage layer bio splitting, I was under the
assumption that we'll never split metadata bios.  But Qu reminded me that
this can actually happen with very old file systems with unaligned
metadata chunks and RAID0.

I still haven't seen such a case in practice, but we better handled this
case, especially as it is fairly easily to do not calling the ->end_іo
method directly in btrfs_end_io_work, and using the proper
btrfs_orig_bbio_end_io helper instead.

In addition to the old file system with unaligned metadata chunks case
documented in the commit log, the combination of the new scrub code
with Johannes pending raid-stripe-tree also triggers this case.  We
spent some time debugging it and found that this patch solves
the problem.

Fixes: 103c19723c ("btrfs: split the bio submission path into a separate file")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26 23:24:01 +02:00
Kees Cook
d67790ddf0 overflow: Add struct_size_t() helper
While struct_size() is normally used in situations where the structure
type already has a pointer instance, there are places where no variable
is available. In the past, this has been worked around by using a typed
NULL first argument, but this is a bit ugly. Add a helper to do this,
and replace the handful of instances of the code pattern with it.

Instances were found with this Coccinelle script:

@struct_size_t@
identifier STRUCT, MEMBER;
expression COUNT;
@@

-       struct_size((struct STRUCT *)\(0\|NULL\),
+       struct_size_t(struct STRUCT,
                MEMBER, COUNT)

Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: HighPoint Linux Team <linux@highpoint-tech.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Don Brace <don.brace@microchip.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Guo Xuenan <guoxuenan@huawei.com>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Daniel Latypov <dlatypov@google.com>
Cc: kernel test robot <lkp@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: netdev@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: megaraidlinux.pdl@broadcom.com
Cc: storagedev@microchip.com
Cc: linux-xfs@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230522211810.never.421-kees@kernel.org
2023-05-26 13:52:19 -07:00
Linus Torvalds
b158dd941b for-6.4-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRwqRsACgkQxWXV+ddt
 WDuCPQ//T8JVY6usnGF/Fw/3zbtDNvrdQLDfp3HovIg7gmLIBda0bT05w4Q46FUU
 l4BV0bHyTUNWPlXUmrrSmt8HipRe2z4Wjwc16azdLmSs5zf0FO1LbsCKDmM8Ncid
 LTi2jzyyb3E44ZzC/i7RCaBt+vYRb2ZmtZ/glh3K4H0GgTAYl1GxZoAoYgBnvmlG
 nvmlWWDaM2cRKaUREm75il37LKLIlW5jvdUFQrqwWNgUH72ay5/7SZxHywlk8x6b
 qwhhp+s6bMUNzi6CqE2SLnESjI9yl0l/0gLebhDXVulo0BiCrti+YLpueP4eQs1B
 yYXX3PvHOXhoN4tUQ4yDF9G57To4Gw1aiQOnWOOLcbyGG1ZgyekpoRRXh6r74LKt
 FDyWT+u/xd78by1km3VzqmvKtqHnRFNMYfP+MMDIhyhy5prKCWeVo7bC+2FP+89o
 kv9+0Z0w0lkLycFfLaewZkEv0/WY8GMuT7kptHQ2Ao6ulAvG+j97sgVBFGXJjeCr
 B1OAGdeTF79IV139bCxPA62cat87Zrh15mZN+y7U32Vs2JkOqbT0LTQGKoVs/TCI
 AyHCDb8oOfGiebibnEDrDNtubz7NFCq4ntZRmuv5FJ+l2d1wl6ZvsI+DoYP7Zide
 DLR7ZtPs1Yvm27xDjs+fVmMx4nuNGikEbPZPxJro1CjLVzCEt7k=
 =elHB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - handle memory allocation error in checksumming helper (reported by
   syzbot)

 - fix lockdep splat when aborting a transaction, add NOFS protection
   around invalidate_inode_pages2 that could allocate with GFP_KERNEL

 - reduce chances to hit an ENOSPC during scrub with RAID56 profiles

* tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use nofs when cleaning up aborted transactions
  btrfs: handle memory allocation failure in btrfs_csum_one_bio
  btrfs: scrub: try harder to mark RAID56 block groups read-only
2023-05-26 13:21:38 -07:00
Steve French
b535cc796a smb3: missing null check in SMB2_change_notify
If plen is null when passed in, we only checked for null
in one of the two places where it could be used. Although
plen is always valid (not null) for current callers of the
SMB2_change_notify function, this change makes it more consistent.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/all/202305251831.3V1gbbFs-lkp@intel.com/
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-25 21:42:20 -05:00
Linus Torvalds
0d85b27b0c four smb3 client server fixes (3 also for stable) and 3 patches related to move of fs/cifs and fs/ksmbd directories to common fs/smb parent directory
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRv7UIACgkQiiy9cAdy
 T1GzUQv+KF/IDyb5wzxamh35hSDLzQo1KKaGPdumN+xyQXUhiaml3XqWfQPWC3EO
 vDF4a5zvi5Wm0TNwwYqUFwgMBKFqlUHw64qEkIv6MW/IHOv8/CYepBIeTLwIQCyr
 REJgfU1oJJLa0U4DsPYpwgEVqnuFdb20oaKMPVTgCAHnnpKsBTtKa7ZDbCBZtHOV
 URg7at6c/Dc6uWGOWRif++llmq5a5b6sBxtZ+C99dQGDKvSqbFTOf6If1u6HAaO0
 m75DPcb9o2IA2lLjxALbbIeofPeEphkcH2WBUNHC2tfFo91EcndVxfKB/jo8wzP7
 /5MGHlFEjmupPsPJq6bbMNj+jyPa3UM/CGqsw8ij4SmmIIt0FbBsFPuEMIPmtAsW
 GJL0/Nf1cDiJJIeMahaW936VRK66VLkEGvhKFCxVpPA93IN0eNh1E0HSXsGrpYJ+
 lp4edXJam/2rHngbPgB+LUaPHoVTj0xZwTDGDzNlyI6S5HWcBv43CgzlFF1sWtpD
 4CNjpqzS
 =sHwf
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb directory moves and client fixes from Steve French:
 "Four smb3 client fixes (three of which marked for stable) and three
  patches to move of fs/cifs and fs/ksmbd to a new common "fs/smb"
  parent directory

   - Move the client and server source directories to a common parent
     directory:

       fs/cifs -> fs/smb/client
       fs/ksmbd -> fs/smb/server
       fs/smbfs_common -> fs/smb/common

   - important readahead fix

   - important fix for SMB1 regression

   - fix for missing mount option ("mapchars") in mount API conversion

   - minor debugging improvement"

* tag '6.4-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: move Documentation/filesystems/cifs to Documentation/filesystems/smb
  cifs: correct references in Documentation to old fs/cifs path
  smb: move client and server files to common directory fs/smb
  cifs: mapchars mount option ignored
  smb3: display debug information better for encryption
  cifs: fix smb1 mount regression
  cifs: Fix cifs_limit_bvec_subset() to correctly check the maxmimum size
2023-05-25 19:23:18 -07:00
Tetsuo Handa
d031f4e8b4 reiserfs: Initialize sec->length in reiserfs_security_init().
syzbot is reporting that sec->length is not initialized.

Since security_inode_init_security() returns 0 when initxattrs is provided
but call_int_hook(inode_init_security) returned -EOPNOTSUPP, control will
reach to "if (sec->length && ...) {" without initializing sec->length.

Reported-by: syzbot <syzbot+00a3779539a23cbee38c@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=00a3779539a23cbee38c
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 52ca4b6435 ("reiserfs: Switch to security_inode_init_security()")
Signed-off-by: Paul Moore <paul@paul-moore.com>
2023-05-25 17:44:44 -04:00
Linus Torvalds
9db898594c vfs/v6.4-rc2/misc.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZG9CygAKCRCRxhvAZXjc
 opSUAP94up0d2bhB4CDRGkszpBefogBXyEylT8v+1EPtzs8K6QEA9OEbn4wWsIlh
 vYLUjejArgUGuxDl7iiZzAx8p6n9qws=
 =lEs3
 -----END PGP SIGNATURE-----

Merge tag 'vfs/v6.4-rc3/misc.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - During the acl rework we merged this cycle the generic_listxattr()
   helper had to be modified in a way that in principle it would allow
   for POSIX ACLs to be reported. At least that was the impression we
   had initially. Because before the acl rework POSIX ACLs would be
   reported if the filesystem did have POSIX ACL xattr handlers in
   sb->s_xattr. That logic changed and now we can simply check whether
   the superblock has SB_POSIXACL set and if the inode has
   inode->i_{default_}acl set report the appropriate POSIX ACL name.

   However, we didn't realize that generic_listxattr() was only ever
   used by two filesystems. Both of them don't support POSIX ACLs via
   sb->s_xattr handlers and so never reported POSIX ACLs via
   generic_listxattr() even if they raised SB_POSIXACL and did contain
   inodes which had acls set. The example here is nfs4.

   As a result, generic_listxattr() suddenly started reporting POSIX
   ACLs when it wouldn't have before. Since SB_POSIXACL implies that the
   umask isn't stripped in the VFS nfs4 can't just drop SB_POSIXACL from
   the superblock as it would also alter umask handling for them.

   So just have generic_listxattr() not report POSIX ACLs as it never
   did anyway. It's documented as such.

 - Our SB_* flags currently use a signed integer and we shift the last
   bit causing UBSAN to complain about undefined behavior. Switch to
   using unsigned. While the original patch used an explicit unsigned
   bitshift it's now pretty common to rely on the BIT() macro in a lot
   of headers nowadays. So the patch has been adjusted to use that.

 - Add Namjae as ntfs reviewer. They're already active this cycle so
   let's make it explicit right now.

* tag 'vfs/v6.4-rc3/misc.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  ntfs: Add myself as a reviewer
  fs: don't call posix_acl_listxattr in generic_listxattr
  fs: fix undefined behavior in bit shift for SB_NOUSER
2023-05-25 11:03:58 -07:00
Amir Goldstein
7cdafe6cc4 exportfs: check for error return value from exportfs_encode_*()
The exportfs_encode_*() helpers call the filesystem ->encode_fh()
method which returns a signed int.

All the in-tree implementations of ->encode_fh() return a positive
integer and FILEID_INVALID (255) for error.

Fortify the callers for possible future ->encode_fh() implementation
that will return a negative error value.

name_to_handle_at() would propagate the returned error to the users
if filesystem ->encode_fh() method returns an error.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-fsdevel/ca02955f-1877-4fde-b453-3c1d22794740@kili.mountain/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230524154825.881414-1-amir73il@gmail.com>
2023-05-25 13:17:04 +02:00
Amir Goldstein
a95aef69a7 fanotify: support reporting non-decodeable file handles
fanotify users do not always need to decode the file handles reported
with FAN_REPORT_FID.

Relax the restriction that filesystem needs to support NFS export and
allow reporting file handles from filesystems that only support ecoding
unique file handles.

Even filesystems that do not have export_operations at all can fallback
to use the default FILEID_INO32_GEN encoding, but we use the existence
of export_operations as an indication that the encoded file handles will
be sufficiently unique and that user will be able to compare them to
filesystem objects using AT_HANDLE_FID flag to name_to_handle_at(2).

For filesystems that do not support NFS export, users will have to use
the AT_HANDLE_FID of name_to_handle_at(2) if they want to compare the
object in path to the object fid reported in an event.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230502124817.3070545-5-amir73il@gmail.com>
2023-05-25 13:17:04 +02:00
Amir Goldstein
96b2b072ee exportfs: allow exporting non-decodeable file handles to userspace
Some userspace programs use st_ino as a unique object identifier, even
though inode numbers may be recycable.

This issue has been addressed for NFS export long ago using the exportfs
file handle API and the unique file handle identifiers are also exported
to userspace via name_to_handle_at(2).

fanotify also uses file handles to identify objects in events, but only
for filesystems that support NFS export.

Relax the requirement for NFS export support and allow more filesystems
to export a unique object identifier via name_to_handle_at(2) with the
flag AT_HANDLE_FID.

A file handle requested with the AT_HANDLE_FID flag, may or may not be
usable as an argument to open_by_handle_at(2).

To allow filesystems to opt-in to supporting AT_HANDLE_FID, a struct
export_operations is required, but even an empty struct is sufficient
for encoding FIDs.

Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230502124817.3070545-4-amir73il@gmail.com>
2023-05-25 13:16:57 +02:00
Steve French
bf8a352d49 cifs: correct references in Documentation to old fs/cifs path
The fs/cifs directory has moved to fs/smb/client, correct mentions
of this in Documentation and comments.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24 16:29:21 -05:00
Steve French
38c8a9a520 smb: move client and server files to common directory fs/smb
Move CIFS/SMB3 related client and server files (cifs.ko and ksmbd.ko
and helper modules) to new fs/smb subdirectory:

   fs/cifs --> fs/smb/client
   fs/ksmbd --> fs/smb/server
   fs/smbfs_common --> fs/smb/common

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24 16:29:21 -05:00
Steve French
cb8b02fd63 cifs: mapchars mount option ignored
There are two ways that special characters (not allowed in some
other operating systems like Windows, but allowed in POSIX) have
been mapped in the past ("SFU" and "SFM" mappings) to allow them
to be stored in a range reserved for special chars. The default
for Linux has been to use "mapposix" (ie the SFM mapping) but
the conversion to the new mount API in the 5.11 kernel broke
the ability to override the default mapping of the reserved
characters (like '?' and '*' and '\') via "mapchars" mount option.

This patch fixes that - so can now mount with "mapchars"
mount option to override the default ("mapposix" ie SFM) mapping.

Reported-by: Tyler Spivey <tspivey8@gmail.com>
Fixes: 24e0a1eff9 ("cifs: switch to new mount api")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24 16:26:44 -05:00
Steve French
8b4dd44f9b smb3: display debug information better for encryption
Fix /proc/fs/cifs/DebugData to use the same case for "encryption"
(ie "Encryption" with init capital letter was used in one place).
In addition, if gcm256 encryption (intead of gcm128) is used on
a connection to a server, note that in the DebugData as well.

It now displays (when gcm256 negotiated):
 Security type: RawNTLMSSP  SessionId: 0x86125800bc000b0d encrypted(gcm256)

Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24 16:23:33 -05:00
Paulo Alcantara
72a7804a66 cifs: fix smb1 mount regression
cifs.ko maps NT_STATUS_NOT_FOUND to -EIO when SMB1 servers couldn't
resolve referral paths.  Proceed to tree connect when we get -EIO from
dfs_get_referral() as well.

Reported-by: Kris Karas (Bug Reporting) <bugs-a21@moonlit-rail.com>
Tested-by: Woody Suwalski <terraluna977@gmail.com>
Fixes: 8e3554150d ("cifs: fix sharing of DFS connections")
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-24 16:22:44 -05:00
Alexander Aring
57e2c2f2d9 fs: dlm: fix mismatch of plock results from userspace
When a waiting plock request (F_SETLKW) is sent to userspace
for processing (dlm_controld), the result is returned at a
later time. That result could be incorrectly matched to a
different waiting request in cases where the owner field is
the same (e.g. different threads in a process.) This is fixed
by comparing all the properties in the request and reply.

The results for non-waiting plock requests are now matched
based on list order because the results are returned in the
same order they were sent.

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-24 14:27:07 -05:00
Christoph Hellwig
e51bab4e20 block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:44 -06:00
David Howells
a450f49708 iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
ZERO_PAGE can't go away, no need to hold an extra reference.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-2-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:44 -06:00
Jens Axboe
bbeb087e5a Merge branch 'for-6.5/splice' into for-6.5/block
Merge splice bits as subsequent block cleanups and improvements for DIO
depend on them.

* for-6.5/splice: (31 commits)
  splice: kdoc for filemap_splice_read() and copy_splice_read()
  iov_iter: Kill ITER_PIPE
  splice: Remove generic_file_splice_read()
  splice: Use filemap_splice_read() instead of generic_file_splice_read()
  cifs: Use filemap_splice_read()
  trace: Convert trace/seq to use copy_splice_read()
  zonefs: Provide a splice-read wrapper
  xfs: Provide a splice-read wrapper
  orangefs: Provide a splice-read wrapper
  ocfs2: Provide a splice-read wrapper
  ntfs3: Provide a splice-read wrapper
  nfs: Provide a splice-read wrapper
  f2fs: Provide a splice-read wrapper
  ext4: Provide a splice-read wrapper
  ecryptfs: Provide a splice-read wrapper
  ceph: Provide a splice-read wrapper
  afs: Provide a splice-read wrapper
  9p: Add splice_read wrapper
  net: Make sock_splice_read() use copy_splice_read() by default
  tty, proc, kernfs, random: Use copy_splice_read()
  ...
2023-05-24 08:42:22 -06:00
David Howells
9eee8bd814 splice: kdoc for filemap_splice_read() and copy_splice_read()
Provide kerneldoc comments for filemap_splice_read() and
copy_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-32-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:17 -06:00
David Howells
c6585011bc splice: Remove generic_file_splice_read()
Remove generic_file_splice_read() as it has been replaced with calls to
filemap_splice_read() and copy_splice_read().

With this, ITER_PIPE is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-30-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:17 -06:00
David Howells
2cb1e08985 splice: Use filemap_splice_read() instead of generic_file_splice_read()
Replace pointers to generic_file_splice_read() with calls to
filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-29-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:17 -06:00
David Howells
ab82513126 cifs: Use filemap_splice_read()
Make cifs use filemap_splice_read() rather than doing its own version of
generic_file_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-28-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
6ef48ec391 zonefs: Provide a splice-read wrapper
Provide a splice_read wrapper for zonefs.  This does some checks before
proceeding and locks the inode across the call to filemap_splice_read() and
a size check in case of truncation.  Splicing from direct I/O is handled by
the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230522135018.2742245-26-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
54919f94ec xfs: Provide a splice-read wrapper
Provide a splice_read wrapper for XFS.  This does a stat count and a
shutdown check before proceeding, then emits a new trace line and locks the
inode across the call to filemap_splice_read() and adds to the stats
afterwards.  Splicing from direct I/O or DAX is handled by the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-25-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
6bbf64beab orangefs: Provide a splice-read wrapper
Provide a splice_read wrapper for ocfs2.  This increments the read stats
and then locks the inode across the call to filemap_splice_read() and a
revalidation of the mapping.  Splicing from direct I/O is done by the
caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Mike Marshall <hubcap@omnibond.com>
cc: Martin Brandenburg <martin@omnibond.com>
cc: devel@lists.orangefs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-24-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
94aca682a4 ocfs2: Provide a splice-read wrapper
Provide a splice_read wrapper for ocfs2.  This emits trace lines and does
an atime lock/update before calling filemap_splice_read().  Splicing from
direct I/O is handled by the caller.

A couple of new tracepoints are added for this purpose.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Mark Fasheh <mark@fasheh.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: ocfs2-devel@oss.oracle.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-23-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
5149439880 ntfs3: Provide a splice-read wrapper
Provide a splice_read wrapper for NTFS3 to perform various checks before
allowing the operation to proceed.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
cc: ntfs3@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-22-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
a7db503401 nfs: Provide a splice-read wrapper
Provide a splice_read wrapper for NFS.  This locks the inode around
filemap_splice_read() and revalidates the mapping.  Splicing from direct
I/O is handled by the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-21-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
ceb11d0e2d f2fs: Provide a splice-read wrapper
Provide a splice_read wrapper for f2fs.  This does some checks and tracing
before calling filemap_splice_read() and will update the iostats
afterwards.  Direct I/O is handled by the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jaegeuk Kim <jaegeuk@kernel.org>
cc: Chao Yu <chao@kernel.org>
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-20-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
fa6c46e7c2 ext4: Provide a splice-read wrapper
Provide a splice_read wrapper for Ext4.  This does the inode shutdown check
before proceeding.  Splicing from DAX files and O_DIRECT fds is handled by
the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-19-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
390df3b830 ecryptfs: Provide a splice-read wrapper
Provide a splice_read wrapper for ecryptfs to update the access time on the
lower file after the operation.  Splicing from a direct I/O fd will update
the access time when ->read_iter() is called.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Tyler Hicks <code@tyhicks.com>
cc: ecryptfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-18-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
ccfdf7cbb5 ceph: Provide a splice-read wrapper
Provide a splice_read wrapper for Ceph.  This does the inode shutdown check
before proceeding and jumps to copy_splice_read() if the file has inline
data or is a synchronous file.

We try and get FILE_RD and either FILE_CACHE and/or FILE_LAZYIO caps and
hold them across filemap_splice_read().  If we fail to get FILE_CACHE or
FILE_LAZYIO capabilities, we use copy_splice_read() instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-17-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
d96d96eebb afs: Provide a splice-read wrapper
Provide a splice_read wrapper for AFS to call afs_validate() before going
into generic_file_splice_read() so that we're likely to have a callback
promise from the server.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-16-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
c829d0bd33 9p: Add splice_read wrapper
Add a splice_read wrapper for 9p.  We should use copy_splice_read() if
9PL_DIRECT is set and filemap_splice_read() otherwise.  Note that this
doesn't seem to be particularly related to O_DIRECT.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-15-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
b0072734ff tty, proc, kernfs, random: Use copy_splice_read()
Use copy_splice_read() for tty, procfs, kernfs and random files rather
than going through generic_file_splice_read() as they just copy the file
into the output buffer and don't splice pages.  This avoids the need for
them to have a ->read_folio() to satisfy filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Arnd Bergmann <arnd@arndb.de>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-13-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
a1be2935d0 coda: Implement splice-read
Implement splice-read for coda by passing the request down a layer rather
than going through generic_file_splice_read() which is going to be changed
to assume that ->read_folio() is present on buffered files.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: coda@cs.cmu.edu
cc: codalist@coda.cs.cmu.edu
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-12-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
d4120d87a0 overlayfs: Implement splice-read
Implement splice-read for overlayfs by passing the request down a layer
rather than going through generic_file_splice_read() which is going to be
changed to assume that ->read_folio() is present on buffered files.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Christian Brauner <brauner@kernel.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Amir Goldstein <amir73il@gmail.com>
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-11-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
b85930a077 splice: Make splice from a DAX file use copy_splice_read()
Make a read splice from a DAX file go directly to copy_splice_read() to do
the reading as filemap_splice_read() is unlikely to find any pagecache to
splice.

I think this affects only erofs, Ext2, Ext4, fuse and XFS.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-erofs@lists.ozlabs.org
cc: linux-ext4@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-9-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
aa3dbde878 splice: Make splice from an O_DIRECT fd use copy_splice_read()
Make a read splice from a file descriptor that's open O_DIRECT use
copy_splice_read() to do the reading as filemap_splice_read() is unlikely
to find any pagecache to splice.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-8-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
123856f0e8 splice: Check for zero count in vfs_splice_read()
Make vfs_splice_read() return immediately if the length is 0.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230522135018.2742245-7-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
6a3f30b8bd splice: Make do_splice_to() generic and export it
Rename do_splice_to() to vfs_splice_read() and export it so that it can be
used as a helper when calling down to a lower layer filesystem as it
performs all the necessary checks[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/CAJfpeguGksS3sCigmRi9hJdUec8qtM9f+_9jC1rJhsXT+dV01w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/r/20230522135018.2742245-6-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:16 -06:00
David Howells
e69f37bce1 splice: Clean up copy_splice_read() a bit
Do a couple of cleanups to copy_splice_read():

 (1) Cast to struct page **, not void *.

 (2) Simplify the calculation of the number of pages to keep/reclaim in
     copy_splice_read().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-5-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:15 -06:00
David Howells
69df79a451 splice: Rename direct_splice_read() to copy_splice_read()
Rename direct_splice_read() to copy_splice_read() to better reflect as to
what it does.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24 08:42:15 -06:00
Danila Chernetsov
aa4b92c523
ntfs: do not dereference a null ctx on error
In ntfs_mft_data_extend_allocation_nolock(), if an error condition occurs
prior to 'ctx' being set to a non-NULL value, avoid dereferencing the NULL
'ctx' pointer in error handling.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Danila Chernetsov <listdansp@mail.ru>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-24 11:10:14 +02:00
David Sterba
b7a9a503c3 fs: use UB-safe check for signed addition overflow in remap_verify_area
The following warning pops up with enabled UBSAN in tests fstests/generic/303:

  [23127.529395] UBSAN: Undefined behaviour in fs/read_write.c:1725:7
  [23127.529400] signed integer overflow:
  [23127.529403] 4611686018427322368 + 9223372036854775807 cannot be represented in type 'long long int'
  [23127.529412] CPU: 4 PID: 26180 Comm: xfs_io Not tainted 5.2.0-rc2-1.ge195904-vanilla+ #450
  [23127.556999] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
  [23127.557001] Call Trace:
  [23127.557060]  dump_stack+0x67/0x9b
  [23127.557070]  ubsan_epilogue+0x9/0x40
  [23127.573496]  handle_overflow+0xb3/0xc0
  [23127.573514]  do_clone_file_range+0x28f/0x2a0
  [23127.573547]  vfs_clone_file_range+0x35/0xb0
  [23127.573564]  ioctl_file_clone+0x8d/0xc0
  [23127.590144]  do_vfs_ioctl+0x300/0x700
  [23127.590160]  ksys_ioctl+0x70/0x80
  [23127.590203]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [23127.590210]  __x64_sys_ioctl+0x16/0x20
  [23127.590215]  do_syscall_64+0x5c/0x1d0
  [23127.590224]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [23127.590231] RIP: 0033:0x7ff6d7250327
  [23127.590241] RSP: 002b:00007ffe3a38f1d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [23127.590246] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ff6d7250327
  [23127.590249] RDX: 00007ffe3a38f220 RSI: 000000004020940d RDI: 0000000000000003
  [23127.590252] RBP: 0000000000000000 R08: 00007ffe3a3c80a0 R09: 00007ffe3a3c8080
  [23127.590255] R10: 000000000fa99fa0 R11: 0000000000000206 R12: 0000000000000000
  [23127.590260] R13: 0000000000000000 R14: 3fffffffffff0000 R15: 00007ff6d750a20c

As loff_t is a signed type, we should use the safe overflow checks
instead of relying on compiler implementation.

The bogus values are intentional and the test is supposed to verify the
boundary conditions.

Signed-off-by: David Sterba <dsterba@suse.com>
Message-Id: <20230523162628.17071-1-dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-24 11:03:59 +02:00
Joel Granados
b8cbc0855a sysctl: Remove register_sysctl_table
This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. After removing all the calling functions, we
remove both the register_sysctl_table function and the documentation
check that appeared in check-sysctl-docs awk script.

We save 595 bytes with this change:

./scripts/bloat-o-meter vmlinux.1.refactor-base-paths vmlinux.2.remove-sysctl-table
add/remove: 2/8 grow/shrink: 1/0 up/down: 1154/-1749 (-595)
Function                                     old     new   delta
count_subheaders                               -     983    +983
unregister_sysctl_table                       29     184    +155
__pfx_count_subheaders                         -      16     +16
__pfx_unregister_sysctl_table.part            16       -     -16
__pfx_register_leaf_sysctl_tables.constprop   16       -     -16
__pfx_count_subheaders.part                   16       -     -16
__pfx___register_sysctl_base                  16       -     -16
unregister_sysctl_table.part                 136       -    -136
__register_sysctl_base                       478       -    -478
register_leaf_sysctl_tables.constprop        524       -    -524
count_subheaders.part                        547       -    -547
Total: Before=21257652, After=21257057, chg -0.00%

[mcgrof: remove register_leaf_sysctl_tables and append_path too and
 add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
2023-05-23 21:43:26 -07:00
Joel Granados
2f5edd03ca sysctl: Refactor base paths registrations
This is part of the general push to deprecate register_sysctl_paths and
register_sysctl_table. The old way of doing this through
register_sysctl_base and DECLARE_SYSCTL_BASE macro is replaced with a
call to register_sysctl_init. The 5 base paths affected are: "kernel",
"vm", "debug", "dev" and "fs".

We remove the register_sysctl_base function and the DECLARE_SYSCTL_BASE
macro since they are no longer needed.

In order to quickly acertain that the paths did not actually change I
executed `find /proc/sys/ | sha1sum` and made sure that the sha was the
same before and after the commit.

We end up saving 563 bytes with this change:

./scripts/bloat-o-meter vmlinux.0.base vmlinux.1.refactor-base-paths
add/remove: 0/5 grow/shrink: 2/0 up/down: 77/-640 (-563)
Function                                     old     new   delta
sysctl_init_bases                             55     111     +56
init_fs_sysctls                               12      33     +21
vm_base_table                                128       -    -128
kernel_base_table                            128       -    -128
fs_base_table                                128       -    -128
dev_base_table                               128       -    -128
debug_base_table                             128       -    -128
Total: Before=21258215, After=21257652, chg -0.00%

[mcgrof: modified to use register_sysctl_init() over register_sysctl()
 and add bloat-o-meter stats]

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Christian Brauner <brauner@kernel.org>
2023-05-23 21:43:26 -07:00
Joel Granados
19c4e618a1 sysctl: stop exporting register_sysctl_table
We make register_sysctl_table static because the only function calling
it is in fs/proc/proc_sysctl.c (__register_sysctl_base). We remove it
from the sysctl.h header and modify the documentation in both the header
and proc_sysctl.c files to mention "register_sysctl" instead of
"register_sysctl_table".

This plus the commits that remove register_sysctl_table from parport
save 217 bytes:

./scripts/bloat-o-meter .bsysctl/vmlinux.old .bsysctl/vmlinux.new
add/remove: 0/1 grow/shrink: 5/1 up/down: 458/-675 (-217)
Function                                     old     new   delta
__register_sysctl_base                         8     286    +278
parport_proc_register                        268     379    +111
parport_device_proc_register                 195     247     +52
kzalloc.constprop                            598     608     +10
parport_default_proc_register                 62      69      +7
register_sysctl_table                        291       -    -291
parport_sysctl_template                     1288     904    -384
Total: Before=8603076, After=8602859, chg -0.00%

Signed-off-by: Joel Granados <j.granados@samsung.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-23 21:43:26 -07:00
Kees Cook
d617ef039f fscrypt: Replace 1-element array with flexible array
1-element arrays are deprecated and are being replaced with C99
flexible arrays[1].

As sizes were being calculated with the extra byte intentionally,
propagate the difference so there is no change in binary output.

[1] https://github.com/KSPP/linux/issues/79

Cc: Eric Biggers <ebiggers@kernel.org>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: linux-fscrypt@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230523165458.gonna.580-kees@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-05-23 19:46:09 -07:00
Jaegeuk Kim
633c8b9409 f2fs: fix the wrong condition to determine atomic context
Should use !in_task for irq context.

Cc: stable@vger.kernel.org
Fixes: 1aa161e431 ("f2fs: fix scheduling while atomic in decompression path")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-23 18:37:42 -07:00
Daeho Jeong
e067dc3c6b f2fs: maintain six open zones for zoned devices
To keep six open zone constraints, make them not to be open over six
open zones.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-23 18:37:38 -07:00
David Howells
4ef4aee67e cifs: Fix cifs_limit_bvec_subset() to correctly check the maxmimum size
Fix cifs_limit_bvec_subset() so that it limits the span to the maximum
specified and won't return with a size greater than max_size.

Fixes: d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")
Cc: stable@vger.kernel.org # 6.3
Reported-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-23 15:35:44 -05:00
Alexander Aring
0f2b1cb89c fs: dlm: make F_SETLK use unkillable wait_event
While a non-waiting posix lock request (F_SETLK) is waiting for
user space processing (in dlm_controld), wait for that processing
to complete with an unkillable wait_event(). This makes F_SETLK
behave the same way for F_RDLCK, F_WRLCK and F_UNLCK. F_SETLKW
continues to use wait_event_killable().

Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-23 12:56:58 -05:00
Linus Torvalds
5fe326b446 Changes since last update:
- Fix null-ptr-deref related to long xattr name prefixes;
 
  - Avoid pcpubuf compilation if CONFIG_EROFS_FS_ZIP is off;
 
  - Use high priority kthreads by default if per-cpu kthread workers are
    enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZGzgSBEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBP4SAP9l5ct5U/aqteASSm+VkEjtZe546A3WwoYK
 dXgY8LzKAAD/QfWVpBocK605rbEBb2KfJMnvgQ20Pvzd2jQhox8x7Qg=
 =CaUC
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.4-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "One patch addresses a null-ptr-deref issue reported by syzbot weeks
  ago, which is caused by the new long xattr name prefix feature and
  needs to be fixed.

  The remaining two patches are minor cleanups to avoid unnecessary
  compilation and adjust per-cpu kworker configuration.

  Summary:

   - Fix null-ptr-deref related to long xattr name prefixes

   - Avoid pcpubuf compilation if CONFIG_EROFS_FS_ZIP is off

   - Use high priority kthreads by default if per-cpu kthread workers
     are enabled"

* tag 'erofs-for-6.4-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: use HIPRI by default if per-cpu kthreads are enabled
  erofs: avoid pcpubuf.c inclusion if CONFIG_EROFS_FS_ZIP is off
  erofs: fix null-ptr-deref caused by erofs_xattr_prefixes_init
2023-05-23 10:47:32 -07:00
Jeff Layton
d53d70084d nfsd: make a copy of struct iattr before calling notify_change
notify_change can modify the iattr structure. In particular it can
end up setting ATTR_MODE when ATTR_KILL_SUID is already set, causing
a BUG() if the same iattr is passed to notify_change more than once.

Make a copy of the struct iattr before calling notify_change.

Reported-by: Zhi Li <yieli@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2207969
Tested-by: Zhi Li <yieli@redhat.com>
Fixes: 34b91dda71 ("NFSD: Make nfsd4_setattr() wait before returning NFS4ERR_DELAY")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-23 09:46:26 -04:00
Gao Xiang
cf7f2732b4 erofs: use HIPRI by default if per-cpu kthreads are enabled
As Sandeep shown [1], high priority RT per-cpu kthreads are
typically helpful for Android scenarios to minimize the scheduling
latencies.

Switch EROFS_FS_PCPU_KTHREAD_HIPRI on by default if
EROFS_FS_PCPU_KTHREAD is on since it's the typical use cases for
EROFS_FS_PCPU_KTHREAD.

Also clean up unneeded sched_set_normal().

[1] https://lore.kernel.org/r/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com

Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230522092141.124290-1-hsiangkao@linux.alibaba.com
2023-05-23 16:57:08 +08:00
Yue Hu
285d0f85da erofs: avoid pcpubuf.c inclusion if CONFIG_EROFS_FS_ZIP is off
The function of pcpubuf.c is just for low-latency decompression
algorithms (e.g. lz4).

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230515095758.10391-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-23 16:56:40 +08:00
Jingbo Xu
0a17567b4a erofs: fix null-ptr-deref caused by erofs_xattr_prefixes_init
Fragments and dedupe share one feature bit, and thus packed inode may not
exist when fragment feature bit (dedupe feature bit exactly) is set, e.g.
when deduplication feature is in use while fragments feature is not.  In
this case, sbi->packed_inode could be NULL while fragments feature bit
is set.

Fix this by accessing packed inode only when it exists.

Reported-by: syzbot+902d5a9373ae8f748a94@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=902d5a9373ae8f748a94
Reported-and-tested-by: syzbot+bbb353775d51424087f2@syzkaller.appspotmail.com
Fixes: 9e38291461 ("erofs: add helpers to load long xattr name prefixes")
Fixes: 6a318ccd7e ("erofs: enable long extended attribute name prefixes")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230515103941.129784-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2023-05-23 16:56:21 +08:00
Azeem Shaikh
883f8fe876 vboxsf: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230510211146.3486600-1-azeemshaikh38@gmail.com
2023-05-22 12:35:14 -07:00
Azeem Shaikh
8ca25e00cf NFS: Prefer strscpy over strlcpy calls
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
Check for strscpy()'s return value of -E2BIG on truncate for safe
replacement with strlcpy().

This is part of a tree-wide cleanup to remove the strlcpy() function
entirely from the kernel [2].

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230512155749.1356958-1-azeemshaikh38@gmail.com
2023-05-22 12:34:41 -07:00
Alexander Aring
59e45c758c fs: dlm: interrupt posix locks only when process is killed
If a posix lock request is waiting for a result from user space
(dlm_controld), do not let it be interrupted unless the process
is killed. This reverts commit a6b1533e9a ("dlm: make posix locks
interruptible"). The problem with the interruptible change is
that all locks were cleared on any signal interrupt. If a signal
was received that did not terminate the process, the process
could continue running after all its dlm posix locks had been
cleared. A future patch will add cancelation to allow proper
interruption.

Cc: stable@vger.kernel.org
Fixes: a6b1533e9a ("dlm: make posix locks interruptible")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22 14:34:32 -05:00
Azeem Shaikh
30ad0627f1 dlm: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230510221237.3509484-1-azeemshaikh38@gmail.com
2023-05-22 12:34:09 -07:00
Linus Torvalds
421ca22e31 NFS Client Bugfixes for Linux 6.4-rc
Stable Fix:
   * Don't change task->tk_status after the call to rpc_exit_task
 
 Other Bugfixes:
   * Convert kmap_atomic() to kmap_local_folio()
   * Fix a potential double free with READ_PLUS
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRrttUACgkQ18tUv7Cl
 QOuhaA//QFHklXZk/vCkQnNQMYWL11GJliWawLoDfcZal6uQ/a2QCQV1Cbmav62B
 FR2BmXDxzM2PRdLu2VHGpkn0CQW3M1tvgaNjGD1xdOxpyIkn47T5lfAd/4X2XPiU
 M1ck2Usc258UB1yoKV+jbUD3ptn2BvC+VMWJInaA578hv8TA6Ouh7lP7rPJfDHoJ
 OfoLxx9/VqGqMWzfExAHnGw328oieXNnOwynETAdapVwjQeiEcYAED82pJmVsD7+
 m++6dRVQRA2bMIMRFWmW8HsO08sR32wzy76XgKws4Xu59Fiy+TQ8PoeUjCtTNq6/
 9ibPwH4R7VbcxXa2eT23EbtO2nSkZw/dFiL0s5VNYqeVrBwwlzyklU1uSvIEPegk
 zHamqxMMlVLkoMwJa83wIKB8/viPKwV5zcF9UjmrJy67+wXZet6M0c7S9HyiTj9U
 NzVbqyK3KhMtsD4ps/EGVWsgGKAIeWbE8wPlP7GF7PHwEw+hWa9pHir6L6BizNqG
 DJ/2zfZxDvOGy2r5OvSqGn07/zsj+0URixzEq0IOn1Li/osFZpvK3EVFncd/qsvW
 NwPRoF+70skFRdXhbdWa/HEUZlyN2uiIU24luraMrN0U4b4X7aw+EMnMekBi+Vec
 bEtWEUJ/vK3mlsOde4gVW0PZBhe8JE6PHlqkQBn5zobV3/cXXCw=
 =6xFZ
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Stable Fix:

   - Don't change task->tk_status after the call to rpc_exit_task

  Other Bugfixes:

   - Convert kmap_atomic() to kmap_local_folio()

   - Fix a potential double free with READ_PLUS"

* tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Fix a potential double free with READ_PLUS
  SUNRPC: Don't change task->tk_status after the call to rpc_exit_task
  NFS: Convert kmap_atomic() to kmap_local_folio()
2023-05-22 12:01:13 -07:00
Alexander Aring
c847f4e203 fs: dlm: fix cleanup pending ops when interrupted
Immediately clean up a posix lock request if it is interrupted
while waiting for a result from user space (dlm_controld.)  This
largely reverts the recent commit b92a4e3f86 ("fs: dlm: change
posix lock sigint handling"). That previous commit attempted
to defer lock cleanup to the point in time when a result from
user space arrived. The deferred approach was not reliable
because some dlm plock ops may not receive replies.

Cc: stable@vger.kernel.org
Fixes: b92a4e3f86 ("fs: dlm: change posix lock sigint handling")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22 13:49:55 -05:00
Alexander Aring
92655fbda5 fs: dlm: return positive pid value for F_GETLK
The GETLK pid values have all been negated since commit 9d5b86ac13
("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks").
Revert this for local pids, and leave in place negative pids for remote
owners.

Cc: stable@vger.kernel.org
Fixes: 9d5b86ac13 ("fs/locks: Remove fl_nspid and use fs-specific l_pid for remote locks")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22 13:34:47 -05:00
Azeem Shaikh
35d6df53b9 dlm: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2023-05-22 11:53:22 -05:00
Amir Goldstein
304e9c83e8 exportfs: add explicit flag to request non-decodeable file handles
So far, all callers of exportfs_encode_inode_fh(), except for fsnotify's
show_mark_fhandle(), check that filesystem can decode file handles, but
we would like to add more callers that do not require a file handle that
can be decoded.

Introduce a flag to explicitly request a file handle that may not to be
decoded later and a wrapper exportfs_encode_fid() that sets this flag
and convert show_mark_fhandle() to use the new wrapper.

This will be used to allow adding fanotify support to filesystems that
do not support NFS export.

Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230502124817.3070545-3-amir73il@gmail.com>
2023-05-22 18:08:37 +02:00
Amir Goldstein
b52878275c exportfs: change connectable argument to bit flags
Convert the bool connectable arguemnt into a bit flags argument and
define the EXPORT_FS_CONNECTABLE flag as a requested property of the
file handle.

We are going to add a flag for requesting non-decodeable file handles.

Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230502124817.3070545-2-amir73il@gmail.com>
2023-05-22 18:08:37 +02:00
Linus Torvalds
e2065b8c1b four ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRppOoACgkQiiy9cAdy
 T1GXBAv9FP5orZJKZ2yR+k/xAccodIPUlAx9ZcfBw9rV8dihny0RzOhafRm4FUln
 EuXoS+nWAxNiaOfLZDQ6PezzeVtYbNvlx5EOZ3tZt2I4tb65hdgdiP9axgo6KtfY
 dXMH+Ml2wNxgey9HOfDzDnxdGpBXiNaKlIMbBf0BdtTzvo+BNQulP21P/8SLJg11
 mbHj9XBouae5D7yakJlefq09wKgzolK5ZYqQyLSF2gpVPzQHB+m0zNXBaaHFQbdC
 7xHr+wPBLERyNnEW6F9WBZ9d5ayqdt+UE6HjxeQtnXzkQgrWHKMqJfdEcwjitYCN
 CNTpGdJGxoi7JjbJczPcG3bglJPpOPwbOdu7MTMvom/o4DhR8jrxjtv69k8Kt8ZH
 WSHsS/740psJFnRf9nY82DHEY1Hy27V/5xtLOjvV2C2nR/Z0KUDIR6/lWnpuWUyU
 is/pTbTFGOqQ6xtxnfIFgSx6aYRgbR1chljBzalPKtzuNLipyAKNePRBELYo9hko
 y+M7HtAQ
 =ZNmq
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

 - two fixes for incorrect SMB3 message validation (one for client which
   uses 8 byte padding, and one for empty bcc)

 - two fixes for out of bounds bugs: one for username offset checks (in
   session setup) and the other for create context name length checks in
   open requests

* tag '6.4-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: smb2: Allow messages padded to 8byte boundary
  ksmbd: allocate one more byte for implied bcc[0]
  ksmbd: fix wrong UserName check in session_user
  ksmbd: fix global-out-of-bounds in smb2_find_context_vals
2023-05-21 10:55:31 -07:00
Linus Torvalds
0c9dcf128e 2 smb3 client fixes, both related to deferred close, and also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRpn+4ACgkQiiy9cAdy
 T1F07gv/dvtE23DaAsTtOsXMzc2fQ9jyQiexgUUMWjYWeWJS06r2o3QMWsSV86QT
 z645h6jYgUBeuWVFPF/h0WYjGn/C35Fy08SRuNSReNNahYbNh0A5fe+ic8AoA+f1
 LWQYOqRkAaZdcfuOP2Cg2OiNDswxLln4L0eTlJu7Hrdi/xUM5qa66VmFfvfVsu3/
 nUlV9KGV6lVoEJbD2Oy+9pfB/2ltgmauQqofXAh35BHSah8Q5U2E2QHHhyMwRBBc
 qSINxSoNDDyoW5sCXxzgBPH23lzlMNo0tHVRSqPMtLypzoehzwHmkFJVuGv2F82n
 Mj+pMD7As4d7/82IpmCMkhkOcUCRLa/d3gHqZMZVCFSXJ8tpTbRTBiiervJ3/94M
 IYfZiBuKy6z2mYdE8sW0zXCXzYE9+iAgySER5Ey2IXlbCSN7N81lV2KE8E4jjKhM
 Qoe5DL/AGSjDW0RFSOC7PPRpOqpc//PV2JpPmoYodV1i1nWq5dC1DhQcbXjg/r7c
 0fABdS0y
 =hi0y
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Two smb3 client fixes, both related to deferred close, and also for
  stable:

   - send close for deferred handles before not after lease break
     response to avoid possible sharing violations

   - check all opens on an inode (looking for deferred handles) when
     lease break is returned not just the handle the lease break came in
     on"

* tag '6.4-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  SMB3: drop reference to cfile before sending oplock break
  SMB3: Close all deferred handles of inode in case of handle lease break
2023-05-21 10:20:58 -07:00
Christoph Hellwig
bda2795a63 fs: remove the special !CONFIG_BLOCK def_blk_fops
def_blk_fops always returns -ENODEV, which dosn't match the return value
of a non-existing block device with CONFIG_BLOCK, which is -ENXIO.
Just remove the extra implementation and fall back to the default
no_open_fops that always returns -ENXIO.

Fixes: 9361401eb7 ("[PATCH] BLOCK: Make it possible to disable the block layer [try #6]")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230508144405.41792-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19 19:48:53 -06:00
Anna Schumaker
43439d858b NFSv4.2: Fix a potential double free with READ_PLUS
kfree()-ing the scratch page isn't enough, we also need to set the pointer
back to NULL to avoid a double-free in the case of a resend.

Fixes: fbd2a05f29 (NFSv4.2: Rework scratch handling for READ_PLUS)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 17:11:59 -04:00
Fabio M. De Francesco
4b71e2416e NFS: Convert kmap_atomic() to kmap_local_folio()
kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().

Therefore, replace kmap_atomic() with kmap_local_folio() in
nfs_readdir_folio_array_append().

kmap_atomic() disables page-faults and preemption (the latter only for
!PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
nfs_readdir_folio_array_append() does not depend on the above-mentioned
side effects.

Therefore, a mere replacement of the old API with the new one is all that
is required (i.e., there is no need to explicitly add any calls to
pagefault_disable() and/or preempt_disable()).

Tested with (x)fstests in a QEMU/KVM x86_32 VM, 6GB RAM, booting a kernel
with HIGHMEM64GB enabled.

Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Fixes: ec108d3cc7 ("NFS: Convert readdir page array functions to use a folio")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19 16:50:05 -04:00
Linus Torvalds
a594874588 A workaround for a just discovered bug in MClientSnap encoding which
goes back to 2017 (marked for stable) and a fixup to quieten a static
 checker.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmRnnmQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+UGB/9b2jo9bvRJXm3Z9baTyCYGCLmpOMYB
 gUDAHY9iTZBWdxbk+YppCWyh20oXz1082DV6vMn2FBhFgv4um/7GXesoVMGin73n
 5w3YB8nBW0LeFsuuLMp+tnWnsIbYxEdVmNSe5lNZX16UVRW+GUBJeLPeiJrB2YCE
 NuCWw4SUxRDKU1cCHWIBjIz0qJmvbW+8U7f0OwPqk1e5QmoE9Fs44sfJ9aBX4ap7
 nlPWsoNX0fRixKNcsueBHLr4xEqYG0qqyvCiZnz3r59Zlcs2HwcfixBfNnJPjDeu
 3ijPm+mYjAT8Vg2mVwf2fCXAtdXlzX9+ULHZDp2VoD/0LB+E5ep08HAO
 =Vixp
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.4-rc3' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A workaround for a just discovered bug in MClientSnap encoding which
  goes back to 2017 (marked for stable) and a fixup to quieten a static
  checker"

* tag 'ceph-for-6.4-rc3' of https://github.com/ceph/ceph-client:
  ceph: force updating the msg pointer in non-split case
  ceph: silence smatch warning in reconnect_caps_cb()
2023-05-19 12:02:12 -07:00
Linus Torvalds
ac92c27935 s390 updates for 6.4-rc3
- Add check whether the required facilities are installed
   before using the s390-specific ChaCha20 implementation.
 
 - Key blobs for s390 protected key interface IOCTLs commands
   PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key
   material. Zeroize copies of these keys in kernel memory
   after creating protected keys.
 
 - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra
   overhead of initializing all stack variables by default.
 
 - Make sure that when a new channel-path is enabled all
   subchannels are evaluated: with and without any devices
   connected on it.
 
 - When SMT thread CPUs are added to CPU topology masks the
   nr_cpu_ids limit is not checked and could be exceeded.
   Respect the nr_cpu_ids limit and avoid a warning when
   CONFIG_DEBUG_PER_CPU_MAPS is set.
 
 - The pointer to IPL Parameter Information Block is stored
   in the absolute lowcore as a virtual address. Save it as
   the physical address for later use by dump tools.
 
 - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using
   QIOASSIST with dedicated (pass through) QDIO-based devices
   such as FCP, real OSA or HiperSockets.
 
 - s390's struct statfs and struct statfs64 contain padding,
   which field-by-field copying does not set. Initialize the
   respective structures with zeros before filling them and
   copying to userspace.
 
 - Grow s390 compat_statfs64, statfs and statfs64 structures
   f_spare array member to cover padding and simplify things.
 
 - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs.
 
 - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZGd5BRccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8OqMAQCsdBG7eR3dp3mY8ao34dqlWt98
 rDQD8oiMgCkFyn77jQEAoo3HhqWY8oTu88fl82dkF0OpGW+7zgoNHUYhH8Z0gAY=
 =wtTO
 -----END PGP SIGNATURE-----

Merge tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Alexander Gordeev:

 - Add check whether the required facilities are installed before using
   the s390-specific ChaCha20 implementation

 - Key blobs for s390 protected key interface IOCTLs commands
   PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material.
   Zeroize copies of these keys in kernel memory after creating
   protected keys

 - Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of
   initializing all stack variables by default

 - Make sure that when a new channel-path is enabled all subchannels are
   evaluated: with and without any devices connected on it

 - When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids
   limit is not checked and could be exceeded. Respect the nr_cpu_ids
   limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set

 - The pointer to IPL Parameter Information Block is stored in the
   absolute lowcore as a virtual address. Save it as the physical
   address for later use by dump tools

 - Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST
   with dedicated (pass through) QDIO-based devices such as FCP, real
   OSA or HiperSockets

 - s390's struct statfs and struct statfs64 contain padding, which
   field-by-field copying does not set. Initialize the respective
   structures with zeros before filling them and copying to userspace

 - Grow s390 compat_statfs64, statfs and statfs64 structures f_spare
   array member to cover padding and simplify things

 - Remove obsolete SCHED_BOOK and SCHED_DRAWER configs

 - Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs

* tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU
  s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER}
  s390/uapi: cover statfs padding by growing f_spare
  statfs: enforce statfs[64] structure initialization
  s390/qdio: fix do_sqbs() inline assembly constraint
  s390/ipl: fix IPIB virtual vs physical address confusion
  s390/topology: honour nr_cpu_ids when adding CPUs
  s390/cio: include subchannels without devices also for evaluation
  s390/defconfigs: set CONFIG_INIT_STACK_NONE=y
  s390/pkey: zeroize key blobs
  s390/crypto: use vector instructions only if available for ChaCha20
2023-05-19 11:11:04 -07:00
Shaomin Deng
6405fee9b0
ntfs: Remove unneeded semicolon
Remove the unneeded semicolon after curly braces.

Signed-off-by: Shaomin Deng <dengshaomin@cdjrlc.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Message-Id: <20221105153135.5975-1-dengshaomin@cdjrlc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 12:00:57 +02:00
Deming Wang
253f3137eb
ntfs: Correct spelling
We should use this replace thie.

Signed-off-by: Deming Wang <wangdeming@inspur.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Message-Id: <20230206091815.1687-1-wangdeming@inspur.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 11:59:38 +02:00
Colin Ian King
04faa6cfd4
ntfs: remove redundant initialization to pointer cb_sb_start
The pointer cb_sb_start is being initialized with a value that is never
read, it is being re-assigned the same value later on when it is first
being used. The initialization is redundant and can be removed.

Cleans up clang scan build warning:
fs/ntfs/compress.c:164:6: warning: Value stored to 'cb_sb_start' during its initialization is never read [deadcode.DeadStores]
        u8 *cb_sb_start = cb;   /* Beginning of the current sb in the cb. */

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Message-Id: <20230418153607.3125704-1-colin.i.king@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 11:58:14 +02:00
Christian Brauner
6ac3928156
fs: allow to mount beneath top mount
Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].

System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.

When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.

System configuration images are similar but operate on directories
containing system or service configuration.

On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):

       TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /       /       ext4    shared:1     29      1

On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.

Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.

In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:

       TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
       /       /       ext4    shared:24 master:1  71      47

For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.

A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.

For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:

       TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /home/ubuntu/debian-tree  /       ext4    shared:99    61      60

So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:

       TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
       /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68

Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.

Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.

The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.

The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.

The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.

Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:

(1) When a service or container is started in a new mount namespace and
    pivot_root()s into its new rootfs. The way this is done is by
    mounting the new rootfs beneath the old rootfs:

            fd_newroot = open("/var/lib/machines/fedora", ...);
            fd_oldroot = open("/", ...);
            fchdir(fd_newroot);
            pivot_root(".", ".");

    After the pivot_root(".", ".") call the new rootfs is mounted
    beneath the old rootfs which can then be unmounted to reveal the
    underlying mount:

            fchdir(fd_oldroot);
            umount2(".", MNT_DETACH);

    Since pivot_root() moves the caller into a new rootfs no mounts must
    be propagated out of the new rootfs as a consequence of the
    pivot_root() call. Thus, the mounts cannot be shared.

(2) When a mount is propagated to a mount that already has another mount
    mounted on the same dentry.

    The easiest example for this is to create a new mount namespace. The
    following commands will create a mount namespace where the rootfs
    mount / will be a slave to the peer group of the host rootfs /
    mount's peer group. IOW, it will receive propagation from the host:

            mount --make-shared /
            unshare --mount --propagation=slave

    Now a new mount on the /mnt dentry in that mount namespace is
    created. (As it can be confusing it should be spelled out that the
    tmpfs mount on the /mnt dentry that was just created doesn't
    propagate back to the host because the rootfs mount / of the mount
    namespace isn't a peer of the host rootfs.):

            mount -t tmpfs tmpfs /mnt

            TARGET  SOURCE  FSTYPE  PROPAGATION
            └─/mnt  tmpfs   tmpfs

    Now another terminal in the host mount namespace can observe that
    the mount indeed hasn't propagated back to into the host mount
    namespace. A new mount can now be created on top of the /mnt dentry
    with the rootfs mount / as its parent:

            mount --bind /opt /mnt

            TARGET  SOURCE           FSTYPE  PROPAGATION
            └─/mnt  /dev/sda2[/opt]  ext4    shared:1

    The mount namespace that was created earlier can now observe that
    the bind mount created on the host has propagated into it:

            TARGET    SOURCE           FSTYPE  PROPAGATION
            └─/mnt    /dev/sda2[/opt]  ext4    master:1
              └─/mnt  tmpfs            tmpfs

    But instead of having been mounted on top of the tmpfs mount at the
    /mnt dentry the /opt mount has been mounted on top of the rootfs
    mount at the /mnt dentry. And the tmpfs mount has been remounted on
    top of the propagated /opt mount at the /opt dentry. So in other
    words, the propagated mount has been mounted beneath the preexisting
    mount in that mount namespace.

    Mount namespaces make this easy to illustrate but it's also easy to
    mount beneath an existing mount in the same mount namespace
    (The following example assumes a shared rootfs mount / with peer
     group id 1):

            mount --bind /opt /opt

            TARGET   SOURCE          FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt  /dev/sda2[/opt]  ext4    188     29         shared:1

    If another mount is mounted on top of the /opt mount at the /opt
    dentry:

            mount --bind /tmp /opt

    The following clunky mount tree will result:

            TARGET      SOURCE           FSTYPE  MNT_ID  PARENT_ID  PROPAGATION
            └─/opt      /dev/sda2[/tmp]  ext4    405      29        shared:1
              └─/opt    /dev/sda2[/opt]  ext4    188     405        shared:1
                └─/opt  /dev/sda2[/tmp]  ext4    404     188        shared:1

    The /tmp mount is mounted beneath the /opt mount and another copy is
    mounted on top of the /opt mount. This happens because the rootfs /
    and the /opt mount are shared mounts in the same peer group.

    When the new /tmp mount is supposed to be mounted at the /opt dentry
    then the /tmp mount first propagates to the root mount at the /opt
    dentry. But there already is the /opt mount mounted at the /opt
    dentry. So the old /opt mount at the /opt dentry will be mounted on
    top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
    is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
    is /opt which is what shows up as /opt under SOURCE). So again, a
    mount will be mounted beneath a preexisting mount.

    (Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
     shared rootfs is a good example of what could be referred to as
     mount explosion.)

The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.

In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.

This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.

These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.

More common use-cases will just be things like:

        mount -t btrfs /dev/sdA /mnt
        mount -t xfs   /dev/sdB --beneath /mnt
        umount /mnt

after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.

The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.

This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:

* Mounts cannot be mounted beneath the rootfs. This restriction
  encompasses the rootfs but also chroots via chroot() and pivot_root().
  To mount a mount beneath the rootfs or a chroot, pivot_root() can be
  used as illustrated above.
* The source mount must be a private mount to force the kernel to
  allocate a new, unused peer group id. This isn't a required
  restriction but a voluntary one. It avoids repeating a semantical
  quirk that already exists today. If bind mounts which already have a
  peer group id are inserted into mount trees that have the same peer
  group id this can cause a lot of mount propagation events to be
  generated (For example, consider running mount --bind /opt /opt in a
  loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
  need to be able to unmount the top mount themselves.
  This also avoids a good deal of additional complexity. The umount
  would have to be propagated which would be another rather expensive
  operation. So namespace_lock() and lock_mount_hash() would potentially
  have to be held for a long time for both a mount and umount
  propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
  and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
  could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
  parent mount and the destination mount's mountpoint on the parent
  mount. Of course, if the parent of the destination mount and the
  destination mount are shared mounts in the same peer group and the
  mountpoint of the new mount to be mounted is a subdir of their
  ->mnt_root then both will receive a mount of /opt. That's probably
  easier to understand with an example. Assuming a standard shared
  rootfs /:

          mount --bind /opt /opt
          mount --bind /tmp /opt

  will cause the same mount tree as:

          mount --bind /opt /opt
          mount --beneath /tmp /opt

  because both / and /opt are shared mounts/peers in the same peer
  group and the /opt dentry is a subdirectory of both the parent's and
  the child's ->mnt_root. If a mount tree like that is created it almost
  always is an accident or abuse of mount propagation. Realistically
  what most people probably mean in this scenarios is:

          mount --bind /opt /opt
          mount --make-private /opt
          mount --make-shared /opt

  This forces the allocation of a new separate peer group for the /opt
  mount. Aferwards a mount --bind or mount --beneath actually makes
  sense as the / and /opt mount belong to different peer groups. Before
  that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
  (1) the @mnt_from has been overmounted in between path resolution and
      acquiring @namespace_sem when locking @mnt_to. This avoids the
      proliferation of shadow mounts.
  (2) if @to_mnt is moved to a different mountpoint while acquiring
      @namespace_sem to lock @to_mnt.
  (3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
      @to_mnt.
  (4) if the parent of the target mount propagates to the target mount
      at the same mountpoint.
      This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
      propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
      whole purpose of mounting @mnt_from beneath @mnt_to.
  (5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
      the same mountpoint.
      If @mnt_to->mnt_parent propagates to @mnt_from this would mean
      propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
      @mnt_from would be mounted on top of @mnt_to->mnt_parent and
      @mnt_to would be unmounted from @mnt->mnt_parent and remounted on
      @mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
      would ultimately be remounted on top of @c. Afterwards, @mnt_from
      would be covered by a copy @c of @mnt_from and @c would be covered
      by @mnt_from itself. This defeats the whole purpose of mounting
      @mnt_from beneath @mnt_to.
  Cases (1) to (3) are required as they deal with races that would cause
  bugs or unexpected behavior for users. Cases (4) and (5) refuse
  semantical quirks that would not be a bug but would cause weird mount
  trees to be created. While they can already be created via other means
  (mount --bind /opt /opt x n) there's no reason to repeat past mistakes
  in new features.

Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 04:30:22 +02:00
Christian Brauner
64f44b27ae
fs: use a for loop when locking a mount
Currently, lock_mount() uses a goto to retry the lookup until it
succeeded in acquiring the namespace_lock() preventing the top mount
from being overmounted. While that's perfectly fine we want to lookup
the mountpoint on the parent of the top mount in later patches. So adapt
the code to make this easier to implement. Also, the for loop is
arguably a little cleaner and makes the code easier to follow. No
functional changes intended.

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-3-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 04:30:22 +02:00
Christian Brauner
104026c2e4
fs: properly document __lookup_mnt()
The comment on top of __lookup_mnt() states that it finds the first
mount implying that there could be multiple mounts mounted at the same
dentry with the same parent.

On older kernels "shadow mounts" could be created during mount
propagation. So if a mount @m in the destination propagation tree
already had a child mount @p mounted at @mp then any mount @n we
propagated to @m at the same @mp would be appended after the preexisting
mount @p in @mount_hashtable. This was a completely direct way of
creating shadow mounts.

That direct way is gone but there are still subtle ways to create shadow
mounts. For example, when attaching a source mnt @mnt to a shared mount.
The root of the source mnt @mnt might be overmounted by a mount @o after
we finished path lookup but before we acquired the namespace semaphore
to copy the source mount tree @mnt.

After we acquired the namespace lock @mnt is copied including @o
covering it. After we attach @mnt to a shared mount @dest_mnt we end up
propagation it to all it's peer and slaves @d. If @d already has a mount
@n mounted on top of it we tuck @mnt beneath @n. This means, we mount
@mnt at @d and mount @n on @mnt. Now we have both @o and @n mounted on
the same mountpoint at @mnt.

Explain this in the documentation as this is pretty subtle.

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-2-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 04:30:21 +02:00
Christian Brauner
78aa08a8ca
fs: add path_mounted()
Add a small helper to check whether a path refers to the root of the
mount instead of open-coding this everywhere.

Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-1-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-19 04:30:07 +02:00
Linus Torvalds
f4a8871f9f Eight hotfixes. Four are cc:stable, the other four are for post-6.4
issues, or aren't considered suitable for backporting.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZGasdgAKCRDdBJ7gKXxA
 jpTFAQC2WlV6CbEsy46jJK2XzCypzLLxHiRmVCw5pmAucki4awEAjllEuzK6vw61
 ytBZ/O2sMB5AbCf31c6UYxgLS32oyAo=
 =IDcO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-05-18-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Eight hotfixes. Four are cc:stable, the other four are for post-6.4
  issues, or aren't considered suitable for backporting"

* tag 'mm-hotfixes-stable-2023-05-18-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: Cleanup Arm Display IP maintainers
  MAINTAINERS: repair pattern in DIALOG SEMICONDUCTOR DRIVERS
  nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode()
  mm: fix zswap writeback race condition
  mm: kfence: fix false positives on big endian
  zsmalloc: move LRU update from zs_map_object() to zs_malloc()
  mm: shrinkers: fix race condition on debugfs cleanup
  maple_tree: make maple state reusable after mas_empty_area()
2023-05-18 17:06:04 -07:00
Xiubo Li
4cafd0400b ceph: force updating the msg pointer in non-split case
When the MClientSnap reqeust's op is not CEPH_SNAP_OP_SPLIT the
request may still contain a list of 'split_realms', and we need
to skip it anyway. Or it will be parsed as a corrupt snaptrace.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/61200
Reported-by: Frank Schilder <frans@dtu.dk>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-18 11:15:28 +02:00
Xiubo Li
9aaa7eb018 ceph: silence smatch warning in reconnect_caps_cb()
Smatch static checker warning:

  fs/ceph/mds_client.c:3968 reconnect_caps_cb()
  warn: missing error code here? '__get_cap_for_mds()' failed. 'err' = '0'

[ idryomov: Dan says that Smatch considers it intentional only if the
  "ret = 0;" assignment is within 4 or 5 lines of the goto. ]

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-05-18 11:15:28 +02:00
Ryusuke Konishi
9b5a04ac3a nilfs2: fix use-after-free bug of nilfs_root in nilfs_evict_inode()
During unmount process of nilfs2, nothing holds nilfs_root structure after
nilfs2 detaches its writer in nilfs_detach_log_writer().  However, since
nilfs_evict_inode() uses nilfs_root for some cleanup operations, it may
cause use-after-free read if inodes are left in "garbage_list" and
released by nilfs_dispose_list() at the end of nilfs_detach_log_writer().

Fix this issue by modifying nilfs_evict_inode() to only clear inode
without additional metadata changes that use nilfs_root if the file system
is degraded to read-only or the writer is detached.

Link: https://lkml.kernel.org/r/20230509152956.8313-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+78d4495558999f55d1da@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/00000000000099e5ac05fb1c3b85@google.com
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-17 15:24:34 -07:00
Bharath SM
59a556aebc SMB3: drop reference to cfile before sending oplock break
In cifs_oplock_break function we drop reference to a cfile at
the end of function, due to which close command goes on wire
after lease break acknowledgment even if file is already closed
by application but we had deferred the handle close.
If other client with limited file shareaccess waiting on lease
break ack proceeds operation on that file as soon as first client
sends ack, then we may encounter status sharing violation error
because of open handle.
Solution is to put reference to cfile(send close on wire if last ref)
and then send oplock acknowledgment to server.

Fixes: 9e31678fb4 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
Cc: stable@kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-17 12:04:41 -05:00
Bharath SM
47592fa8eb SMB3: Close all deferred handles of inode in case of handle lease break
Oplock break may occur for different file handle than the deferred
handle. Check for inode deferred closes list, if it's not empty then
close all the deferred handles of inode because we should not cache
handles if we dont have handle lease.

Eg: If openfilelist has one deferred file handle and another open file
handle from app for a same file, then on a lease break we choose the
first handle in openfile list. The first handle in list can be deferred
handle or actual open file handle from app. In case if it is actual open
handle then today, we don't close deferred handles if we lose handle lease
on a file. Problem with this is, later if app decides to close the existing
open handle then we still be caching deferred handles until deferred close
timeout. Leaving open handle may result in sharing violation when windows
client tries to open a file with limited file share access.

So we should check for deferred list of inode and walk through the list of
deferred files in inode and close all deferred files.

Fixes: 9e31678fb4 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
Cc: stable@kernel.org
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-17 12:04:31 -05:00
Linus Torvalds
1b66c114d1 nfsd-6.4 fixes:
- A collection of minor bug fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRiQVEACgkQM2qzM29m
 f5eYfBAAg5Qz45PL+fo1qWxkJ1ZKaNV1vPdi4tqCt9NEItDTjAnjj0am+rKNGZAz
 EOM2yFt4xaZGyMYgXe4VnYl0N+rSpbI/H+Rk/wOq4OHPURQD5EO9VeP86qZ7rmGl
 ECPqb39TFTwAiRomC/DHO4eNpoe1rQuXu0tW9+GmqDGxeuh8xdxTk33g17ZXwCFN
 tdPkkPjxVPdWd8X7HQg9kWm8AWfV+GyuzE2rKAoOjbs6Wv6d9GCY8Cb5HXkRsQhF
 4Zh0PVQuTuXurZwtPXwnS0k4kfvQwjlTIKHlXuo0ZLh+SuFbrWHzv0fVyD+kUpSK
 HtWbJ8JcruUvz0WGMtZatzRLHCZLDguV6oVXPp7rtmuxTj4szzHSFpEeAV901sIm
 Nkvuomvd02K/fiTo7s3yr6t1VG2vju9LDwhBe197iA3leHAlockfbbxE3NJMGbzQ
 NoOPd+lu95cfsanOM1LZZLNfbLrZofoSLK9K1+HD0yAVdyq7u47FyHRrymvCaMrj
 GiheuqrBfBMEq+2mCwUn37aM0FblYEXQM0xTVXPQcHtBBN/nGZxPJukmpr7ScNlR
 aqMtDoOLu4OEFuo6fe2/94eNi+N5XAZgWmx/mSyaytE8Xw9LJxeQ83UTigaGcYKc
 3YIuG1YXg9IyIoIdLkghB+Aj/6fivsGFK9Gud6g7I3xw4f15noA=
 =PiNG
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - A collection of minor bug fixes

* tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Remove open coding of string copy
  SUNRPC: Fix trace_svc_register() call site
  SUNRPC: always free ctxt when freeing deferred request
  SUNRPC: double free xprt_ctxt while still in use
  SUNRPC: Fix error handling in svc_setup_socket()
  SUNRPC: Fix encoding of accepted but unsuccessful RPC replies
  lockd: define nlm_port_min,max with CONFIG_SYSCTL
  nfsd: define exports_proc_ops with CONFIG_PROC_FS
  SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17 09:56:01 -07:00
Anisse Astier
d86ff3333c efivarfs: expose used and total size
When writing EFI variables, one might get errors with no other message
on why it fails. Being able to see how much is used by EFI variables
helps analyzing such issues.

Since this is not a conventional filesystem, block size is intentionally
set to 1 instead of PAGE_SIZE.

x86 quirks of reserved size are taken into account; so that available
and free size can be different, further helping debugging space issues.

With this patch, one can see the remaining space in EFI variable storage
via efivarfs, like this:

   $ df -h /sys/firmware/efi/efivars/
   Filesystem      Size  Used Avail Use% Mounted on
   efivarfs        176K  106K   66K  62% /sys/firmware/efi/efivars

Signed-off-by: Anisse Astier <an.astier@criteo.com>
[ardb: - rename efi_reserved_space() to efivar_reserved_space()
       - whitespace/coding style tweaks]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-05-17 18:21:34 +02:00
Jeff Layton
3a7bb21b6f
fs: don't call posix_acl_listxattr in generic_listxattr
Commit f2620f166e caused the kernel to start emitting POSIX ACL xattrs
for NFSv4 inodes, which it doesn't support. The only other user of
generic_listxattr is HFS (classic) and it doesn't support POSIX ACLs
either.

Fixes: f2620f166e xattr: simplify listxattr helpers
Reported-by: Ondrej Valousek <ondrej.valousek.xm@renesas.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230516124655.82283-1-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17 15:25:20 +02:00
Ilya Leoshkevich
ed40866ec7 statfs: enforce statfs[64] structure initialization
s390's struct statfs and struct statfs64 contain padding, which
field-by-field copying does not set. Initialize the respective structs
with zeros before filling them and copying them to userspace, like it's
already done for the compat versions of these structs.

Found by KMSAN.

[agordeev@linux.ibm.com: fixed typo in patch description]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20230504144021.808932-2-iii@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-05-17 15:20:17 +02:00
Josef Bacik
597441b343 btrfs: use nofs when cleaning up aborted transactions
Our CI system caught a lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.3.0-rc7+ #1167 Not tainted
  ------------------------------------------------------
  kswapd0/46 is trying to acquire lock:
  ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120

  but task is already holding lock:
  ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0xa5/0xe0
	 kmem_cache_alloc+0x31/0x2c0
	 alloc_extent_state+0x1d/0xd0
	 __clear_extent_bit+0x2e0/0x4f0
	 try_release_extent_mapping+0x216/0x280
	 btrfs_release_folio+0x2e/0x90
	 invalidate_inode_pages2_range+0x397/0x470
	 btrfs_cleanup_dirty_bgs+0x9e/0x210
	 btrfs_cleanup_one_transaction+0x22/0x760
	 btrfs_commit_transaction+0x3b7/0x13a0
	 create_subvol+0x59b/0x970
	 btrfs_mksubvol+0x435/0x4f0
	 __btrfs_ioctl_snap_create+0x11e/0x1b0
	 btrfs_ioctl_snap_create_v2+0xbf/0x140
	 btrfs_ioctl+0xa45/0x28f0
	 __x64_sys_ioctl+0x88/0xc0
	 do_syscall_64+0x38/0x90
	 entry_SYSCALL_64_after_hwframe+0x72/0xdc

  -> #0 (sb_internal#2){++++}-{0:0}:
	 __lock_acquire+0x1435/0x21a0
	 lock_acquire+0xc2/0x2b0
	 start_transaction+0x401/0x730
	 btrfs_commit_inode_delayed_inode+0x5f/0x120
	 btrfs_evict_inode+0x292/0x3d0
	 evict+0xcc/0x1d0
	 inode_lru_isolate+0x14d/0x1e0
	 __list_lru_walk_one+0xbe/0x1c0
	 list_lru_walk_one+0x58/0x80
	 prune_icache_sb+0x39/0x60
	 super_cache_scan+0x161/0x1f0
	 do_shrink_slab+0x163/0x340
	 shrink_slab+0x1d3/0x290
	 shrink_node+0x300/0x720
	 balance_pgdat+0x35c/0x7a0
	 kswapd+0x205/0x410
	 kthread+0xf0/0x120
	 ret_from_fork+0x29/0x50

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(sb_internal#2);
				 lock(fs_reclaim);
    lock(sb_internal#2);

   *** DEADLOCK ***

  3 locks held by kswapd0/46:
   #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
   #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290
   #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0

  stack backtrace:
  CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x58/0x90
   check_noncircular+0xd6/0x100
   ? save_trace+0x3f/0x310
   ? add_lock_to_list+0x97/0x120
   __lock_acquire+0x1435/0x21a0
   lock_acquire+0xc2/0x2b0
   ? btrfs_commit_inode_delayed_inode+0x5f/0x120
   start_transaction+0x401/0x730
   ? btrfs_commit_inode_delayed_inode+0x5f/0x120
   btrfs_commit_inode_delayed_inode+0x5f/0x120
   btrfs_evict_inode+0x292/0x3d0
   ? lock_release+0x134/0x270
   ? __pfx_wake_bit_function+0x10/0x10
   evict+0xcc/0x1d0
   inode_lru_isolate+0x14d/0x1e0
   __list_lru_walk_one+0xbe/0x1c0
   ? __pfx_inode_lru_isolate+0x10/0x10
   ? __pfx_inode_lru_isolate+0x10/0x10
   list_lru_walk_one+0x58/0x80
   prune_icache_sb+0x39/0x60
   super_cache_scan+0x161/0x1f0
   do_shrink_slab+0x163/0x340
   shrink_slab+0x1d3/0x290
   shrink_node+0x300/0x720
   balance_pgdat+0x35c/0x7a0
   kswapd+0x205/0x410
   ? __pfx_autoremove_wake_function+0x10/0x10
   ? __pfx_kswapd+0x10/0x10
   kthread+0xf0/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x29/0x50
   </TASK>

This happens because when we abort the transaction in the transaction
commit path we call invalidate_inode_pages2_range on our block group
cache inodes (if we have space cache v1) and any delalloc inodes we may
have.  The plain invalidate_inode_pages2_range() call passes through
GFP_KERNEL, which makes sense in most cases, but not here.  Wrap these
two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to
make sure we don't end up with the fs reclaim dependency under the
transaction dependency.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 13:08:28 +02:00
Johannes Thumshirn
806570c0bb btrfs: handle memory allocation failure in btrfs_csum_one_bio
Since f8a53bb58e ("btrfs: handle checksum generation in the storage
layer") the failures of btrfs_csum_one_bio() are handled via
bio_end_io().

This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in
case the allocation of the ordered sums fails.

This also fixes a syzkaller report, where injecting a failure into the
kvzalloc() call results in a BUG_ON().

Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 13:08:28 +02:00
Qu Wenruo
7561551e7b btrfs: scrub: try harder to mark RAID56 block groups read-only
Currently we allow a block group not to be marked read-only for scrub.

But for RAID56 block groups if we require the block group to be
read-only, then we're allowed to use cached content from scrub stripe to
reduce unnecessary RAID56 reads.

So this patch would:

- Make btrfs_inc_block_group_ro() try harder
  During my tests, for cases like btrfs/061 and btrfs/064, we can hit
  ENOSPC from btrfs_inc_block_group_ro() calls during scrub.

  The reason is if we only have one single data chunk, and trying to
  scrub it, we won't have any space left for any newer data writes.

  But this check should be done by the caller, especially for scrub
  cases we only temporarily mark the chunk read-only.
  And newer data writes would always try to allocate a new data chunk
  when needed.

- Return error for scrub if we failed to mark a RAID56 chunk read-only

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-17 11:59:46 +02:00
Arnd Bergmann
df67cb4c58 fs: d_path: include internal.h
make W=1 warns about a missing prototype that is defined but
not visible at point where simple_dname() is defined:

fs/d_path.c:317:7: error: no previous prototype for 'simple_dname' [-Werror=missing-prototypes]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Message-Id: <20230516195444.551461-1-arnd@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17 09:16:59 +02:00
Vladimir Sementsov-Ogievskiy
88e4607034 coredump: require O_WRONLY instead of O_RDWR
The motivation for this patch has been to enable using a stricter
apparmor profile to prevent programs from reading any coredump in the
system.

However, this became something else. The following details are based on
Christian's and Linus' archeology into the history of the number "2" in
the coredump handling code.

To make sure we're not accidently introducing some subtle behavioral
change into the coredump code we set out on a voyage into the depths of
history.git to figure out why this was O_RDWR in the first place.

Coredump handling was introduced over 30 years ago in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)").
The original code used O_WRONLY:

    open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)

However, this changed in 1993 and starting with commit
9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code
suddenly used the constant "2":

    open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)

This was curious as in the same commit the kernel switched from
constants to proper defines in other places such as KERNEL_DS and
USER_DS and O_RDWR did already exist.

So why was "2" used? It turns out that open_namei() - an early version
of what later turned into filp_open() - didn't accept O_RDWR.

A semantic quirk of the open() uapi is the definition of the O_RDONLY
flag. It would seem natural to define:

    #define O_RDWR (O_RDONLY | O_WRONLY)

but that isn't possible because:

    #define O_RDONLY 0

This makes O_RDONLY effectively meaningless when passed to the kernel.
In other words, there has never been a way - until O_PATH at least - to
open a file without any permission; O_RDONLY was always implied on the
uapi side while the kernel does in fact allow opening files without
permissions.

The trouble comes when trying to map the uapi flags onto the
corresponding file mode flags FMODE_{READ,WRITE}. This mapping still
happens today and is causing issues to this day (We ran into this
during additions for openat2() for example.).

So the special value "3" was used to indicate that the file was opened
for special access:

    f->f_flags = flag = flags;
    f->f_mode = (flag+1) & O_ACCMODE;
    if (f->f_mode)
            flag++;

This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping
the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The
special access then required read-write permissions and 0 was used to
access symlinks.

But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added
coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an
argument. So the coredump handling introduced in
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because
O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but
open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on
accident.

So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix
for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which
was correct.

The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2"
value was passed.

Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code
was changed to:

    -       dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
    ...
    +       file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);

At this point the raw "2" should have become O_WRONLY again as
filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.

Another 17 years later, the code was changed again cementing the mistake
and making it almost impossible to detect when commit
378c6520e7 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories")
replaced the raw "2" with O_RDWR.

And now, here we are with this patch that sent us on a quest to answer
the big questions in life such as "Why are coredump files opened with
O_RDWR?" and "Is it safe to just use O_WRONLY?".

So with this commit we're reintroducing O_WRONLY again and bringing this
code back to its original state when it was first introduced in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru>
[brauner@kernel.org: completely rewritten commit message]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-17 09:13:44 +02:00
Fangrui Song
60592fb6b6 coredump, vmcore: Set p_align to 4 for PT_NOTE
Tools like readelf/llvm-readelf use p_align to parse a PT_NOTE program
header as an array of 4-byte entries or 8-byte entries. Currently, there
are workarounds[1] in place for Linux to treat p_align==0 as 4. However,
it would be more appropriate to set the correct alignment so that tools
do not have to rely on guesswork. FreeBSD coredumps set p_align to 4 as
well.

[1]: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=82ed9683ec099d8205dc499ac84febc975235af6
[2]: https://reviews.llvm.org/D150022

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230512022528.3430327-1-maskray@google.com
2023-05-16 14:32:00 -07:00
Gustav Johansson
e7b8b8ed99 ksmbd: smb2: Allow messages padded to 8byte boundary
clc length is now accepted to <= 8 less than length,
rather than < 8.

Solve issues on some of Axis's smb clients which send
messages where clc length is 8 bytes less than length.

The specific client was running kernel 4.19.217 with
smb dialect 3.0.2 on armv7l.

Cc: stable@vger.kernel.org
Signed-off-by: Gustav Johansson <gustajo@axis.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16 10:26:14 -05:00
Chih-Yen Chang
443d61d1fa ksmbd: allocate one more byte for implied bcc[0]
ksmbd_smb2_check_message allows client to return one byte more, so we
need to allocate additional memory in ksmbd_conn_handler_loop to avoid
out-of-bound access.

Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16 10:26:14 -05:00
Chih-Yen Chang
f0a96d1aaf ksmbd: fix wrong UserName check in session_user
The offset of UserName is related to the address of security
buffer. To ensure the validaty of UserName, we need to compare name_off
+ name_len with secbuf_len instead of auth_msg_len.

[   27.096243] ==================================================================
[   27.096890] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x188/0x350
[   27.097609] Read of size 2 at addr ffff888005e3b542 by task kworker/0:0/7
...
[   27.099950] Call Trace:
[   27.100194]  <TASK>
[   27.100397]  dump_stack_lvl+0x33/0x50
[   27.100752]  print_report+0xcc/0x620
[   27.102305]  kasan_report+0xae/0xe0
[   27.103072]  kasan_check_range+0x35/0x1b0
[   27.103757]  smb_strndup_from_utf16+0x188/0x350
[   27.105474]  smb2_sess_setup+0xaf8/0x19c0
[   27.107935]  handle_ksmbd_work+0x274/0x810
[   27.108315]  process_one_work+0x419/0x760
[   27.108689]  worker_thread+0x2a2/0x6f0
[   27.109385]  kthread+0x160/0x190
[   27.110129]  ret_from_fork+0x1f/0x30
[   27.110454]  </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16 10:26:14 -05:00
Chih-Yen Chang
02f76c401d ksmbd: fix global-out-of-bounds in smb2_find_context_vals
Add tag_len argument in smb2_find_context_vals() to avoid out-of-bound
read when create_context's name_len is larger than tag length.

[    7.995411] ==================================================================
[    7.995866] BUG: KASAN: global-out-of-bounds in memcmp+0x83/0xa0
[    7.996248] Read of size 8 at addr ffffffff8258d940 by task kworker/0:0/7
...
[    7.998191] Call Trace:
[    7.998358]  <TASK>
[    7.998503]  dump_stack_lvl+0x33/0x50
[    7.998743]  print_report+0xcc/0x620
[    7.999458]  kasan_report+0xae/0xe0
[    7.999895]  kasan_check_range+0x35/0x1b0
[    8.000152]  memcmp+0x83/0xa0
[    8.000347]  smb2_find_context_vals+0xf7/0x1e0
[    8.000635]  smb2_open+0x1df2/0x43a0
[    8.006398]  handle_ksmbd_work+0x274/0x810
[    8.006666]  process_one_work+0x419/0x760
[    8.006922]  worker_thread+0x2a2/0x6f0
[    8.007429]  kthread+0x160/0x190
[    8.007946]  ret_from_fork+0x1f/0x30
[    8.008181]  </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Chih-Yen Chang <cc85nod@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-16 10:26:14 -05:00
Ritesh Harjani (IBM)
6e335cd789 ext2: Add direct-io trace points
This patch adds the trace point to ext2 direct-io apis
in fs/ext2/file.c

Here is how the output looks like

        a.out-467865 [006]  6758.170968: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0
        a.out-467865 [006]  6758.171061: ext2_dio_write_end:   dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT|WRITE aio 1 ret -529
kworker/3:153-444162 [003]  6758.171252: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0
        a.out-468222 [001]  6761.628924: ext2_dio_read_begin:  dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 1 ret 0
        a.out-468222 [001]  6761.629063: ext2_dio_read_end:    dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT aio 1 ret -529
        a.out-468428 [005]  6763.937454: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
        a.out-468428 [005]  6763.937829: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
        a.out-468428 [005]  6763.937847: ext2_dio_write_end:   dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096
        a.out-468609 [000]  6765.702878: ext2_dio_read_begin:  dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
        a.out-468609 [000]  6765.703243: ext2_dio_read_end:    dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096

Reported-and-tested-by: Disha Goel <disgoel@linux.ibm.com>
[Need to add CFLAGS_trace for fixing unable to find trace file problem]
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <b8b0897fa2b273a448d7b4ba7317357ac73c08bc.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Ritesh Harjani (IBM)
fb5de4358e ext2: Move direct-io to use iomap
This patch converts ext2 direct-io path to iomap interface.
- This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK
  from ext2_iomap_begin(), in case if the write is done on a hole.
- This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of
  a partial write or if any error is detected in ext2_iomap_end().
  We try to return -ENOTBLK in such cases.
- For any unaligned or extending DIO writes, we pass
  IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes.
- For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because
  otherwise with dsync writes on devices that support FUA, generic_write_sync
  won't be called and we might miss inode metadata updates.
- Since ext2 already now uses _nolock vartiant of sync write. Hence
  there is no inode lock problem with iomap in this patch.
- ext2_iomap_ops are now being shared by DIO, DAX & fiemap path

Tested-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <610b672a52f2a7ff6dc550fd14d0f995806232a5.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Ritesh Harjani (IBM)
d053070425 ext2: Use generic_buffers_fsync() implementation
Next patch converts ext2 to use iomap interface for DIO.
iomap layer can call generic_write_sync() -> ext2_fsync() from
iomap_dio_complete while still holding the inode_lock().

Now writeback from other paths doesn't need inode_lock().
It seems there is also no need of an inode_lock() for
sync_mapping_buffers(). It uses it's own mapping->private_lock
for it's buffer list handling.
Hence this patch is in preparation to move ext2 to iomap.
This uses generic_buffers_fsync() which does not take any inode_lock()
in ext2_fsync().

Tested-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <76d206a464574ff91db25bc9e43479b51ca7e307.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Ritesh Harjani (IBM)
5b5b4ff8f9 ext4: Use generic_buffers_fsync_noflush() implementation
ext4 when got converted to iomap for dio, it copied __generic_file_fsync
implementation to avoid taking inode_lock in order to avoid any deadlock
(since iomap takes an inode_lock while calling generic_write_sync()).

The previous patch already added generic_buffers_fsync*() which does not
take any inode_lock(). Hence kill the redundant code and use
generic_buffers_fsync_noflush() function instead.

Tested-by: Disha Goel <disgoel@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <b43d4bb4403061ed86510c9587673e30a461ba14.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Ritesh Harjani (IBM)
31b2ebc092 fs/buffer.c: Add generic_buffers_fsync*() implementation
Some of the higher layers like iomap takes inode_lock() when calling
generic_write_sync().
Also writeback already happens from other paths without inode lock,
so it's difficult to say that we really need sync_mapping_buffers() to
take any inode locking here. Having said that, let's add
generic_buffers_fsync/_noflush() implementation in buffer.c with no
inode_lock/unlock() for now so that filesystems like ext2 and
ext4's nojournal mode can use it.

Ext4 when got converted to iomap for direct-io already copied it's own
variant of __generic_file_fsync() without lock.

This patch adds generic_buffers_fsync()
& generic_buffers_fsync_noflush() implementations for use in filesystems
like ext2 & ext4 respectively.

Later we can review other filesystems as well to see if we can make
generic_buffers_fsync/_noflush() which does not take any inode_lock() as
the default path.

Tested-by: Disha Goel <disgoel@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <d573408ac8408627d23a3d2d166e748c172c4c9e.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Ritesh Harjani (IBM)
fcced95b6b ext2/dax: Fix ext2_setsize when len is page aligned
PAGE_ALIGN(x) macro gives the next highest value which is multiple of
pagesize. But if x is already page aligned then it simply returns x.
So, if x passed is 0 in dax_zero_range() function, that means the
length gets passed as 0 to ->iomap_begin().

In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on
here in ext2_get_blocks().
	BUG_ON(maxblocks == 0);

Instead we should be calling dax_truncate_page() here which takes
care of it. i.e. it only calls dax_zero_range if the offset is not
page/block aligned.

This can be easily triggered with following on fsdax mounted pmem
device.

dd if=/dev/zero of=file count=1 bs=512
truncate -s 0 file

[79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff)
[93.793207] ------------[ cut here ]------------
[93.795102] kernel BUG at fs/ext2/inode.c:637!
[93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139
[93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610
<...>
[93.835298] Call Trace:
[93.836253]  <TASK>
[93.837103]  ? lock_acquire+0xf8/0x110
[93.838479]  ? d_lookup+0x69/0xd0
[93.839779]  ext2_iomap_begin+0xa7/0x1c0
[93.841154]  iomap_iter+0xc7/0x150
[93.842425]  dax_zero_range+0x6e/0xa0
[93.843813]  ext2_setsize+0x176/0x1b0
[93.845164]  ext2_setattr+0x151/0x200
[93.846467]  notify_change+0x341/0x4e0
[93.847805]  ? lock_acquire+0xf8/0x110
[93.849143]  ? do_truncate+0x74/0xe0
[93.850452]  ? do_truncate+0x84/0xe0
[93.851739]  do_truncate+0x84/0xe0
[93.852974]  do_sys_ftruncate+0x2b4/0x2f0
[93.854404]  do_syscall_64+0x3f/0x90
[93.855789]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

CC: stable@vger.kernel.org
Fixes: 2aa3048e03 ("iomap: switch iomap_zero_range to use iomap_iter")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <046a58317f29d9603d1068b2bbae47c2332c17ae.1682069716.git.ritesh.list@gmail.com>
2023-05-16 11:32:42 +02:00
Azeem Shaikh
21a3f33289 NFSD: Remove open coding of string copy
Instead of open coding a __dynamic_array(), use the __string() and
__assign_str() helper macros that exist for this kind of use case.

Part of an effort to remove deprecated strlcpy() [1] completely from the
kernel[2].

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Fixes: 3c92fba557 ("NFSD: Enhance the nfsd_cb_setup tracepoint")
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-15 08:03:04 -04:00
Min-Hua Chen
cedd0bdc16 fs: fix incorrect fmode_t casts
Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY to fixes
the following sparce warnings:
fs/overlayfs/file.c:48:37: sparse: warning: restricted fmode_t degrades to integer
fs/overlayfs/file.c:128:13: sparse: warning: restricted fmode_t degrades to integer
fs/open.c:1159:21: sparse: warning: restricted fmode_t degrades to integer

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Message-Id: <20230502232210.119063-1-minhuadotchen@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 13:14:01 +02:00
Fabian Frederick
1168f09541 jffs2: reduce stack usage in jffs2_build_xattr_subsystem()
Use kcalloc() for allocation/flush of 128 pointers table to
reduce stack usage.

Function now returns -ENOMEM or 0 on success.

stackusage
Before:
./fs/jffs2/xattr.c:775  jffs2_build_xattr_subsystem     1208
dynamic,bounded

After:
./fs/jffs2/xattr.c:775  jffs2_build_xattr_subsystem     192
dynamic,bounded

Also update definition when CONFIG_JFFS2_FS_XATTR is not enabled

Tested with an MTD mount point and some user set/getfattr.

Many current target on OpenWRT also suffer from a compilation warning
(that become an error with CONFIG_WERROR) with the following output:

fs/jffs2/xattr.c: In function 'jffs2_build_xattr_subsystem':
fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
  887 | }
      | ^

Using dynamic allocation fix this compilation warning.

Fixes: c9f700f840 ("[JFFS2][XATTR] using 'delete marker' for xdatum/xref deletion")
Reported-by: Tim Gardner <tim.gardner@canonical.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Ron Economos <re@w6rz.net>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org
Message-Id: <20230506045612.16616-1-ansuelsmth@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 12:43:15 +02:00
Anuradha Weeraman
55650b2fdd fs/open.c: Fix W=1 kernel doc warnings
fs/open.c: In functions 'setattr_vfsuid' and 'setattr_vfsgid':
 warning: Function parameter or member 'attr' not described
 - Fix warning by removing kernel-doc for these as they are static
   inline functions and not required to be exposed via kernel-doc.

fs/open.c:
 warning: Excess function parameter 'opened' description in 'finish_open'
 warning: Excess function parameter 'cred' description in 'vfs_open'
 - Fix by removing the parameters from the kernel-doc as they are no
   longer required by the function.

Signed-off-by: Anuradha Weeraman <anuradha@debian.org>
Message-Id: <20230506182928.384105-1-anuradha@debian.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 12:39:07 +02:00
Azeem Shaikh
c642256b91 vfs: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-Id: <20230510221119.3508930-1-azeemshaikh38@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 09:42:01 +02:00
Min-Hua Chen
38f1755a3e fs: use correct __poll_t type
Fix the following sparse warnings by using __poll_t instead
of unsigned type.

fs/eventpoll.c:541:9: sparse: warning: restricted __poll_t degrades to integer
fs/eventfd.c:67:17: sparse: warning: restricted __poll_t degrades to integer

Signed-off-by: Min-Hua Chen <minhuadotchen@gmail.com>
Message-Id: <20230511164628.336586-1-minhuadotchen@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-15 09:33:09 +02:00
Linus Torvalds
bb7c241fae Some ext4 bug fixes (mostly to address Syzbot reports) for v6.4-rc2.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmRgCfAACgkQ8vlZVpUN
 gaOaOgf5AbFUBsjb95Aq2Y6SKvlyO2xFd2OqJXu6+bGaJScQ8qeoW2byihN4vD/e
 i5V5vivpk764k1uOUe9fq5BlkaTuvFJI8d81eEnJC3LW4s7r6Gv586dwbE5lr0Bq
 cZKCVMYdgwz3admGtPXrN0CVgg+Y/wHb1ZmGtt2nAqZfNqYfpX0waDyGr6JebhkO
 04VE8QQCvMkO6oOIR9ZfbJmVm5vrGqQVLW4T0hXVTj9r3gUu/61qAkt2XYAu5tKJ
 ENIoMv2ix0asAgFSbcIzY6YnCzSY9hiV/K6Twtusf63r22T+r6+LXBqUe+8hMx4E
 Vh8L+5wkeNkCXD8HwnHizPx5r0nLqw==
 =ouFA
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Some ext4 bug fixes (mostly to address Syzbot reports)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: bail out of ext4_xattr_ibody_get() fails for any reason
  ext4: add bounds checking in get_max_inline_xattr_value_size()
  ext4: add indication of ro vs r/w mounts in the mount message
  ext4: fix deadlock when converting an inline directory in nojournal mode
  ext4: improve error recovery code paths in __ext4_remount()
  ext4: improve error handling from ext4_dirhash()
  ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled
  ext4: check iomap type only if ext4_iomap_begin() does not fail
  ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum
  ext4: fix data races when using cached status extents
  ext4: avoid deadlock in fs reclaim with page writeback
  ext4: fix invalid free tracking in ext4_xattr_move_to_block()
  ext4: remove a BUG_ON in ext4_mb_release_group_pa()
  ext4: allow ext4_get_group_info() to fail
  ext4: fix lockdep warning when enabling MMP
  ext4: fix WARNING in mb_find_extent
2023-05-13 17:45:39 -07:00
Theodore Ts'o
2a534e1d0d ext4: bail out of ext4_xattr_ibody_get() fails for any reason
In ext4_update_inline_data(), if ext4_xattr_ibody_get() fails for any
reason, it's best if we just fail as opposed to stumbling on,
especially if the failure is EFSCORRUPTED.

Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
2220eaf909 ext4: add bounds checking in get_max_inline_xattr_value_size()
Normally the extended attributes in the inode body would have been
checked when the inode is first opened, but if someone is writing to
the block device while the file system is mounted, it's possible for
the inode table to get corrupted.  Add bounds checking to avoid
reading beyond the end of allocated memory if this happens.

Reported-by: syzbot+1966db24521e5f6e23f7@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=1966db24521e5f6e23f7
Cc: stable@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
6dcc98fbc4 ext4: add indication of ro vs r/w mounts in the mount message
Whether the file system is mounted read-only or read/write is more
important than the quota mode, which we are already printing.  Add the
ro vs r/w indication since this can be helpful in debugging problems
from the console log.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
f4ce24f54d ext4: fix deadlock when converting an inline directory in nojournal mode
In no journal mode, ext4_finish_convert_inline_dir() can self-deadlock
by calling ext4_handle_dirty_dirblock() when it already has taken the
directory lock.  There is a similar self-deadlock in
ext4_incvert_inline_data_nolock() for data files which we'll fix at
the same time.

A simple reproducer demonstrating the problem:

    mke2fs -Fq -t ext2 -O inline_data -b 4k /dev/vdc 64
    mount -t ext4 -o dirsync /dev/vdc /vdc
    cd /vdc
    mkdir file0
    cd file0
    touch file0
    touch file1
    attr -s BurnSpaceInEA -V abcde .
    touch supercalifragilisticexpialidocious

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230507021608.1290720-1-tytso@mit.edu
Reported-by: syzbot+91dccab7c64e2850a4e5@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=ba84cc80a9491d65416bc7877e1650c87530fe8a
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
4c0b4818b1 ext4: improve error recovery code paths in __ext4_remount()
If there are failures while changing the mount options in
__ext4_remount(), we need to restore the old mount options.

This commit fixes two problem.  The first is there is a chance that we
will free the old quota file names before a potential failure leading
to a use-after-free.  The second problem addressed in this commit is
if there is a failed read/write to read-only transition, if the quota
has already been suspended, we need to renable quota handling.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
4b3cb1d108 ext4: improve error handling from ext4_dirhash()
The ext4_dirhash() will *almost* never fail, especially when the hash
tree feature was first introduced.  However, with the addition of
support of encrypted, casefolded file names, that function can most
certainly fail today.

So make sure the callers of ext4_dirhash() properly check for
failures, and reflect the errors back up to their callers.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu
Reported-by: syzbot+394aa8a792cb99dbc837@syzkaller.appspotmail.com
Reported-by: syzbot+344aaa8697ebd232bfc8@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=db56459ea4ac4a676ae4b4678f633e55da005a9b
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Theodore Ts'o
a44be64bbe ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled
When a file system currently mounted read/only is remounted
read/write, if we clear the SB_RDONLY flag too early, before the quota
is initialized, and there is another process/thread constantly
attempting to create a directory, it's possible to trigger the

	WARN_ON_ONCE(dquot_initialize_needed(inode));

in ext4_xattr_block_set(), with the following stack trace:

   WARNING: CPU: 0 PID: 5338 at fs/ext4/xattr.c:2141 ext4_xattr_block_set+0x2ef2/0x3680
   RIP: 0010:ext4_xattr_block_set+0x2ef2/0x3680 fs/ext4/xattr.c:2141
   Call Trace:
    ext4_xattr_set_handle+0xcd4/0x15c0 fs/ext4/xattr.c:2458
    ext4_initxattrs+0xa3/0x110 fs/ext4/xattr_security.c:44
    security_inode_init_security+0x2df/0x3f0 security/security.c:1147
    __ext4_new_inode+0x347e/0x43d0 fs/ext4/ialloc.c:1324
    ext4_mkdir+0x425/0xce0 fs/ext4/namei.c:2992
    vfs_mkdir+0x29d/0x450 fs/namei.c:4038
    do_mkdirat+0x264/0x520 fs/namei.c:4061
    __do_sys_mkdirat fs/namei.c:4076 [inline]
    __se_sys_mkdirat fs/namei.c:4074 [inline]
    __x64_sys_mkdirat+0x89/0xa0 fs/namei.c:4074

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230506142419.984260-1-tytso@mit.edu
Reported-by: syzbot+6385d7d3065524c5ca6d@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=6513f6cb5cd6b5fc9f37e3bb70d273b94be9c34c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:05 -04:00
Baokun Li
fa83c34e3e ext4: check iomap type only if ext4_iomap_begin() does not fail
When ext4_iomap_overwrite_begin() calls ext4_iomap_begin() map blocks may
fail for some reason (e.g. memory allocation failure, bare disk write), and
later because "iomap->type ! = IOMAP_MAPPED" triggers WARN_ON(). When ext4
iomap_begin() returns an error, it is normal that the type of iomap->type
may not match the expectation. Therefore, we only determine if iomap->type
is as expected when ext4_iomap_begin() is executed successfully.

Cc: stable@kernel.org
Reported-by: syzbot+08106c4b7d60702dbc14@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/00000000000015760b05f9b4eee9@google.com
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230505132429.714648-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Tudor Ambarus
4f04351888 ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum
When modifying the block device while it is mounted by the filesystem,
syzbot reported the following:

BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58
Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586

CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc96618275234 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 crc16+0x206/0x280 lib/crc16.c:58
 ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187
 ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210
 ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline]
 ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173
 ext4_remove_blocks fs/ext4/extents.c:2527 [inline]
 ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline]
 ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958
 ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416
 ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342
 ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622
 notify_change+0xe50/0x1100 fs/attr.c:482
 do_truncate+0x200/0x2f0 fs/open.c:65
 handle_truncate fs/namei.c:3216 [inline]
 do_open fs/namei.c:3561 [inline]
 path_openat+0x272b/0x2dd0 fs/namei.c:3714
 do_filp_open+0x264/0x4f0 fs/namei.c:3741
 do_sys_openat2+0x124/0x4e0 fs/open.c:1310
 do_sys_open fs/open.c:1326 [inline]
 __do_sys_creat fs/open.c:1402 [inline]
 __se_sys_creat fs/open.c:1396 [inline]
 __x64_sys_creat+0x11f/0x160 fs/open.c:1396
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f72f8a8c0c9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280
RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000

Replace
	le16_to_cpu(sbi->s_es->s_desc_size)
with
	sbi->s_desc_size

It reduces ext4's compiled text size, and makes the code more efficient
(we remove an extra indirect reference and a potential byte
swap on big endian systems), and there is no downside. It also avoids the
potential KASAN / syzkaller failure, as a bonus.

Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com
Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff42
Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f3
Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/
Fixes: 717d50e497 ("Ext4: Uninitialized Block Groups")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Jan Kara
492888df0c ext4: fix data races when using cached status extents
When using cached extent stored in extent status tree in tree->cache_es
another process holding ei->i_es_lock for reading can be racing with us
setting new value of tree->cache_es. If the compiler would decide to
refetch tree->cache_es at an unfortunate moment, it could result in a
bogus in_range() check. Fix the possible race by using READ_ONCE() when
using tree->cache_es only under ei->i_es_lock for reading.

Cc: stable@kernel.org
Reported-by: syzbot+4a03518df1e31b537066@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/000000000000d3b33905fa0fd4a6@google.com
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230504125524.10802-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Jan Kara
00d873c17e ext4: avoid deadlock in fs reclaim with page writeback
Ext4 has a filesystem wide lock protecting ext4_writepages() calls to
avoid races with switching of journalled data flag or inode format. This
lock can however cause a deadlock like:

CPU0                            CPU1

ext4_writepages()
  percpu_down_read(sbi->s_writepages_rwsem);
                                ext4_change_inode_journal_flag()
                                  percpu_down_write(sbi->s_writepages_rwsem);
                                    - blocks, all readers block from now on
  ext4_do_writepages()
    ext4_init_io_end()
      kmem_cache_zalloc(io_end_cachep, GFP_KERNEL)
        fs_reclaim frees dentry...
          dentry_unlink_inode()
            iput() - last ref =>
              iput_final() - inode dirty =>
                write_inode_now()...
                  ext4_writepages() tries to acquire sbi->s_writepages_rwsem
                    and blocks forever

Make sure we cannot recurse into filesystem reclaim from writeback code
to avoid the deadlock.

Reported-by: syzbot+6898da502aef574c5f8a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000004c66b405fa108e27@google.com
Fixes: c8585c6fca ("ext4: fix races between changing inode journal mode and ext4_writepages")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230504124723.20205-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Theodore Ts'o
b87c7cdf2b ext4: fix invalid free tracking in ext4_xattr_move_to_block()
In ext4_xattr_move_to_block(), the value of the extended attribute
which we need to move to an external block may be allocated by
kvmalloc() if the value is stored in an external inode.  So at the end
of the function the code tried to check if this was the case by
testing entry->e_value_inum.

However, at this point, the pointer to the xattr entry is no longer
valid, because it was removed from the original location where it had
been stored.  So we could end up calling kvfree() on a pointer which
was not allocated by kvmalloc(); or we could also potentially leak
memory by not freeing the buffer when it should be freed.  Fix this by
storing whether it should be freed in a separate variable.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430160426.581366-1-tytso@mit.edu
Link: https://syzkaller.appspot.com/bug?id=5c2aee8256e30b55ccf57312c16d88417adbd5e1
Link: https://syzkaller.appspot.com/bug?id=41a6b5d4917c0412eb3b3c3c604965bed7d7420b
Reported-by: syzbot+64b645917ce07d89bde5@syzkaller.appspotmail.com
Reported-by: syzbot+0d042627c4f2ad332195@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Theodore Ts'o
463808f237 ext4: remove a BUG_ON in ext4_mb_release_group_pa()
If a malicious fuzzer overwrites the ext4 superblock while it is
mounted such that the s_first_data_block is set to a very large
number, the calculation of the block group can underflow, and trigger
a BUG_ON check.  Change this to be an ext4_warning so that we don't
crash the kernel.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-3-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-13 18:05:04 -04:00
Theodore Ts'o
5354b2af34 ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number
as BUG(), since in theory it should never happen.  However, if a
malicious attaker (or fuzzer) modifies the superblock via the block
device while it is the file system is mounted, it is possible for
s_first_data_block to get set to a very large number.  In that case,
when calculating the block group of some block number (such as the
starting block of a preallocation region), could result in an
underflow and very large block group number.  Then the BUG_ON check in
ext4_get_group_info() would fire, resutling in a denial of service
attack that can be triggered by root or someone with write access to
the block device.

For a quality of implementation perspective, it's best that even if
the system administrator does something that they shouldn't, that it
will not trigger a BUG.  So instead of BUG'ing, ext4_get_group_info()
will call ext4_error and return NULL.  We also add fallback code in
all of the callers of ext4_get_group_info() that it might NULL.

Also, since ext4_get_group_info() was already borderline to be an
inline function, un-inline it.  The results in a next reduction of the
compiled text size of ext4 by roughly 2k.

Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu
Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-05-13 18:02:46 -04:00
Linus Torvalds
76c7f8873a for-6.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRebDIACgkQxWXV+ddt
 WDu3vA//RNyRGjEz0HgfhTc1119DXJLwK6j544waYLrzRcMtBK4xKByiaFkAA4tL
 PQidGX+nAQPm+pZl0jcK30cBMObik5GXJwoSOZGl7/ectx4O7aFfXqiSfwPTyqZU
 3fTavoqoJxbxJCVbifcXOPNhsUxMlEGYJmA3CVRsllLviXY+3HMpX2ZpWZ7vch+N
 MLENNBfUo1HVdWaxOYfQif/qT5iR9G7D8dBjX9DUK0kVwrbwBB0rolJy4fPrY6z5
 gBLED9Ks3FBgyU3mYq4qrfPmbfF8mPiaU0+1j+B46vw3PdPtIwjIForR+91GsZ1v
 iHojbykf6VWTQV+gO78mgv4O4vRtn3C+UJaGxLL86OMOaiQQHFYdSETn9arPmoho
 p1wCBidI82tvfIOGYXgrTGorLN27hhyPJinHe/2Bqo+1wUL8/J8mwCWunIox7a8z
 rxO5QhDIDFX7gamsvYjkW3tBkYuGiGvBjx+Ic2cBHTkVp9wSPL9PCvqNNru2qexA
 t0BpAL9DxvN+T1xO1thC3qsm2Ogx0QEmgdDfRglbEVASnRZKZZsJEMO90FzFbkFg
 vLbs0KnT7yS7mTwq4NklDrgHZ0eiiJLZVCb8bR8xkzVW+ADrUmZuDM8WOcCgJAUp
 fUoMmFsJZi5zsdAOygDWr1bBHorLV5szrY0bSB5L2eHwJjYZ6KE=
 =uWUN
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs fixes from David Sterba:

 - fix incorrect number of bitmap entries for space cache if loading is
   interrupted by some error

 - fix backref walking, this breaks a mode of LOGICAL_INO_V2 ioctl that
   is used in deduplication tools

 - zoned mode fixes:
      - properly finish zone reserved for relocation
      - correctly calculate super block zone end on ZNS
      - properly initialize new extent buffer for redirty

 - make mount option clear_cache work with block-group-tree, to rebuild
   free-space-tree instead of temporarily disabling it that would lead
   to a forced read-only mount

 - fix alignment check for offset when printing extent item

* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: make clear_cache mount option to rebuild FST without disabling it
  btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
  btrfs: zoned: fix full zone super block reading on ZNS
  btrfs: zoned: zone finish data relocation BG with last IO
  btrfs: fix backref walking not returning all inode refs
  btrfs: fix space cache inconsistency after error loading it from disk
  btrfs: print-tree: parent bytenr must be aligned to sector size
2023-05-12 17:10:32 -05:00
Linus Torvalds
fd88f147cb 6 smb3 client fixes, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRd3pcACgkQiiy9cAdy
 T1FrEAwAj6M0DnI13diLdjZVs9WAVRIiu5Fk1l8oA6bco74UepbJjZmTf8K8G0pM
 fQDTXZWDa5g11tbYuXFb4Ue9Dv7hu4RuUthWZCxoDnKcvH00VtAtYplE1LJstWBh
 mZwHm8iEBRPs2oB1YZ7KnQ8kFSjs2VgwV2PQAI24QDP0QWTT4jtw/nPqLxKLxyef
 Rr9YkOK0LZPhhGEAZ5ABwkPH97U5CUMMwRTqhGDO/mm91FZ8sD+vAsqM7GdLU73w
 wGHp3M1RKBVBubNCgNT0uRMlnlqBcvoThqhYgJ6PYnS4uE23a2uLDNI8vj9LlMRo
 csoYdM+Kltpvki8hFIWiFxfHFfECHD6h/QOpRMSv9InU0ly6o7zwnl9VtuWcRyrM
 vyz3xnlI6wG6TCy1HGezq3HSaEbEbVpwLCxyG/xFiAHV/VdkmjmRUb0RJV4X3X8M
 /KUT+1wj0MAMmVUTIpIHuotRikScAPnP/NDMorIsgEUp/2TFh2m+FQZ15o401sGv
 96G8v+9I
 =ax5r
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:

 - fix for copy_file_range bug for very large files that are multiples
   of rsize

 - do not ignore "isolated transport" flag if set on share

 - set rasize default better

 - three fixes related to shutdown and freezing (fixes 4 xfstests, and
   closes deferred handles faster in some places that were missed)

* tag '6.4-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: release leases for deferred close handles when freezing
  smb3: fix problem remounting a share after shutdown
  SMB3: force unmount was failing to close deferred close files
  smb3: improve parallel reads of large files
  do not reuse connection if share marked as isolated
  cifs: fix pcchunk length type in smb2_copychunk_range
2023-05-12 17:01:36 -05:00
Linus Torvalds
df8c2d13e2 vfs/v6.4-rc1/pipe
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZF5YpQAKCRCRxhvAZXjc
 oqZpAQCuXA/XDf4Cyhfk87C0gbzy1Akjf3Xpia8q1mWtvN4I+AD+PE54krWgxGRi
 mu5Ae2X95tQ1v+MUAcmWOMdITK8HewE=
 =8+GL
 -----END PGP SIGNATURE-----

Merge tag 'vfs/v6.4-rc1/pipe' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fix from Christian Brauner:
 "During the pipe nonblock rework the check for both O_NONBLOCK and
  IOCB_NOWAIT was dropped. Both checks need to be performed to ensure
  that files without O_NONBLOCK but IOCB_NOWAIT don't block when writing
  to or reading from a pipe.

  This just contains the fix adding the check for IOCB_NOWAIT back in"

* tag 'vfs/v6.4-rc1/pipe' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  pipe: check for IOCB_NOWAIT alongside O_NONBLOCK
2023-05-12 16:56:09 -05:00
Jens Axboe
c04fe8e32f
pipe: check for IOCB_NOWAIT alongside O_NONBLOCK
Pipe reads or writes need to enable nonblocking attempts, if either
O_NONBLOCK is set on the file, or IOCB_NOWAIT is set in the iocb being
passed in. The latter isn't currently true, ensure we check for both
before waiting on data or space.

Fixes: afed6271f5 ("pipe: set FMODE_NOWAIT on pipes")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <e5946d67-4e5e-b056-ba80-656bab12d9f6@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-05-12 17:17:27 +02:00
Linus Torvalds
849a4f0973 xfs: bug fixes for 6.4-rc2
o fixes for inode garbage collection shutdown racing with work queue
   updates
 o ensure inodegc workers run on the CPU they are supposed to
 o disable counter scrubbing until we can exclusively freeze the
   filesystem from the kernel
 o Regression fixes for new allocation related bugs
 o a couple of minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRcSIsUHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h2Y8xAAxtsTdOx71XtDuNyfBOiqzZgTCq6b
 6LsckJIDQa1AXjUNq9G3zWcUcWBcRWcw+CWbkqjqQ9W47K/ijLuoKnjRsQ+5B4DU
 TBUctVq+/Zk2lBlb6HKuKdzqDGnIFWGVKVd7u8KlowqnXuzUeQ0vFkT7ZHTepUKG
 P+midgGNVT4+tykq7oH0H8WxoTyNPZhKiAUcZjneBgA60IAoQWHA2iUt+SKpbrkL
 1HyK+/edVMTXiDXtyHfXmDaH9Pgy6NCpw3TNkPDhuL1UDpLhg/zgT39rFZGBsAUt
 gaDM3wN5jBrot/mvJE3rH9bdZhkcf+NQKPx/1DDg3DL8plS/1/LUC4cImdolBJ3w
 RNmgJv1lK+AlE4MUJ/bUDlEpHUmwAjnnsxBXwEvnYNfj+9V6/mDB+HqKiY7/XxVK
 vF77s6z+CWvefdnZavJ4/72pVVJNkcDYCYmvh/donRP6vtnwZyzocFUeBeNMInV1
 /s3WMrF9hwmJqAClKG7p1fnszWp658yFIuw/TXVs+NrjTtQgXwMpl2cEYYvUZEJN
 Trq2p0xH/JSwcnOPSPJO6WHb8UPoqrM6lgGFaJVWJx1AWt1i1CFLf5eA5X+XisDV
 AJKgpqlnDg02bBMQ0tMFGZUaNx/1S1mwtxcZsyEFTutpUNxqJKDaMohpxxrWb0WC
 ppSqDvyJN4wtlFI=
 =qok2
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs bug fixes from Dave Chinner:
 "Largely minor bug fixes and cleanups, th emost important of which are
  probably the fixes for regressions in the extent allocation code:

   - fixes for inode garbage collection shutdown racing with work queue
     updates

   - ensure inodegc workers run on the CPU they are supposed to

   - disable counter scrubbing until we can exclusively freeze the
     filesystem from the kernel

   - regression fixes for new allocation related bugs

   - a couple of minor cleanups"

* tag 'xfs-6.4-rc1-fixes' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix xfs_inodegc_stop racing with mod_delayed_work
  xfs: disable reaping in fscounters scrub
  xfs: check that per-cpu inodegc workers actually run on that cpu
  xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
  xfs: fix negative array access in xfs_getbmap
  xfs: don't allocate into the data fork for an unshare request
  xfs: flush dirty data and drain directios before scrubbing cow fork
  xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
  xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
2023-05-11 16:51:11 -05:00
Steve French
d39fc592ef cifs: release leases for deferred close handles when freezing
We should not be caching closed files when freeze is invoked on an fs
(so we can release resources more gracefully).

Fixes xfstests generic/068 generic/390 generic/491

Reviewed-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-10 17:48:30 -05:00
Linus Torvalds
d295b66a7b \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmRcFLAACgkQnJ2qBz9k
 QNmSuQf+P+hFOHVpCb83S4TrNsfPC9M7Pv/M1Rw926p1gESNLez1ElAx2B35OIBZ
 y4ijqrSHgXbCsC3PvbxR9YfBPJZwIhtmKwjPC+9KlhAFB536Bvt/nt34fjRCWJiS
 nABnexegkirk7EQufAGVZQ1gtm1p0qSCA1B/l5AFl4KeLIin31U/08F2eE1OAo3K
 TmxhXzlC+JRXjf/X/X8mzJ8nz/hY5039BDvs7EpVdOsZcLFRQcCUygUCrOdV94uI
 1wyzMoGNaiIsZHpHPf3NtphUoE9IFpZQJGXMTR4ehBgFYkkOHYxsjW+wn8YOmaks
 qzRRXb9w0WdgAXChDGu1xYXV6Wpjiw==
 =kqSe
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull inotify fix from Jan Kara:
 "A fix for possibly reporting invalid watch descriptor with inotify
  event"

* tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Avoid reporting event with invalid wd
2023-05-10 17:07:42 -05:00
Linus Torvalds
2a78769da3 gfs2 fix
- Fix a NULL pointer dereference when mounting corrupted filesystems.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmRbtSYUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrQSg/6A7xD25DpU1CSL18CEKMxnViXPNLP
 IYiTtkByCsRkgQKR4KMtVgPD/lEyWbhpuZJzSF4mZkslpnngjL/o+Rj4Sds9GvPL
 +66CSJAqjpJUjgGDvAoiJcZYOejRyqZLCHzKF54h0oJ3k9JCd3xN86EX8vKGoc2e
 F/As+E1Wnf0PP4jiERDyFhhg95MNxFuNVJ/JJ7mwtWpumx+60SpGzzUp575AHcoe
 4zHkVHlj6FAz2G8l8NADYAc0wMuhQ05g19psHzRDr2bB9zHBy9kWzXN0MYrhRvuY
 HuQ3CNDbXCJPXFjrNEmLV0f7Y8yCk/YNrYPbo4zV+AW0l7NzesqzrwkP/qgNfySD
 pnQJ/mQ76P/h728fln0pYvO8GnFUFoKKoCR27jH4F/Mntr4hgR4FZNDhvK6BMQNc
 PpipJcXWT8AUY45Sg3TLKPEhjp37K+abiMKfyTEFQRvLT83O6WyCUx2HYpZI7l87
 /aTIsidlYkb2ZSeLeXEfBplRXwF7NQNvvRJCGQKHAN7UbTB1o2RVpC5bn6UIghb1
 eQQZObs4AUfmzHthF2OaeJy08AiFidbOVLmbpZuj2iOWbJzmjJea8YT/GdntxhMT
 swmkHJqdmZEYoSKKvDmemT6l7qtOwijg7sTQ/d2D77lk8QPPIHdUw6g1OVvMqoNE
 oESpIA4O2fshlpU=
 =/s3O
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Fix a NULL pointer dereference when mounting corrupted filesystems

* tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't deref jdesc in evict
2023-05-10 16:48:35 -05:00
Uwe Kleine-König
48f2c681df pstore/ram: Convert to platform remove callback returning void
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is (mostly) ignored
and this typically results in resource leaks. To improve here there is a
quest to make the remove callback return void. In the first step of this
quest all drivers are converted to .remove_new() which already returns
void.

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230401120000.2487153-1-u.kleine-koenig@pengutronix.de
2023-05-10 14:06:00 -07:00
Bob Peterson
504a10d9e4 gfs2: Don't deref jdesc in evict
On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:

init_journal()
...
fail_jindex:
   gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
      if (gfs2_holder_initialized(&ji_gh))
         gfs2_glock_dq_uninit(&ji_gh);
fail:
   iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
      evict()
         gfs2_evict_inode()
            evict_linked_inode()
               ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.

The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.

This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.

Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-05-10 17:15:18 +02:00
Qu Wenruo
1d6a4fc857 btrfs: make clear_cache mount option to rebuild FST without disabling it
Previously clear_cache mount option would simply disable free-space-tree
feature temporarily then re-enable it to rebuild the whole free space
tree.

But this is problematic for block-group-tree feature, as we have an
artificial dependency on free-space-tree feature.

If we go the existing method, after clearing the free-space-tree
feature, we would flip the filesystem to read-only mode, as we detect a
super block write with block-group-tree but no free-space-tree feature.

This patch would change the behavior by properly rebuilding the free
space tree without disabling this feature, thus allowing clear_cache
mount option to work with block group tree.

Now we can mount a filesystem with block-group-tree feature and
clear_mount option:

  $ mkfs.btrfs  -O block-group-tree /dev/test/scratch1  -f
  $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache
  $ sudo dmesg -t | head -n 5
  BTRFS info (device dm-1): force clearing of disk cache
  BTRFS info (device dm-1): using free space tree
  BTRFS info (device dm-1): auto enabling async discard
  BTRFS info (device dm-1): rebuilding free space tree
  BTRFS info (device dm-1): checking UUID tree

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:51:27 +02:00
Christoph Hellwig
c83b56d1dd btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add
btrfs_redirty_list_add zeroes the buffer data and sets the
EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus
header.  But it does that after already marking the buffer dirty, which
means that writeback could already be looking at the buffer.

Switch the order of operations around so that the buffer is only marked
dirty when we're ready to write it.

Fixes: d3575156f6 ("btrfs: zoned: redirty released extent buffers")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:29 +02:00
Naohiro Aota
02ca9e6fb5 btrfs: zoned: fix full zone super block reading on ZNS
When both of the superblock zones are full, we need to check which
superblock is newer. The calculation of last superblock position is wrong
as it does not consider zone_capacity and uses the length.

Fixes: 9658b72ef3 ("btrfs: zoned: locate superblock position using zone capacity")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:22 +02:00
Naohiro Aota
f84353c7c2 btrfs: zoned: zone finish data relocation BG with last IO
For data block groups, we zone finish a zone (or, just deactivate it) when
seeing the last IO in btrfs_finish_ordered_io(). That is only called for
IOs using ZONE_APPEND, but we use a regular WRITE command for data
relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10 14:50:12 +02:00
Filipe Manana
0cad8f14d7 btrfs: fix backref walking not returning all inode refs
When using the logical to ino ioctl v2, if the flag to ignore offsets of
file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the
backref walking code ends up not returning references for all file offsets
of an inode that point to the given logical bytenr. This happens since
kernel 6.2, commit 6ce6ba5344 ("btrfs: use a single argument for extent
offset in backref walking functions") because:

1) It mistakenly skipped the search for file extent items in a leaf that
   point to the target extent if that flag is given. Instead it should
   only skip the filtering done by check_extent_in_eb() - that is, it
   should not avoid the calls to that function (or find_extent_in_eb(),
   which uses it).

2) It was also not building a list of inode extent elements (struct
   extent_inode_elem) if we have multiple inode references for an extent
   when the ignore offset flag is given to the logical to ino ioctl - it
   would leave a single element, only the last one that was found.

These stem from the confusing old interface for backref walking functions
where we had an extent item offset argument that was a pointer to a u64
and another boolean argument that indicated if the offset should be
ignored, but the pointer could be NULL. That NULL case is used by
relocation, qgroup extent accounting and fiemap, simply to avoid building
the inode extent list for each reference, as it's not necessary for those
use cases and therefore avoids memory allocations and some computations.

Fix this by adding a boolean argument to the backref walk context
structure to indicate that the inode extent list should not be built,
make relocation set that argument to true and fix the backref walking
logic to skip the calls to check_extent_in_eb() and find_extent_in_eb()
only if this new argument is true, instead of 'ignore_extent_item_pos'
being true.

A test case for fstests will be added soon, to provide cover not only
for these cases but to the logical to ino ioctl in general as well, as
currently we do not have a test case for it.

Reported-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Link: https://lore.kernel.org/linux-btrfs/CAHhfkvwo=nmzrJSqZ2qMfF-rZB-ab6ahHnCD_sq9h4o8v+M7QQ@mail.gmail.com/
Fixes: 6ce6ba5344 ("btrfs: use a single argument for extent offset in backref walking functions")
CC: stable@vger.kernel.org # 6.2+
Tested-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09 22:09:11 +02:00
Filipe Manana
0004ff15ea btrfs: fix space cache inconsistency after error loading it from disk
When loading a free space cache from disk, at __load_free_space_cache(),
if we fail to insert a bitmap entry, we still increment the number of
total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
since we failed to add the bitmap entry. On error we then empty the
cache by calling __btrfs_remove_free_space_cache(), which will result
in getting the total bitmaps counter set to 1.

A failure to load a free space cache is not critical, so if a failure
happens we just rebuild the cache by scanning the extent tree, which
happens at block-group.c:caching_thread(). Yet the failure will result
in having the total bitmaps of the btrfs_free_space_ctl always bigger
by 1 then the number of bitmap entries we have. So fix this by having
the total bitmaps counter be incremented only if we successfully added
the bitmap entry.

Fixes: a67509c300 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
Reviewed-by: Anand Jain <anand.jain@oracle.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09 22:08:05 +02:00
Anastasia Belova
c87f318e6f btrfs: print-tree: parent bytenr must be aligned to sector size
Check nodesize to sectorsize in alignment check in print_extent_item.
The comment states that and this is correct, similar check is done
elsewhere in the functions.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: ea57788eb7 ("btrfs: require only sector size alignment for parent eb bytenr")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anastasia Belova <abelova@astralinux.ru>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-09 22:07:40 +02:00
Linus Torvalds
16a8829130 nfs: fix another case of NULL/IS_ERR confusion wrt folio pointers
Dan has been improving on the smatch error pointer checks, and pointed
at another case where the __filemap_get_folio() conversion to error
pointers had been overlooked.  This time because it was hidden behind
the filemap_grab_folio() helper function that is a wrapper around it.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-09 10:22:13 -07:00
Linus Torvalds
1dc3731daf for-6.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRaYhYACgkQxWXV+ddt
 WDvCRQ/+MjRuInALh+N34mMneF8jjPOlUQBZbaC43XYJ0ss9drSzvE3STmPVrjdK
 IHyzRYipKI6vdTtzYbyGwxJ9oazsuXTQXC3w/qMW1hO1EAQ0a9tbnTSIQ+BDbU63
 BW7rJ3JuM6hKxKK+e9Dserhks0lOgQc+xKT1CUELvAHp3UykD4OrNczguaIT2lGR
 YXL+9B3ex2SooCqrQStkqEtjD/kxbaYUkK7yWA2FssXWqU5SjZwUOsuY3ZPOWrm1
 ULNI67gIxkMkSynV3aYka7nY3xc9oGIfk9WPeylWcOcH3+pWabeptjk617XbA0KI
 4biz1zZ/qTRXWlCLDv3ukUa5EIVAWQ1kxVE/hAt3SzqJvoqB/ymML/2LeQNdyx2i
 adMTZQ95JkhQNU9Lp9QOtpgfZonhhjxnL9KE7eMVo28zJFdYjge3egINjimY+mLz
 qzrzUBI3bqCNYG0LRR1EvuN0feBd/9nNMFjLBi2mkDqsWtzvTxxzWvVlV5EEcoJe
 xrozGh00Y5ioP6ZanKuZRib+u2ligbD66dYhKSU74D6B5kuZPic3Kkn9qICjRByM
 uBGBze/7GT/3ouhPOwxVPtGZstiFhbAxE7mApROrIxAx8I9rZjBdHgFJQklolXNy
 HSKNf3u98XZBVVcku/O1hyoeTLnApPfApxD4lv3qlRmgdEnAp6I=
 =K25T
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix backward leaf iteration which could possibly return the same key

 - fix assertion when device add and balance race for exclusive
   operation

 - fix regression when freeing device, state tree would leak after
   device replace

 - fix attempt to clear space cache v1 when block-group-tree is enabled

 - fix potential i_size corruption when encoded write races with send v2
   and enabled no-holes (the race is hard to hit though, the window is a
   few instructions wide)

 - fix wrong bitmap API use when checking empty zones, parameters were
   swapped but not causing a bug due to other code

 - prevent potential qgroup leak if subvolume create does not commit
   transaction (which is pending in the development queue)

 - error handling and reporting:
     - abort transaction when sibling keys check fails for leaves
     - print extent buffers when sibling keys check fails

* tag 'for-6.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't free qgroup space unless specified
  btrfs: fix encoded write i_size corruption with no-holes
  btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones
  btrfs: properly reject clear_cache and v1 cache for block-group-tree
  btrfs: print extent buffers when sibling keys check fails
  btrfs: abort transaction when sibling keys check fails for leaves
  btrfs: fix leak of source device allocation state after device replace
  btrfs: fix assertion of exclop condition when starting balance
  btrfs: fix btrfs_prev_leaf() to not return the same key twice
2023-05-09 09:53:41 -07:00
Steve French
716a3cf317 smb3: fix problem remounting a share after shutdown
xfstests generic/392 showed a problem where even after a
shutdown call was made on a mount, we would still attempt
to use the (now inaccessible) superblock if another mount
was attempted for the same share.

Reported-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Fixes: 087f757b01 ("cifs: add shutdown support")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-09 09:54:58 -05:00
Steve French
2cb6f96877 SMB3: force unmount was failing to close deferred close files
In investigating a failure with xfstest generic/392 it
was noticed that mounts were reusing a superblock that should
already have been freed. This turned out to be related to
deferred close files keeping a reference count until the
closetimeo expired.

Currently the only way an fs knows that mount is beginning is
when force unmount is called, but when this, ie umount_begin(),
is called all deferred close files on the share (tree
connection) should be closed immediately (unless shared by
another mount) to avoid using excess resources on the server
and to avoid reusing a superblock which should already be freed.

In umount_begin, close all deferred close handles for that
share if this is the last mount using that share on this
client (ie send the SMB3 close request over the wire for those
that have been already closed by the app but that we have
kept a handle lease open for and have not sent closes to the
server for yet).

Reported-by: David Howells <dhowells@redhat.com>
Acked-by: Bharath SM <bharathsm@microsoft.com>
Cc: <stable@vger.kernel.org>
Fixes: 78c09634f7 ("Cifs: Fix kernel oops caused by deferred close for files.")
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-09 09:53:59 -05:00
Steve French
ba8c2b75b0 smb3: improve parallel reads of large files
rasize (ra_pages) should be set higher than read size by default
to allow parallel reads when reading large files in order to
improve performance (otherwise there is much dead time on the
network when doing readahead of large files).  Default rasize
to twice readsize.

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-09 00:59:48 -05:00
Christophe JAILLET
08c3eab525 f2fs: remove some dead code
'ret' is known to be 0 at the point.
So these lines of code should just be removed.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-08 11:18:04 -07:00
Li Zetao
1223e432d9 f2fs: remove redundant goto statement in f2fs_read_single_page()
After the commit "0a4ee518185", this "goto" statement was redundant,
remote it for clean code.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-08 11:18:04 -07:00
Yangtao Li
7cd2e5f75b f2fs: do not allow to defragment files have FI_COMPRESS_RELEASED
If a file has FI_COMPRESS_RELEASED, all writes for it should not be
allowed.

Fixes: 5fdb322ff2 ("f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE")
Signed-off-by: Qi Han <hanqi@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-08 11:18:04 -07:00
Yangtao Li
888ca6edac f2fs: add sanity check for proc_mkdir
Return -ENOMEM when proc_mkdir failed.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-08 11:18:04 -07:00
Chao Yu
b62e71be21 f2fs: support errors=remount-ro|continue|panic mountoption
This patch supports errors=remount-ro|continue|panic mount option
for f2fs.

f2fs behaves as below in three different modes:
mode			continue	remount-ro	panic
access ops		normal		noraml		N/A
syscall errors		-EIO		-EROFS		N/A
mount option		rw		ro		N/A
pending dir write	keep		keep		N/A
pending non-dir write	drop		keep		N/A
pending node write	drop		keep		N/A
pending meta write	keep		keep		N/A

By default it uses "continue" mode.

[Yangtao helps to clean up function's name]
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2023-05-08 11:18:04 -07:00
Steve French
cbd4cbabef do not reuse connection if share marked as isolated
"SHAREFLAG_ISOLATED_TRANSPORT" indicates that we should not reuse the socket
for this share (for future mounts).  Mark the socket as server->nosharesock if
share flags returned include SHAREFLAG_ISOLATED_TRANSPORT.

See MS-SMB2 MS-SMB2 2.2.10 and 3.2.5.5

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-08 12:00:47 -05:00
Pawel Witek
d66cde50c3 cifs: fix pcchunk length type in smb2_copychunk_range
Change type of pcchunk->Length from u32 to u64 to match
smb2_copychunk_range arguments type. Fixes the problem where performing
server-side copy with CIFS_IOC_COPYCHUNK_FILE ioctl resulted in incomplete
copy of large files while returning -EINVAL.

Fixes: 9bf0c9cd43 ("CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files")
Cc: <stable@vger.kernel.org>
Signed-off-by: Pawel Witek <pawel.ireneusz.witek@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-08 12:00:47 -05:00
Jan Kara
949f95ff39 ext4: fix lockdep warning when enabling MMP
When we enable MMP in ext4_multi_mount_protect() during mount or
remount, we end up calling sb_start_write() from write_mmp_block(). This
triggers lockdep warning because freeze protection ranks above s_umount
semaphore we are holding during mount / remount. The problem is harmless
because we are guaranteed the filesystem is not frozen during mount /
remount but still let's fix the warning by not grabbing freeze
protection from ext4_multi_mount_protect().

Cc: stable@kernel.org
Reported-by: syzbot+6b7df7d5506b32467149@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=ab7e5b6f400b7778d46f01841422e5718fb81843
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230411121019.21940-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-07 21:11:18 -04:00
Ye Bin
fa08a7b61d ext4: fix WARNING in mb_find_extent
Syzbot found the following issue:

EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support!
EXT4-fs (loop0): orphan cleanup on readonly fs
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5067 at fs/ext4/mballoc.c:1869 mb_find_extent+0x8a1/0xe30
Modules linked in:
CPU: 1 PID: 5067 Comm: syz-executor307 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
RIP: 0010:mb_find_extent+0x8a1/0xe30 fs/ext4/mballoc.c:1869
RSP: 0018:ffffc90003c9e098 EFLAGS: 00010293
RAX: ffffffff82405731 RBX: 0000000000000041 RCX: ffff8880783457c0
RDX: 0000000000000000 RSI: 0000000000000041 RDI: 0000000000000040
RBP: 0000000000000040 R08: ffffffff82405723 R09: ffffed10053c9402
R10: ffffed10053c9402 R11: 1ffff110053c9401 R12: 0000000000000000
R13: ffffc90003c9e538 R14: dffffc0000000000 R15: ffffc90003c9e2cc
FS:  0000555556665300(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056312f6796f8 CR3: 0000000022437000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ext4_mb_complex_scan_group+0x353/0x1100 fs/ext4/mballoc.c:2307
 ext4_mb_regular_allocator+0x1533/0x3860 fs/ext4/mballoc.c:2735
 ext4_mb_new_blocks+0xddf/0x3db0 fs/ext4/mballoc.c:5605
 ext4_ext_map_blocks+0x1868/0x6880 fs/ext4/extents.c:4286
 ext4_map_blocks+0xa49/0x1cc0 fs/ext4/inode.c:651
 ext4_getblk+0x1b9/0x770 fs/ext4/inode.c:864
 ext4_bread+0x2a/0x170 fs/ext4/inode.c:920
 ext4_quota_write+0x225/0x570 fs/ext4/super.c:7105
 write_blk fs/quota/quota_tree.c:64 [inline]
 get_free_dqblk+0x34a/0x6d0 fs/quota/quota_tree.c:130
 do_insert_tree+0x26b/0x1aa0 fs/quota/quota_tree.c:340
 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375
 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375
 do_insert_tree+0x722/0x1aa0 fs/quota/quota_tree.c:375
 dq_insert_tree fs/quota/quota_tree.c:401 [inline]
 qtree_write_dquot+0x3b6/0x530 fs/quota/quota_tree.c:420
 v2_write_dquot+0x11b/0x190 fs/quota/quota_v2.c:358
 dquot_acquire+0x348/0x670 fs/quota/dquot.c:444
 ext4_acquire_dquot+0x2dc/0x400 fs/ext4/super.c:6740
 dqget+0x999/0xdc0 fs/quota/dquot.c:914
 __dquot_initialize+0x3d0/0xcf0 fs/quota/dquot.c:1492
 ext4_process_orphan+0x57/0x2d0 fs/ext4/orphan.c:329
 ext4_orphan_cleanup+0xb60/0x1340 fs/ext4/orphan.c:474
 __ext4_fill_super fs/ext4/super.c:5516 [inline]
 ext4_fill_super+0x81cd/0x8700 fs/ext4/super.c:5644
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Add some debug information:
mb_find_extent: mb_find_extent block=41, order=0 needed=64 next=0 ex=0/41/1@3735929054 64 64 7
block_bitmap: ff 3f 0c 00 fc 01 00 00 d2 3d 00 00 00 00 00 00 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Acctually, blocks per group is 64, but block bitmap indicate at least has
128 blocks. Now, ext4_validate_block_bitmap() didn't check invalid block's
bitmap if set.
To resolve above issue, add check like fsck "Padding at end of block bitmap is
not set".

Cc: stable@kernel.org
Reported-by: syzbot+68223fe9f6c95ad43bed@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116020015.1506120-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-05-07 21:03:24 -04:00
Linus Torvalds
63342b1dd5 9 smb3 client fixes, mostly DFS or reconnect related
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRWpDAACgkQiiy9cAdy
 T1Gm1gv+IZhwnTNWJ5OLaFKVjKsM5flAaj4To15hFxHDUgfkTlG+c0T2jbrjQZq5
 OEdnXn/zgVrq/JTU6TIrMA6netmkNKEvEW/MPb5sKgbXdkM8dM8AkCYjs2WvaHLV
 IBrmo9URUr0/I0rBMCvEFHGQvvggAizKrkvqVZq1DmQ+8Y7WXMWBuINux6i2+NmJ
 uDgBx/FcEIqpc0i5STD7hQ3kZ/MnBOVcd4lxUAZk6GtpiKZe/u6veWpyt9FqD8N7
 AXQhhfoLXEPNda1N5P1tfCIsWO+70XzgF5eetr95/71f3MIWPNbdOyJ9MFyr8DzV
 dUA8uReV8RVF7uCEC6AnXmOlteZ855trzpk8EK54ey5fmcnelzMLsIKNbzxgcJZP
 y4EDNbP57b+G28SnWTLsdXfrBxicO7ExBPOiys2hfJ3VCC5xSkN04IDlRnsgZKiE
 Tbwo/eaZy31Z6EN5BiqjFI/RgNZJaEdl9GeGZJFum3AGB+wKDSWl5Es77i11PBuI
 5qZgahwT
 =ZsoO
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "smb3 client fixes, mostly DFS or reconnect related:

   - Two DFS connection sharing fixes

   - DFS refresh fix

   - Reconnect fix

   - Two potential use after free fixes

   - Also print prefix patch in mount debug msg

   - Two small cleanup fixes"

* tag '6.4-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Remove unneeded semicolon
  cifs: fix sharing of DFS connections
  cifs: avoid potential races when handling multiple dfs tcons
  cifs: protect access of TCP_Server_Info::{origin,leaf}_fullpath
  cifs: fix potential race when tree connecting ipc
  cifs: fix potential use-after-free bugs in TCP_Server_Info::hostname
  cifs: print smb3_fs_context::source when mounting
  cifs: protect session status check in smb2_reconnect()
  SMB3.1.1: correct definition for app_instance_id create contexts
2023-05-07 10:46:21 -07:00
Linus Torvalds
706ce3caea Five hotfixes. Three are cc:stable, two pertain to post-6.3 merge window
changes.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFaSnAAKCRDdBJ7gKXxA
 jskTAPwIkL6xVMIEdtWfQNIy751o0onzfOQAIWgM3dSL83cMJQEA5OXxXt4LjrJw
 4q0zf4GLYXjMjhnfYr8zXF39y05iyQA=
 =nIhV
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Five hotfixes.

  Three are cc:stable, two pertain to merge window changes"

* tag 'mm-hotfixes-stable-2023-05-06-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  afs: fix the afs_dir_get_folio return value
  nilfs2: do not write dirty data after degenerating to read-only
  mm: do not reclaim private data from pinned page
  nilfs2: fix infinite loop in nilfs_mdt_get_block()
  mm/mmap/vma_merge: always check invariants
2023-05-06 11:25:03 -07:00
Linus Torvalds
994e2419f1 nfs: fix mis-merged __filemap_get_folio() error check
Fix another case of an incorrect check for the returned 'folio' value
from __filemap_get_folio().

The failure case used to return NULL, but was changed by commit
66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio").

But in the meantime, commit ec108d3cc7 ("NFS: Convert readdir page
array functions to use a folio") added a new user of that function.

And my merge of the two did not fix this up correctly.

The ext4 merge had the same issue, but that one had been caught in
linux-next and got properly fixed while merging.

Fixes: 0127f25b5d ("Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs")
Cc: Anna Schumaker <anna@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-06 10:16:19 -07:00
Christoph Hellwig
58f5f6698a afs: fix the afs_dir_get_folio return value
Keep returning NULL on failure instead of letting an ERR_PTR escape to
callers that don't expect it.

Link: https://lkml.kernel.org/r/20230503154526.1223095-2-hch@lst.de
Fixes: 66dabbb65d ("mm: return an ERR_PTR from __filemap_get_folio")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Cc: Marc Dionne <marc.dionne@auristor.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06 10:10:08 -07:00
Ryusuke Konishi
28a65b49eb nilfs2: do not write dirty data after degenerating to read-only
According to syzbot's report, mark_buffer_dirty() called from
nilfs_segctor_do_construct() outputs a warning with some patterns after
nilfs2 detects metadata corruption and degrades to read-only mode.

After such read-only degeneration, page cache data may be cleared through
nilfs_clear_dirty_page() which may also clear the uptodate flag for their
buffer heads.  However, even after the degeneration, log writes are still
performed by unmount processing etc., which causes mark_buffer_dirty() to
be called for buffer heads without the "uptodate" flag and causes the
warning.

Since any writes should not be done to a read-only file system in the
first place, this fixes the warning in mark_buffer_dirty() by letting
nilfs_segctor_do_construct() abort early if in read-only mode.

This also changes the retry check of nilfs_segctor_write_out() to avoid
unnecessary log write retries if it detects -EROFS that
nilfs_segctor_do_construct() returned.

Link: https://lkml.kernel.org/r/20230427011526.13457-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+2af3bc9585be7f23f290@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=2af3bc9585be7f23f290
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06 10:10:07 -07:00
Ryusuke Konishi
a6a491c048 nilfs2: fix infinite loop in nilfs_mdt_get_block()
If the disk image that nilfs2 mounts is corrupted and a virtual block
address obtained by block lookup for a metadata file is invalid,
nilfs_bmap_lookup_at_level() may return the same internal return code as
-ENOENT, meaning the block does not exist in the metadata file.

This duplication of return codes confuses nilfs_mdt_get_block(), causing
it to read and create a metadata block indefinitely.

In particular, if this happens to the inode metadata file, ifile,
semaphore i_rwsem can be left held, causing task hangs in lock_mount.

Fix this issue by making nilfs_bmap_lookup_at_level() treat virtual block
address translation failures with -ENOENT as metadata corruption instead
of returning the error code.

Link: https://lkml.kernel.org/r/20230430193046.6769-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+221d75710bde87fa0e97@syzkaller.appspotmail.com
  Link: https://syzkaller.appspot.com/bug?extid=221d75710bde87fa0e97
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-06 10:10:07 -07:00
Linus Torvalds
a3b111b046 for-6.4/block-2023-05-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWLQYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnwhD/4xRYfAY4O0oGZCITiKEEFxiSfDHCPMEBji
 zTEtideDRjcrpFmjmZ411C9prW32MxYQ3wQTf4O7w4t906xTYVr9FQy8g3Et4izI
 zglPcsa2jPzYCadQ0Ye4n9dRuYOH9FDJzDDLC+smu8zQKKmAqEAN/1ftpnADdVrY
 qX1sHyfz4RQRAgTHg2WqgKOi2O9VwGSMwOfBocmzAnruv9oLUlypcGnFPRCSZH/a
 OKpUZlvQhCBZTKScvVxQeJMg2Tl5yokQ0TH+gkQsdav9XcPJktqXiuD+c4h6q5ux
 oTlysEqcrwcaAafOcV0w9u80SFNlFACUYsNmEnFJaPXFTqAdNHvo1DJNsmxiHJDU
 bGo5ktlo5b/VZ51niOoWvGxursavq16G4yIYlGGHc7f4wGs12oc5ZP/yM3GRUY+C
 PdezwEvvQufxP7sFokfpgAS4SuH+tBlrhFXMsYaI4NukZQW4TK1zzbMrzOkdxFhW
 BOx17VFUKWtUnRmxinFGIA8Vj+FXN+E+ND+FoDsbrMyJD4maKDdJapPchG0J0Vbs
 pDcsB4c0pBC6H2xrobKiA1CuSq2t2qvyvwe1Zl2Xd+RVW9vBB5SI6HXYrC+UtxwY
 7LfX8F13cFD1E6iJ9Nta6x8fOunGnOVBdW5O0k4hDWEuZduvHItEDn2c3Ehqp4Jw
 P8dFBbk8SQ==
 =gAYf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD pull request via Song:
      - Improve raid5 sequential IO performance on spinning disks, which
        fixes a regression since v6.0 (Jan Kara)
      - Fix bitmap offset types, which fixes an issue introduced in this
        merge window (Jonathan Derrick)

 - Cleanup of hweight type used for cgroup writeback (Maxim)

 - Fix a regression with the "has_submit_bio" changes across partitions
   (Ming)

 - Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.

   We used to set this flag on queues non blk-mq queues, and hence some
   drivers clear it unconditionally. Since all of these have since been
   converted to true blk-mq drivers, drop the useless clear as the bit
   is not set (Chaitanya)

 - Fix the flags being set in a bio for a flush for drbd (Christoph)

 - Cleanup and deduplication of the code handling setting block device
   capacity (Damien)

 - Fix for ublk handling IO timeouts (Ming)

 - Fix for a regression in blk-cgroup teardown (Tao)

 - NBD documentation and code fixes (Eric)

 - Convert blk-integrity to using device_attributes rather than a second
   kobject to manage lifetimes (Thomas)

* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
  ublk: add timeout handler
  drbd: correctly submit flush bio on barrier
  mailmap: add mailmap entries for Jens Axboe
  block: Skip destroyed blkg when restart in blkg_destroy_all()
  writeback: fix call of incorrect macro
  md: Fix bitmap offset type in sb writer
  md/raid5: Improve performance for sequential IO
  docs nbd: userspace NBD now favors github over sourceforge
  block nbd: use req.cookie instead of req.handle
  uapi nbd: add cookie alias to handle
  uapi nbd: improve doc links to userspace spec
  blk-integrity: register sysfs attributes on struct device
  blk-integrity: convert to struct device_attribute
  blk-integrity: use sysfs_emit
  block/drivers: remove dead clear of random flag
  block: sync part's ->bd_has_submit_bio with disk's
  block: Cleanup set_capacity()/bdev_set_nr_sectors()
2023-05-06 08:28:58 -07:00
Linus Torvalds
7644c82319 pipe-nonblock-2023-05-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWK50QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkP8D/9PKQy4FCTRKRYABomlkqc9UlrsEzkkpXy+
 t2kCZ7OmgZ1go/3gRv8AVx0xC/S8zgs8vCwriUXVQH2Q4D2FnV9/Cy5xXzQJ2jE5
 sWdj5+Sks2Zerq7hRfItxFxP70Jlvn4r2Ud6n9uoBKI+DirpGQB5xwqCF/Z9I9Ev
 cAOtq3cvdKqiqAmrA0WAjjpLpvuC/OOEXyQf6n+viQ7iJZgCHDWplpa4jQCamrlJ
 wDz5eNkptDc3KnZ1QGtfMlCdmzwjFEyi20Eq8VLo6OBkhoP6FymqWC6YkeigibI4
 Da+fyx/wJvqnsuXPLCFgla5MXMRHLb9uoRRFbon6XqXABPZ8v4v8kZhswVLUxBzz
 UjBI8cmKQ8rkHu+4o/522lCRQCMU5h7LxGgCguePi9+Hap2AuBSmpOzpgDqnNDmN
 ARD6gW4uOkXRU6CgvppOio3hTn3BU9bUs8iN8ym3XW0eQo6nUElRT2mwC1XEJDYn
 HAM1MqoJWU5rWp1yyAdiw8rLBOmMLMyNaowHlWEjEGiykU2FlfNDp/HJeq/gxjEK
 NxBC5+qlhbDp3hTqDJVaRa+i8MchrTJLXfJuXzP3el3c1dp2vU5gLO49WYAA6G5d
 oycxTm2SyX6dUwWNeH06jgocVt99fC7klTuEq59Jm9ykFSfbj2kfv24U+jDi/KvX
 DW++jNDl+g==
 =uryD
 -----END PGP SIGNATURE-----

Merge tag 'pipe-nonblock-2023-05-06' of git://git.kernel.dk/linux

Pull nonblocking pipe io_uring support from Jens Axboe:
 "Here's the revised edition of the FMODE_NOWAIT support for pipes, in
  which we just flag it as such supporting FMODE_NOWAIT unconditionally,
  but clear it if we ever end up using splice/vmsplice on the pipe.

  The pipe read/write side is perfectly fine for nonblocking IO, however
  splice and vmsplice can potentially wait for IO with the pipe lock
  held"

* tag 'pipe-nonblock-2023-05-06' of git://git.kernel.dk/linux:
  pipe: set FMODE_NOWAIT on pipes
  splice: clear FMODE_NOWAIT on file if splice/vmsplice is used
2023-05-06 08:15:20 -07:00
Linus Torvalds
2e1e133788 ten ksmbd server fixes, including some important security fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRUx3IACgkQiiy9cAdy
 T1G8MAv/TY6Oprfh58t4I4bWQKW49kOzFsPbNk9454CnjEnlV1jHirAKolTEN3ut
 11ZvzAcUUPa23EBPOKaeroUI718AcAp91ByH58ERZtwHFz0ml0/ePL4+ihGiCUn3
 0rTFSuZy1rW8Ld5sM9PMuDD3bzqAnmGZlOXAPg9acrbNyUwGguv4OF3MEjvYXpwF
 vzcixvnG0HsIgCyM0gl4xK6vECx8PoDbUZDs44VWAPf5676PNwntJcbYq/GVCCGq
 EiPDm1ke40nmSjxSDLQcpfWj3d6dYrmm6PVpa3wUSdI+MeuqSm0YR9DptCo6cawE
 WqE6ZrUVEJyUlH6mFu6O4VZhwQ8jO4Jsasr3tay9crD8QSCLThLFkH4/Xg1u5JJw
 miMaGk43hKHFStA00rObmLLFtk+HrPdKv5HIbByrm3m5x58hG+2PtKnsD8OI8Ny3
 p7/lS4H/HHE9BZ44//rKtsRsAe7OAWHfYjiS/pPIhjepxr+2ra1RaI2ciKLNwIqd
 tbPTVemh
 =sgnv
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc-ksmbd-server-fixes-part2' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:
 "Ten ksmbd server fixes, including some important security fixes:

   - Two use after free fixes

   - Fix RCU callback race

   - Deadlock fix

   - Three patches to prevent session setup attacks

   - Prevent guest users from establishing multichannel sessions

   - Fix null pointer dereference in query FS info

   - Memleak fix"

* tag '6.4-rc-ksmbd-server-fixes-part2' of git://git.samba.org/ksmbd:
  ksmbd: call rcu_barrier() in ksmbd_server_exit()
  ksmbd: fix racy issue under cocurrent smb2 tree disconnect
  ksmbd: fix racy issue from smb2 close and logoff with multichannel
  ksmbd: not allow guest user on multichannel
  ksmbd: fix deadlock in ksmbd_find_crypto_ctx()
  ksmbd: block asynchronous requests when making a delay on session setup
  ksmbd: destroy expired sessions
  ksmbd: fix racy issue from session setup and logoff
  ksmbd: fix NULL pointer dereference in smb2_get_info_filesystem()
  ksmbd: fix memleak in session setup
2023-05-05 19:16:58 -07:00
Linus Torvalds
ed23734c23 Including fixes from netfilter.
Current release - regressions:
 
  - sched: act_pedit: free pedit keys on bail from offset check
 
 Current release - new code bugs:
 
  - pds_core:
   - Kconfig fixes (DEBUGFS and AUXILIARY_BUS)
   - fix mutex double unlock in error path
 
 Previous releases - regressions:
 
  - sched: cls_api: remove block_cb from driver_list before freeing
 
  - nf_tables: fix ct untracked match breakage
 
  - eth: mtk_eth_soc: drop generic vlan rx offload
 
  - sched: flower: fix error handler on replace
 
 Previous releases - always broken:
 
  - tcp: fix skb_copy_ubufs() vs BIG TCP
 
  - ipv6: fix skb hash for some RST packets
 
  - af_packet: don't send zero-byte data in packet_sendmsg_spkt()
 
  - rxrpc: timeout handling fixes after moving client call connection
    to the I/O thread
 
  - ixgbe: fix panic during XDP_TX with > 64 CPUs
 
  - igc: RMW the SRRCTL register to prevent losing timestamp config
 
  - dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621
 
  - r8152:
    - fix flow control issue of RTL8156A
    - fix the poor throughput for 2.5G devices
    - move setting r8153b_rx_agg_chg_indicate() to fix coalescing
    - enable autosuspend
 
  - ncsi: clear Tx enable mode when handling a Config required AEN
 
  - octeontx2-pf: macsec: fixes for CN10KB ASIC rev
 
 Misc:
 
  - 9p: remove INET dependency
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmRVeUIACgkQMUZtbf5S
 IrtTug/9Hhg/L0PTSwrfuGh4W1/cjheMWppNLkwyWQUiKG7FcZQ9vu9PxceE3VRu
 2fTqHyvgDMZ8jACovXObeda8z1+g3s/tIPaXELephBIjVlF/h3kG2OaIzlU4jDb4
 A4vklwf8eLbfyVBG22QgKl/I70zVMtnmnOo6c6CPuIOTcMPzslndFO9tB0nCg99F
 DCgCM1BBP1tz+OUch2rLnSzYcqkWqS49BhRk6dhYSliawUFU/5+1tDGDjwWolkfm
 0jqP9DjBOSpZKO8m7SpsUNz7NFRIfYErWZ+YebWbggNxj/6TRJTP83MM0tGoK1rE
 /mz2xpuOki59frlwVOAD6gb/qefjHUp21P4NA7bnhizxFlQL5MHpCeGQ9yLHBSmY
 9Q4ArJkM4jXQ0oDA2nII/pz+cDZGEWFGQ14WW3kYUb7WFmISH4I9OiA9i0TBW6OL
 r1Y/rqzkUvtKWzh9RpiAF9lsdHAm3SX9ES5RfMxzv0x886VOZR4jaMmokRDdPRzq
 0r2Oyj75b62+X0r44Fe22Pl/kPS/uh3642xo9h85aAv/EvhT9JNzMvomJm9d6tkb
 966I085AVbwxPAy+rl5SWyAq60EWDExNTjZvPv0mSMlmSsQ9iK5//xOF2Saw2zai
 /44zQ27tVGkCC44Ou5KmfJN3u4OrKkhcuyxtcDr9QeoOdKZRkMg=
 =9xND
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - sched: act_pedit: free pedit keys on bail from offset check

  Current release - new code bugs:

   - pds_core:
      - Kconfig fixes (DEBUGFS and AUXILIARY_BUS)
      - fix mutex double unlock in error path

  Previous releases - regressions:

   - sched: cls_api: remove block_cb from driver_list before freeing

   - nf_tables: fix ct untracked match breakage

   - eth: mtk_eth_soc: drop generic vlan rx offload

   - sched: flower: fix error handler on replace

  Previous releases - always broken:

   - tcp: fix skb_copy_ubufs() vs BIG TCP

   - ipv6: fix skb hash for some RST packets

   - af_packet: don't send zero-byte data in packet_sendmsg_spkt()

   - rxrpc: timeout handling fixes after moving client call connection
     to the I/O thread

   - ixgbe: fix panic during XDP_TX with > 64 CPUs

   - igc: RMW the SRRCTL register to prevent losing timestamp config

   - dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621

   - r8152:
      - fix flow control issue of RTL8156A
      - fix the poor throughput for 2.5G devices
      - move setting r8153b_rx_agg_chg_indicate() to fix coalescing
      - enable autosuspend

   - ncsi: clear Tx enable mode when handling a Config required AEN

   - octeontx2-pf: macsec: fixes for CN10KB ASIC rev

  Misc:

   - 9p: remove INET dependency"

* tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
  net: bcmgenet: Remove phy_stop() from bcmgenet_netif_stop()
  pds_core: fix mutex double unlock in error path
  net/sched: flower: fix error handler on replace
  Revert "net/sched: flower: Fix wrong handle assignment during filter change"
  net/sched: flower: fix filter idr initialization
  net: fec: correct the counting of XDP sent frames
  bonding: add xdp_features support
  net: enetc: check the index of the SFI rather than the handle
  sfc: Add back mailing list
  virtio_net: suppress cpu stall when free_unused_bufs
  ice: block LAN in case of VF to VF offload
  net: dsa: mt7530: fix network connectivity with multiple CPU ports
  net: dsa: mt7530: fix corrupt frames using trgmii on 40 MHz XTAL MT7621
  9p: Remove INET dependency
  netfilter: nf_tables: fix ct untracked match breakage
  af_packet: Don't send zero-byte data in packet_sendmsg_spkt().
  igc: read before write to SRRCTL register
  pds_core: add AUXILIARY_BUS and NET_DEVLINK to Kconfig
  pds_core: remove CONFIG_DEBUG_FS from makefile
  ionic: catch failure from devlink_alloc
  ...
2023-05-05 19:12:01 -07:00
Jakub Kicinski
644bca1d48 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
There's a fix which landed in net-next, pull it in along
with the couple of minor cleanups.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-05 13:35:45 -07:00
Yang Li
9ee04875ae cifs: Remove unneeded semicolon
./fs/cifs/smb2pdu.c:4140:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4863
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-05 04:29:58 -05:00
Paulo Alcantara
8e3554150d cifs: fix sharing of DFS connections
When matching DFS connections, we can't rely on the values set in
cifs_sb_info::prepath and cifs_tcon::tree_name as they might change
during DFS failover.  The DFS referrals related to a specific DFS tcon
are already matched earlier in match_server(), therefore we can safely
skip those checks altogether as the connection is guaranteed to be
unique for the DFS tcon.

Besides, when creating or finding an SMB session, make sure to also
refcount any DFS root session related to it (cifs_ses::dfs_root_ses),
so if a new DFS mount ends up reusing the connection from the old
mount while there was an umount(2) still in progress (e.g. umount(2)
-> cifs_umount() -> reconnect -> cifs_put_tcon()), the connection
could potentially be put right after the umount(2) finished.

Patch has minor update to include fix for unused variable issue
noted by the kernel test robot

Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/oe-kbuild-all/202305041040.j7W2xQSy-lkp@intel.com/
Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-04 16:54:44 -05:00
Linus Torvalds
3c4aa44343 A few filesystem improvements, with a rather nasty use-after-free fix
from Xiubo intended for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmRT9yoTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi0dKB/9bx9W3GH7YnlJNkUq9eNAQjKlQ7B6s
 i1T/ulPW3kidNwlK5z2ke7bLVN6CzZotOCjAqXHvyWAPFC46NQW+i6jLGgi0PVac
 acCNuF5Vda5cuKN1xAumGsxCiAScCLvyr1v6BvkkRogrOMTnm5qOhUJn/ED2sjrt
 02OPSpQ1Mt586GRNiYz4PaaNr9NgTUVHE/BV2GHX1EuTxNhMaeIVnH8xxN6DU9lg
 LN5dMeCsYbM0hCOccu5zdnGtCy/OGn6oAmVWpiXFJ5WJFJV19QO5sg8BenzNO1gm
 Hj66nEbN9cvxZqLeBPvWe+GeNlK/LR3c6jQQgTeniVqm0qJmf29FYWYu
 =gK0m
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "A few filesystem improvements, with a rather nasty use-after-free fix
  from Xiubo intended for stable"

* tag 'ceph-for-6.4-rc1' of https://github.com/ceph/ceph-client:
  ceph: reorder fields in 'struct ceph_snapid_map'
  ceph: pass ino# instead of old_dentry if it's disconnected
  ceph: fix potential use-after-free bug when trimming caps
  ceph: implement writeback livelock avoidance using page tagging
  ceph: do not print the whole xattr value if it's too long
2023-05-04 14:48:02 -07:00
Linus Torvalds
8e15605be8 9p patches for 6.4 merge window
This pull request includes a number of patches that didn't quite make
 the cut last merge window while we addressed some outstanding issues
 and review comments.  It includes some new caching modes for those that
 only want readahead caches and reworks how we do writeback caching so we
 are not keeping extra references around which both causes performance problems
 and uses lots of additional resources on the server.
 
 It also includes a new flag to force disabling of xattrs which can also
 cause major performance issues, particularly if the underlying filesystem
 on the server doesn't support them.
 
 Finally it adds a couple of additional mount options to better support
 directio and enabling caches when the server doesn't support qid.version.
 
 There was one late-breaking bug report that has also been included as its
 own patch where I forgot to propagate an embarassing bit-logic fix to the
 various variations of open.  Since that was only added to for-next a week
 ago, if you would like to not include it, I can include it in the first
 round of fixes for -rc2.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmRS/LEACgkQiP/V+0pf
 /5i84RAAqFtTYPYxg0VMLvjix0BCCE7CYR+vr7UsoGFeuVRyFsKh7G6fhmRXUhpG
 2kf+1CK+xzQd9PEKiQnVmGhib9SCdeWqb9EUpCgLZEmrSci0qUenzjf3Jg5Qgwhx
 u6xu9tipzHeHMFBD6n0f+j2fZEsDAv5IzgL14F9YfJQueVsL+HUOifLncruWnUEn
 rJAf7omhE1x+FfWsNaB22AksvYUXtQfoG3MgCPWql/XKlo/6xeW/8/eprutIO06M
 LUp3ie4RbSRZiE63SfPhxFMJCZ7g+R1JSKe1J8i9bbbwzYsVO8dovwdMCw0tP7ZQ
 QjVq+Ng4x0Rn5Bmekj18ua7zRsbl96DaoqcnYWKUOHkRVJleTG9t31MeZQaRC+gh
 F3321XkqDBHMXPVVLVQ1Gb1Pxt8dk9iFwmTMxktTrM4n5Zwv8ldMDKQgOhDIv9WA
 dDTjZ7mYpk2f9atxA0oys5oebQlT3D0CM/3p6n/PsXXpk30tat99kgcKlpN8SrL+
 Fs0UkXTQuu0Vin8zaMarQ2TW2UGQVM8qRD2gTooRnOItdqItx17czaE2ALOdIuVs
 LCbDxOXNPP/YbzakNUxh5ldI3Z3eahbilVfGa8ILfvprKnH7MaMmSo0rkP4XHj1h
 JDUiN7JOCOW4dvfaXa4knSIjs62Y9oTS0MWrO9Ajq8bg4ku6U7c=
 =qXIe
 -----END PGP SIGNATURE-----

Merge tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p updates from Eric Van Hensbergen:
 "This includes a number of patches that didn't quite make the cut last
  merge window while we addressed some outstanding issues and review
  comments. It includes some new caching modes for those that only want
  readahead caches and reworks how we do writeback caching so we are not
  keeping extra references around which both causes performance problems
  and uses lots of additional resources on the server.

  It also includes a new flag to force disabling of xattrs which can
  also cause major performance issues, particularly if the underlying
  filesystem on the server doesn't support them.

  Finally it adds a couple of additional mount options to better support
  directio and enabling caches when the server doesn't support
  qid.version.

  There was one late-breaking bug report that has also been included as
  its own patch where I forgot to propagate an embarassing bit-logic fix
  to the various variations of open"

* tag '9p-6.4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: Fix bit operation logic error
  fs/9p: Rework cache modes and add new options to Documentation
  fs/9p: remove writeback fid and fix per-file modes
  fs/9p: Add new mount modes
  9p: Add additional debug flags and open modes
  fs/9p: allow disable of xattr support on mount
  fs/9p: Remove unnecessary superblock flags
  fs/9p: Consolidate file operations and add readahead and writeback
2023-05-04 14:37:53 -07:00
Jason Andryuk
d7385ba137 9p: Remove INET dependency
9pfs can run over assorted transports, so it doesn't have an INET
dependency.  Drop it and remove the includes of linux/inet.h.

NET_9P_FD/trans_fd.o builds without INET or UNIX and is usable over
plain file descriptors.  However, tcp and unix functionality is still
built and would generate runtime failures if used.  Add imply INET and
UNIX to NET_9P_FD, so functionality is enabled by default but can still
be explicitly disabled.

This allows configuring 9pfs over Xen with INET and UNIX disabled.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-04 21:46:57 +01:00
Linus Torvalds
15fb96a35d - Some DAMON cleanups from Kefeng Wang
- Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
   ioctl's behavior more similar to KSM's behavior.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZFLsxAAKCRDdBJ7gKXxA
 jl8yAQCqjstPsOULf9QN0z4bGAUhY+Wj4ERz1jbKSIuhFCJWiQEAgQvgRXObKjmi
 OtUB0Ek4CMDCQzbyIQ1Bhp3kxi6+Jgs=
 =AbyC
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more MM updates from Andrew Morton:

 - Some DAMON cleanups from Kefeng Wang

 - Some KSM work from David Hildenbrand, to make the PR_SET_MEMORY_MERGE
   ioctl's behavior more similar to KSM's behavior.

[ Andrew called these "final", but I suspect we'll have a series fixing
  up the fact that the last commit in the dmapools series in the
  previous pull seems to have unintentionally just reverted all the
  other commits in the same series..   - Linus ]

* tag 'mm-stable-2023-05-03-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: hwpoison: coredump: support recovery from dump_user_range()
  mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()
  mm/ksm: move disabling KSM from s390/gmap code to KSM code
  selftests/ksm: ksm_functional_tests: add prctl unmerge test
  mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0
  mm/damon/paddr: fix missing folio_sz update in damon_pa_young()
  mm/damon/paddr: minor refactor of damon_pa_mark_accessed_or_deactivate()
  mm/damon/paddr: minor refactor of damon_pa_pageout()
2023-05-04 13:09:43 -07:00
Paulo Alcantara
6be2ea33a4 cifs: avoid potential races when handling multiple dfs tcons
Now that a DFS tcon manages its own list of DFS referrals and
sessions, there is no point in having a single worker to refresh
referrals of all DFS tcons.  Make it faster and less prone to race
conditions when having several mounts by queueing a worker per DFS
tcon that will take care of refreshing only the DFS referrals related
to it.

Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:29:47 -05:00
Paulo Alcantara
3dc9c433c9 cifs: protect access of TCP_Server_Info::{origin,leaf}_fullpath
Protect access of TCP_Server_Info::{origin,leaf}_fullpath when
matching DFS connections, and get rid of
TCP_Server_Info::current_fullpath while we're at it.

Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:29:39 -05:00
Paulo Alcantara
ee20d7c610 cifs: fix potential race when tree connecting ipc
Protect access of TCP_Server_Info::hostname when building the ipc tree
name as it might get freed in cifsd thread and thus causing an
use-after-free bug in __tree_connect_dfs_target().  Also, while at it,
update status of IPC tcon on success and then avoid any extra tree
connects.

Cc: stable@vger.kernel.org # v6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:29:05 -05:00
Namjae Jeon
eb307d09fe ksmbd: call rcu_barrier() in ksmbd_server_exit()
racy issue is triggered the bug by racing between closing a connection
and rmmod. In ksmbd, rcu_barrier() is not called at module unload time,
so nothing prevents ksmbd from getting unloaded while it still has RCU
callbacks pending. It leads to trigger unintended execution of kernel
code locally and use to defeat protections such as Kernel Lockdown

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20477
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:02 -05:00
Namjae Jeon
30210947a3 ksmbd: fix racy issue under cocurrent smb2 tree disconnect
There is UAF issue under cocurrent smb2 tree disconnect.
This patch introduce TREE_CONN_EXPIRE flags for tcon to avoid cocurrent
access.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20592
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:02 -05:00
Namjae Jeon
abcc506a9a ksmbd: fix racy issue from smb2 close and logoff with multichannel
When smb client send concurrent smb2 close and logoff request
with multichannel connection, It can cause racy issue. logoff request
free tcon and can cause UAF issues in smb2 close. When receiving logoff
request with multichannel, ksmbd should wait until all remaning requests
complete as well as ones in the current connection, and then make
session expired.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20796 ZDI-CAN-20595
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:02 -05:00
Namjae Jeon
3353ab2df5 ksmbd: not allow guest user on multichannel
This patch return STATUS_NOT_SUPPORTED if binding session is guest.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20480
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:02 -05:00
Namjae Jeon
7b4323373d ksmbd: fix deadlock in ksmbd_find_crypto_ctx()
Deadlock is triggered by sending multiple concurrent session setup
requests. It should be reused after releasing when getting ctx for crypto.
Multiple consecutive ctx uses cause deadlock while waiting for releasing
due to the limited number of ctx.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20591
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Namjae Jeon
b096d97f47 ksmbd: block asynchronous requests when making a delay on session setup
ksmbd make a delay of 5 seconds on session setup to avoid dictionary
attacks. But the 5 seconds delay can be bypassed by using asynchronous
requests. This patch block all requests on current connection when
making a delay on sesstion setup failure.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20482
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Namjae Jeon
ea174a9189 ksmbd: destroy expired sessions
client can indefinitely send smb2 session setup requests with
the SessionId set to 0, thus indefinitely spawning new sessions,
and causing indefinite memory usage. This patch limit to the number
of sessions using expired timeout and session state.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20478
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Namjae Jeon
f5c779b7dd ksmbd: fix racy issue from session setup and logoff
This racy issue is triggered by sending concurrent session setup and
logoff requests. This patch does not set connection status as
KSMBD_SESS_GOOD if state is KSMBD_SESS_NEED_RECONNECT in session setup.
And relookup session to validate if session is deleted in logoff.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20481, ZDI-CAN-20590, ZDI-CAN-20596
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Namjae Jeon
3ac00a2ab6 ksmbd: fix NULL pointer dereference in smb2_get_info_filesystem()
If share is , share->path is NULL and it cause NULL pointer
dereference issue.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20479
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Namjae Jeon
6d7cb549c2 ksmbd: fix memleak in session setup
If client send session setup request with unknown NTLMSSP message type,
session that does not included channel can be created. It will cause
session memleak. because ksmbd_sessions_deregister() does not destroy
session if channel is not included. This patch return error response if
client send the request unknown NTLMSSP message type.

Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-20593
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-03 23:03:01 -05:00
Linus Torvalds
049a18f232 sysctl-6.4-rc1-v2
As mentioned on my first pull request for sysctl-next, for v6.4-rc1
 we're very close to being able to deprecating register_sysctl_paths().
 I was going to assess the situation after the first week of the merge
 window.
 
 That time is now and things are looking good. We only have one stragglers
 on the patch which had already an ACK for so I'm picking this up here now and
 the last patch is the one that uses an axe. Some careful eyeballing would
 be appreciated by others. If this doesn't get properly reviewed I can also
 just hold off on this in my tree for the next merge window. Either way is
 fine by me.
 
 I have boot tested the last patch and 0-day build completed successfully.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRSsn0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinzzMQAK6ddUwQM32z6E2SY/Ku6ZDQJhVKxE+Y
 +HvghMqaGzr2eawEaASZzV6p//Q1aH4c2yaChyENa/O82QBXhbc2RBvdiAzQeJZx
 cUQ4C6Lc+BlpoB24Nes69F9j1LAEI5YXKMK911DKDu7LNNS7Ytxt1IOfM2RpyqRV
 6+9vOvAqCSh9EEjZeZDrMlsYhBA+t3YIkU6JFMX7Upc2P7m//57inLsZyUZBqnou
 t9sfC0d1lDTZXZ0vSIk534VhoxXe1MkYERKkAciEprxbdNnqcsi4WMXKdXG6Mcpy
 O1ZuUXqndAfhTSHLkqNidtuDP29TTvcdz5tDfwmaJ3JUTt0cDvlC2T7J9WyXDfCZ
 XsR8Ik0/vEH/j9rVabF9fQ8DeTSLe9AgpaItHd6/LWI8UESs5k/wYi9O+7lhCf2p
 JZpXl3G1itKA9ABMD1GUEtC5hfWTUxkTEgPkXbqFuKtCl0mI8lD3FPFRbuhYNLa8
 7R/6SN9h6/43C9Ffp2bY3c/gKQj51QlvGOSctahvdYSFkG8KXKhEnsDu0V6eLM9G
 QYrhvht8o9jbuKJKtEno9fjTlClVvXp5vARQQyy9OHyTuhU/Y8q2lH2BCZtFYZrM
 cpIFdfqB18tZmo7QfNHZPLfws2j3MqsbXFG8Q23BB7cJp1P5QdLJjmjnqbMq8xmk
 3kdRMZ2Xkfcx
 =8VOt
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.4-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull more sysctl updates from Luis Chamberlain:
 "As mentioned on my first pull request for sysctl-next, for v6.4-rc1
  we're very close to being able to deprecating register_sysctl_paths().
  I was going to assess the situation after the first week of the merge
  window.

  That time is now and things are looking good. We only have one which
  had already an ACK for so I'm picking this up here now and the last
  patch is the one that uses an axe.

  I have boot tested the last patch and 0-day build completed
  successfully"

* tag 'sysctl-6.4-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
  sysctl: remove register_sysctl_paths()
  kernel: pid_namespace: simplify sysctls with register_sysctl()
2023-05-03 19:08:20 -07:00
Linus Torvalds
342528ff00 This pull request contains the following changes for UML:
- Make stub data pages configurable
 - Make it harder to mix user and kernel code by accident
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmRSslcWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wVbGEADWL/4rcVNqaD/b9V3bUXDkNhpG
 81Jc9fXcx3OXV87GsxfadiQI05UaOhTAMWJlsdKUF0/IW5tSQSZc0QcDNxTs6aS1
 NDkWvjk8s8OpbqdmV0cNKFmaeginrH4AuASajvNleU0BxX4ziqw6BPadih+wGVGz
 iVEd/M5YH2yet33pvD7Wzh3KZOPlo2OOKBqXzE6Snc3yoOiIfN95U8Cl23nxzMI4
 F235LS5m+4+GWIMkYs6+hlWEa8SxSnsH/beQJxAybtetSHRxsDXOHlEpDI9olxYI
 +OFNqsFLDyhyjib2Snwt8HMES9VmMfUUwT7ZK1cChXN8Quix3lXAyz6rpocmdnL/
 X1tzqWb1bz+yqrVRjtN3AGbcav+sQp4UWXsoExVsiz3lzaXfrbPvRZFucH2+SStK
 Wadyy2v2rHZKNq9sDLUBf5GJ0C4LrBAyT15ADim2L5/K7pZyI7AKSPcdU0Xc8m5y
 kMswml7FLaI+gsuKOBl7jySaoNDiVZFvQjsxd1+hn5cBNOcyBSt2rSErDw3geJgn
 pFzVkLtklfxMpHMm+bwXONBZMrXOBPNgo0gSkVrCVzO68x6j/MaBGLFazSqOKxbc
 UcVULhiomjIae4AcTomzOQx7oM1Xs3XuPo0x2RgJ3gMlCCZKoDNuqDgB3ZmPP/gq
 cbbHHveXEx7aMOPz0A==
 =Qb+I
 -----END PGP SIGNATURE-----

Merge tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux

Pull uml updates from Richard Weinberger:

 - Make stub data pages configurable

 - Make it harder to mix user and kernel code by accident

* tag 'uml-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux:
  um: make stub data pages size tweakable
  um: prevent user code in modules
  um: further clean up user_syms
  um: don't export printf()
  um: hostfs: define our own API boundary
  um: add __weak for exported functions
2023-05-03 19:02:03 -07:00
Linus Torvalds
9f2692326b This pull request contains updates for UBI and UBIFS
UBI:
 	- Fix error value for try_write_vid_and_data()
 	- Minor cleanups
 
 UBIFS:
 	- Fixes for various memory leaks
 	- Minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmRSs7cWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wcx4D/9zdLoaz3jv7P8vIMwHW/1Ewcjz
 Muvd5zPha31pjExK9Q90G8GzWX24rVUUXJPzO9zi88ayPiM0kD/lj4g8VD2a7E+G
 khK7K87ar9PitGeydzfByCed33tBaZ3XrDwOct2Byikdpl/lH6VQN4f6byKuAosA
 hSHw3c3Cf8XqnVq//mZTga9P+I0//gsTVBYMGEy4aaH5Ypi6kYz32GG5ERnYJpTG
 pRNon+CQWY6/+SfA1RAir2DqPU++Q5K50rErb1VQ2A/CdybbxgXZWUZ0zt3JHmdY
 CuoUSqIUFg/Abw57hDO+b+BbRMda2LHRO+xp/jCX325uQoBOnDhXQqr5g44kYFfI
 mxI1v/VZVmecyh5lYqxMoAEtHMOUOZGhOEiVgWNuQX8pl+9vUEd0gHuuo2d6woiI
 qM+W8kaZIWbBXnvQ8hzjMYuhWvMLpD2Lq2ZQCqkE/gNUgrdCk4D6G3fVd1x7Yypr
 iIbcg73xFyTurnIw20xqI03Acj0tbL3ofFMrxD6lS4y4drGPQxrKupN0jMTseHHB
 ZhxbB1zmYDntBCvQQSox3MCghFh4I5FSS5Dj6u6alAcmyctzOEbktC2hiGLtdGAH
 xhSKbflLogy8+W7GI4hhAOZNcBW3mqCFi1smlPSgo1m3a56gehw9tqoqBz5OLzGI
 BeusDNFBi1vgnGCsSw==
 =3ows
 -----END PGP SIGNATURE-----

Merge tag 'ubifs-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs

Pull UBI and UBIFS updates from Richard Weinberger:
 "UBI:

   - Fix error value for try_write_vid_and_data()

   - Minor cleanups

  UBIFS:

   - Fixes for various memory leaks

   - Minor cleanups"

* tag 'ubifs-for-linus-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
  ubifs: Fix memleak when insert_old_idx() failed
  Revert "ubifs: dirty_cow_znode: Fix memleak in error handling path"
  ubifs: Fix memory leak in do_rename
  ubifs: Free memory for tmpfile name
  ubi: Fix return value overwrite issue in try_write_vid_and_data()
  ubifs: Remove return in compr_exit()
  ubi: Simplify bool conversion
2023-05-03 18:58:59 -07:00
Josef Bacik
d246331b78 btrfs: don't free qgroup space unless specified
Boris noticed in his simple quotas testing that he was getting a leak
with Sweet Tea's change to subvol create that stopped doing a
transaction commit.  This was just a side effect of that change.

In the delayed inode code we have an optimization that will free extra
reservations if we think we can pack a dir item into an already modified
leaf.  Previously this wouldn't be triggered in the subvolume create
case because we'd commit the transaction, it was still possible but
much harder to trigger.  It could actually be triggered if we did a
mkdir && subvol create with qgroups enabled.

This occurs because in btrfs_insert_delayed_dir_index(), which gets
called when we're adding the dir item, we do the following:

  btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL);

if we're able to skip reserving space.

The problem here is that trans->block_rsv points at the temporary block
rsv for the subvolume create, which has qgroup reservations in the block
rsv.

This is a problem because btrfs_block_rsv_release() will do the
following:

  if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
	  qgroup_to_release = block_rsv->qgroup_rsv_reserved -
		  block_rsv->qgroup_rsv_size;
	  block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
  }

The temporary block rsv just has ->qgroup_rsv_reserved set,
->qgroup_rsv_size == 0.  The optimization in
btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0.  Then
later on when we call btrfs_subvolume_release_metadata() which has

  btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release);
  btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);

qgroup_to_release is set to 0, and we do not convert the reserved
metadata space.

The problem here is that the block rsv code has been unconditionally
messing with ->qgroup_rsv_reserved, because the main place this is used
is delalloc, and any time we call btrfs_block_rsv_release() we do it
with qgroup_to_release set, and thus do the proper accounting.

The subvolume code is the only other code that uses the qgroup
reservation stuff, but it's intermingled with the above optimization,
and thus was getting its reservation freed out from underneath it and
thus leaking the reserved space.

The solution is to simply not mess with the qgroup reservations if we
don't have qgroup_to_release set.  This works with the existing code as
anything that messes with the delalloc reservations always have
qgroup_to_release set.  This fixes the leak that Boris was observing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-03 16:37:56 +02:00
Luis Chamberlain
0199849acd sysctl: remove register_sysctl_paths()
The deprecation for register_sysctl_paths() is over. We can rejoice as
we nuke register_sysctl_paths(). The routine register_sysctl_table()
was the only user left of register_sysctl_paths(), so we can now just
open code and move the implementation over to what used to be
to __register_sysctl_paths().

The old dynamic struct ctl_table_set *set is now the point to
sysctl_table_root.default_set.

The old dynamic const struct ctl_path *path was being used in the
routine register_sysctl_paths() with a static:

static const struct ctl_path null_path[] = { {} };

Since this is a null path we can now just simplfy the old routine
and remove its use as its always empty.

This saves us a total of 230 bytes.

$ ./scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 2/7 grow/shrink: 1/1 up/down: 1015/-1245 (-230)
Function                                     old     new   delta
register_leaf_sysctl_tables.constprop          -     524    +524
register_sysctl_table                         22     497    +475
__pfx_register_leaf_sysctl_tables.constprop       -      16     +16
null_path                                      8       -      -8
__pfx_register_sysctl_paths                   16       -     -16
__pfx_register_leaf_sysctl_tables             16       -     -16
__pfx___register_sysctl_paths                 16       -     -16
__register_sysctl_base                        29      12     -17
register_sysctl_paths                         18       -     -18
register_leaf_sysctl_tables                  534       -    -534
__register_sysctl_paths                      620       -    -620
Total: Before=21259666, After=21259436, chg -0.00%

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-05-02 19:24:16 -07:00
Kefeng Wang
245f092268 mm: hwpoison: coredump: support recovery from dump_user_range()
dump_user_range() is used to copy the user page to a coredump file, but if
a hardware memory error occurred during copy, which called from
__kernel_write_iter() in dump_user_range(), it crashes,

  CPU: 112 PID: 7014 Comm: mca-recover Not tainted 6.3.0-rc2 #425

  pc : __memcpy+0x110/0x260
  lr : _copy_from_iter+0x3bc/0x4c8
  ...
  Call trace:
   __memcpy+0x110/0x260
   copy_page_from_iter+0xcc/0x130
   pipe_write+0x164/0x6d8
   __kernel_write_iter+0x9c/0x210
   dump_user_range+0xc8/0x1d8
   elf_core_dump+0x308/0x368
   do_coredump+0x2e8/0xa40
   get_signal+0x59c/0x788
   do_signal+0x118/0x1f8
   do_notify_resume+0xf0/0x280
   el0_da+0x130/0x138
   el0t_64_sync_handler+0x68/0xc0
   el0t_64_sync+0x188/0x190

Generally, the '->write_iter' of file ops will use copy_page_from_iter()
and copy_page_from_iter_atomic(), change memcpy() to copy_mc_to_kernel()
in both of them to handle #MC during source read, which stop coredump
processing and kill the task instead of kernel panic, but the source
address may not always a user address, so introduce a new copy_mc flag in
struct iov_iter{} to indicate that the iter could do a safe memory copy,
also introduce the helpers to set/cleck the flag, for now, it's only used
in coredump's dump_user_range(), but it could expand to any other
scenarios to fix the similar issue.

Link: https://lkml.kernel.org/r/20230417045323.11054-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tong Tiangen <tongtiangen@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 17:21:50 -07:00
Tom Rix
fc412a6196 lockd: define nlm_port_min,max with CONFIG_SYSCTL
gcc with W=1 and ! CONFIG_SYSCTL
fs/lockd/svc.c:80:51: error: ‘nlm_port_max’ defined but not used [-Werror=unused-const-variable=]
   80 | static const int                nlm_port_min = 0, nlm_port_max = 65535;
      |                                                   ^~~~~~~~~~~~
fs/lockd/svc.c:80:33: error: ‘nlm_port_min’ defined but not used [-Werror=unused-const-variable=]
   80 | static const int                nlm_port_min = 0, nlm_port_max = 65535;
      |                                 ^~~~~~~~~~~~

The only use of these variables is when CONFIG_SYSCTL
is defined, so their definition should be likewise conditional.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-02 15:47:33 -04:00
Tom Rix
340086da9a nfsd: define exports_proc_ops with CONFIG_PROC_FS
gcc with W=1 and ! CONFIG_PROC_FS
fs/nfsd/nfsctl.c:161:30: error: ‘exports_proc_ops’
  defined but not used [-Werror=unused-const-variable=]
  161 | static const struct proc_ops exports_proc_ops = {
      |                              ^~~~~~~~~~~~~~~~

The only use of exports_proc_ops is when CONFIG_PROC_FS
is defined, so its definition should be likewise conditional.

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-02 15:46:54 -04:00
Marc Dionne
9ea4eff4b6 afs: Avoid endless loop if file is larger than expected
afs_read_dir fetches an amount of data that's based on what the inode
size is thought to be.  If the file on the server is larger than what
was fetched, the code rechecks i_size and retries.  If the local i_size
was not properly updated, this can lead to an endless loop of fetching
i_size from the server and noticing each time that the size is larger on
the server.

If it is known that the remote size is larger than i_size, bump up the
fetch size to that size.

Fixes: f3ddee8dc4 ("afs: Fix directory handling")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
2023-05-02 17:23:50 +01:00
David Howells
45f66fa03b afs: Fix getattr to report server i_size on dirs, not local size
Fix afs_getattr() to report the server's idea of the file size of a
directory rather than the local size.  The local size may differ as we edit
the local copy to avoid having to redownload it and we may end up with a
differently structured blob of a different size.

However, if the directory is discarded from the pagecache we then download
it again and the user may see the directory file size apparently change.

Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
2023-05-02 17:17:42 +01:00
Marc Dionne
d7f74e9a91 afs: Fix updating of i_size with dv jump from server
If the data version returned from the server is larger than expected,
the local data is invalidated, but we may still want to note the remote
file size.

Since we're setting change_size, we have to also set data_changed
for the i_size to get updated.

Fixes: 3f4aa98181 ("afs: Fix EOF corruption")
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
2023-05-02 17:08:18 +01:00
Paulo Alcantara
90c49fce1c cifs: fix potential use-after-free bugs in TCP_Server_Info::hostname
TCP_Server_Info::hostname may be updated once or many times during
reconnect, so protect its access outside reconnect path as well and
then prevent any potential use-after-free bugs.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-02 09:41:09 -05:00
Paulo Alcantara
1810769e3a cifs: print smb3_fs_context::source when mounting
Print full device name (UNC + optional prefix) from @old_ctx->source
when printing info about mount.

Before patch

  mount.cifs //srv/share/dir /mnt -o ...
  dmesg
  ...
  CIFS: Attempting to mount \\srv\share

After patch

  mount.cifs //srv/share/dir /mnt -o ...
  dmesg
  ...
  CIFS: Attempting to mount //srv/share/dir

Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-02 09:23:51 -05:00
Paulo Alcantara
5bff9f741a cifs: protect session status check in smb2_reconnect()
Use @ses->ses_lock to protect access of @ses->ses_status.

Cc: stable@vger.kernel.org
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-02 09:23:51 -05:00
Steve French
3d6b15a8f3 SMB3.1.1: correct definition for app_instance_id create contexts
The name lengths were incorrect for two create contexts.
     SMB2_CREATE_APP_INSTANCE_ID
     SMB2_CREATE_APP_INSTANCE_VERSION

Update the definitions for these two to match the protocol specs.

Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-05-02 09:23:23 -05:00
Boris Burkov
e7db9e5c6b btrfs: fix encoded write i_size corruption with no-holes
We have observed a btrfs filesystem corruption on workloads using
no-holes and encoded writes via send stream v2. The symptom is that a
file appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.

So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, i_size would be set to 1MiB, but the ninth extent,
nbyte, etc. would all appear correct otherwise.

The source of the race is a narrow (one line of code) window in which a
no-holes fs has read in an updated i_size, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)

  ENC-WR1 (second to last)                                         ENC-WR2 (last)
  -------                                                          -------
  btrfs_do_encoded_write
    set i_size = 1M
    submit bio B1 ending at 1M
  endio B1
  btrfs_inode_safe_disk_i_size_write
    local i_size = 1M
    falls off a cliff for some reason
							      btrfs_do_encoded_write
								set i_size = 1M+X
								submit bio B2 ending at 1M+X
							      endio B2
							      btrfs_inode_safe_disk_i_size_write
								local i_size = 1M+X
								disk_i_size = 1M+X
    disk_i_size = 1M
							      btrfs_delayed_update_inode
    btrfs_delayed_update_inode

And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect i_size and present a corrupted file missing its last
extents.

Fix this by holding the inode lock in the no-holes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date i_size.

Fixes: 41a2ee75aa ("btrfs: introduce per-inode file extent tree")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-02 14:21:00 +02:00
Darrick J. Wong
2254a7396a xfs: fix xfs_inodegc_stop racing with mod_delayed_work
syzbot reported this warning from the faux inodegc shrinker that tries
to kick off inodegc work:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 102 at kernel/workqueue.c:1445 __queue_work+0xd44/0x1120 kernel/workqueue.c:1444
RIP: 0010:__queue_work+0xd44/0x1120 kernel/workqueue.c:1444
Call Trace:
 __queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1672
 mod_delayed_work_on+0xe1/0x220 kernel/workqueue.c:1746
 xfs_inodegc_shrinker_scan fs/xfs/xfs_icache.c:2212 [inline]
 xfs_inodegc_shrinker_scan+0x250/0x4f0 fs/xfs/xfs_icache.c:2191
 do_shrink_slab+0x428/0xaa0 mm/vmscan.c:853
 shrink_slab+0x175/0x660 mm/vmscan.c:1013
 shrink_one+0x502/0x810 mm/vmscan.c:5343
 shrink_many mm/vmscan.c:5394 [inline]
 lru_gen_shrink_node mm/vmscan.c:5511 [inline]
 shrink_node+0x2064/0x35f0 mm/vmscan.c:6459
 kswapd_shrink_node mm/vmscan.c:7262 [inline]
 balance_pgdat+0xa02/0x1ac0 mm/vmscan.c:7452
 kswapd+0x677/0xd60 mm/vmscan.c:7712
 kthread+0x2e8/0x3a0 kernel/kthread.c:376
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

This warning corresponds to this code in __queue_work:

	/*
	 * For a draining wq, only works from the same workqueue are
	 * allowed. The __WQ_DESTROYING helps to spot the issue that
	 * queues a new work item to a wq after destroy_workqueue(wq).
	 */
	if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
		     WARN_ON_ONCE(!is_chained_work(wq))))
		return;

For this to trip, we must have a thread draining the inodedgc workqueue
and a second thread trying to queue inodegc work to that workqueue.
This can happen if freezing or a ro remount race with reclaim poking our
faux inodegc shrinker and another thread dropping an unlinked O_RDONLY
file:

Thread 0	Thread 1	Thread 2

xfs_inodegc_stop

				xfs_inodegc_shrinker_scan
				xfs_is_inodegc_enabled
				<yes, will continue>

xfs_clear_inodegc_enabled
xfs_inodegc_queue_all
<list empty, do not queue inodegc worker>

		xfs_inodegc_queue
		<add to list>
		xfs_is_inodegc_enabled
		<no, returns>

drain_workqueue
<set WQ_DRAINING>

				llist_empty
				<no, will queue list>
				mod_delayed_work_on(..., 0)
				__queue_work
				<sees WQ_DRAINING, kaboom>

In other words, everything between the access to inodegc_enabled state
and the decision to poke the inodegc workqueue requires some kind of
coordination to avoid the WQ_DRAINING state.  We could perhaps introduce
a lock here, but we could also try to eliminate WQ_DRAINING from the
picture.

We could replace the drain_workqueue call with a loop that flushes the
workqueue and queues workers as long as there is at least one inode
present in the per-cpu inodegc llists.  We've disabled inodegc at this
point, so we know that the number of queued inodes will eventually hit
zero as long as xfs_inodegc_start cannot reactivate the workers.

There are four callers of xfs_inodegc_start.  Three of them come from the
VFS with s_umount held: filesystem thawing, failed filesystem freezing,
and the rw remount transition.  The fourth caller is mounting rw (no
remount or freezing possible).

There are three callers ofs xfs_inodegc_stop.  One is unmounting (no
remount or thaw possible).  Two of them come from the VFS with s_umount
held: fs freezing and ro remount transition.

Hence, it is correct to replace the drain_workqueue call with a loop
that drains the inodegc llists.

Fixes: 6191cf3ad5 ("xfs: flush inodegc workqueue tasks before cancel")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:16:14 +10:00
Darrick J. Wong
2d5f38a319 xfs: disable reaping in fscounters scrub
The fscounters scrub code doesn't work properly because it cannot
quiesce updates to the percpu counters in the filesystem, hence it
returns false corruption reports.  This has been fixed properly in
one of the online repair patchsets that are under review by replacing
the xchk_disable_reaping calls with an exclusive filesystem freeze.
Disabling background gc isn't sufficient to fix the problem.

In other words, scrub doesn't need to call xfs_inodegc_stop, which is
just as well since it wasn't correct to allow scrub to call
xfs_inodegc_start when something else could be calling xfs_inodegc_stop
(e.g. trying to freeze the filesystem).

Neuter the scrubber for now, and remove the xchk_*_reaping functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:16:14 +10:00
Darrick J. Wong
b37c4c8339 xfs: check that per-cpu inodegc workers actually run on that cpu
Now that we've allegedly worked out the problem of the per-cpu inodegc
workers being scheduled on the wrong cpu, let's put in a debugging knob
to let us know if a worker ever gets mis-scheduled again.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:16:12 +10:00
Darrick J. Wong
03e0add80f xfs: explicitly specify cpu when forcing inodegc delayed work to run immediately
I've been noticing odd racing behavior in the inodegc code that could
only be explained by one cpu adding an inode to its inactivation llist
at the same time that another cpu is processing that cpu's llist.
Preemption is disabled between get/put_cpu_ptr, so the only explanation
is scheduler mayhem.  I inserted the following debug code into
xfs_inodegc_worker (see the next patch):

	ASSERT(gc->cpu == smp_processor_id());

This assertion tripped during overnight tests on the arm64 machines, but
curiously not on x86_64.  I think we haven't observed any resource leaks
here because the lockfree list code can handle simultaneous llist_add
and llist_del_all functions operating on the same list.  However, the
whole point of having percpu inodegc lists is to take advantage of warm
memory caches by inactivating inodes on the last processor to touch the
inode.

The incorrect scheduling seems to occur after an inodegc worker is
subjected to mod_delayed_work().  This wraps mod_delayed_work_on with
WORK_CPU_UNBOUND specified as the cpu number.  Unbound allows for
scheduling on any cpu, not necessarily the same one that scheduled the
work.

Because preemption is disabled for as long as we have the gc pointer, I
think it's safe to use current_cpu() (aka smp_processor_id) to queue the
delayed work item on the correct cpu.

Fixes: 7cf2b0f961 ("xfs: bound maximum wait time for inodegc work")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:16:05 +10:00
Darrick J. Wong
1bba82fe1a xfs: fix negative array access in xfs_getbmap
In commit 8ee81ed581, Ye Bin complained about an ASSERT in the bmapx
code that trips if we encounter a delalloc extent after flushing the
pagecache to disk.  The ioctl code does not hold MMAPLOCK so it's
entirely possible that a racing write page fault can create a delalloc
extent after the file has been flushed.  The proposed solution was to
replace the assertion with an early return that avoids filling out the
bmap recordset with a delalloc entry if the caller didn't ask for it.

At the time, I recall thinking that the forward logic sounded ok, but
felt hesitant because I suspected that changing this code would cause
something /else/ to burst loose due to some other subtlety.

syzbot of course found that subtlety.  If all the extent mappings found
after the flush are delalloc mappings, we'll reach the end of the data
fork without ever incrementing bmv->bmv_entries.  This is new, since
before we'd have emitted the delalloc mappings even though the caller
didn't ask for them.  Once we reach the end, we'll try to set
BMV_OF_LAST on the -1st entry (because bmv_entries is zero) and go
corrupt something else in memory.  Yay.

I really dislike all these stupid patches that fiddle around with debug
code and break things that otherwise worked well enough.  Nobody was
complaining that calling XFS_IOC_BMAPX without BMV_IF_DELALLOC would
return BMV_OF_DELALLOC records, and now we've gone from "weird behavior
that nobody cared about" to "bad behavior that must be addressed
immediately".

Maybe I'll just ignore anything from Huawei from now on for my own sake.

Reported-by: syzbot+c103d3808a0de5faaf80@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-xfs/20230412024907.GP360889@frogsfrogsfrogs/
Fixes: 8ee81ed581 ("xfs: fix BUG_ON in xfs_getbmap()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:15:01 +10:00
Darrick J. Wong
1f1397b721 xfs: don't allocate into the data fork for an unshare request
For an unshare request, we only have to take action if the data fork has
a shared mapping.  We don't care if someone else set up a cow operation.
If we find nothing in the data fork, return a hole to avoid allocating
space.

Note that fallocate will replace the delalloc reservation with an
unwritten extent anyway, so this has no user-visible effects outside of
avoiding unnecessary updates.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:51 +10:00
Darrick J. Wong
397b2d7e0f xfs: flush dirty data and drain directios before scrubbing cow fork
When we're scrubbing the COW fork, we need to take MMAPLOCK_EXCL to
prevent page_mkwrite from modifying any inode state.  The ILOCK should
suffice to avoid confusing online fsck, but let's take the same locks
that we do everywhere else.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:43 +10:00
Darrick J. Wong
8e698ee72c xfs: set bnobt/cntbt numrecs correctly when formatting new AGs
Through generic/300, I discovered that mkfs.xfs creates corrupt
filesystems when given these parameters:

# mkfs.xfs -d size=512M /dev/sda -f -d su=128k,sw=4 --unsupported
Filesystems formatted with --unsupported are not supported!!
meta-data=/dev/sda               isize=512    agcount=8, agsize=16352 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=130816, imaxpct=25
         =                       sunit=32     swidth=128 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=8192, version=2
         =                       sectsz=512   sunit=32 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
         =                       rgcount=0    rgsize=0 blks
Discarding blocks...Done.
# xfs_repair -n /dev/sda
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 16:30:50: zeroing log - 16320 of 16320 blocks done
        - scan filesystem freespace and inode maps...
agf_freeblks 25, counted 0 in ag 4
sb_fdblocks 8823, counted 8798

The root cause of this problem is the numrecs handling in
xfs_freesp_init_recs, which is used to initialize a new AG.  Prior to
calling the function, we set up the new bnobt block with numrecs == 1
and rely on _freesp_init_recs to format that new record.  If the last
record created has a blockcount of zero, then it sets numrecs = 0.

That last bit isn't correct if the AG contains the log, the start of the
log is not immediately after the initial blocks due to stripe alignment,
and the end of the log is perfectly aligned with the end of the AG.  For
this case, we actually formatted a single bnobt record to handle the
free space before the start of the (stripe aligned) log, and incremented
arec to try to format a second record.  That second record turned out to
be unnecessary, so what we really want is to leave numrecs at 1.

The numrecs handling itself is overly complicated because a different
function sets numrecs == 1.  Change the bnobt creation code to start
with numrecs set to zero and only increment it after successfully
formatting a free space extent into the btree block.

Fixes: f327a00745 ("xfs: account for log space when formatting new AGs")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:36 +10:00
Darrick J. Wong
b82a5c42a5 xfs: don't unconditionally null args->pag in xfs_bmap_btalloc_at_eof
xfs/170 on a filesystem with su=128k,sw=4 produces this splat:

BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP
CPU: 1 PID: 4022907 Comm: dd Tainted: G        W          6.3.0-xfsx #2 6ebeeffbe9577d32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-bu
RIP: 0010:xfs_perag_rele+0x10/0x70 [xfs]
RSP: 0018:ffffc90001e43858 EFLAGS: 00010217
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000100
RDX: ffffffffa054e717 RSI: 0000000000000005 RDI: 0000000000000000
RBP: ffff888194eea000 R08: 0000000000000000 R09: 0000000000000037
R10: ffff888100ac1cb0 R11: 0000000000000018 R12: 0000000000000000
R13: ffffc90001e43a38 R14: ffff888194eea000 R15: ffff888194eea000
FS:  00007f93d1a0e740(0000) GS:ffff88843fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 000000018a34f000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 xfs_bmap_btalloc+0x1a7/0x5d0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_allocate+0xee/0x470 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_bmapi_write+0x539/0x9e0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_iomap_write_direct+0x1bb/0x2b0 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_direct_write_iomap_begin+0x51c/0x710 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 iomap_iter+0x132/0x2f0
 __iomap_dio_rw+0x2f8/0x840
 iomap_dio_rw+0xe/0x30
 xfs_file_dio_write_aligned+0xad/0x180 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 xfs_file_write_iter+0xfb/0x190 [xfs f85291d6841cbb3dc740083f1f331c0327394518]
 vfs_write+0x2eb/0x410
 ksys_write+0x65/0xe0
 do_syscall_64+0x2b/0x80

This crash occurs under the "out_low_space" label.  We grabbed a perag
reference, passed it via args->pag into xfs_bmap_btalloc_at_eof, and
afterwards args->pag is NULL.  Fix the second function not to clobber
args->pag if the caller had passed one in.

Fixes: 8584332709 ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-05-02 09:14:27 +10:00
Linus Torvalds
06936aaf49 Some ext4 regression and bug fixes for -rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmROracACgkQ8vlZVpUN
 gaPqLQf/ZvzvspL4o3SNsHE/M2tKNBVY/z/vsfmAZwMgrGoK5qCkDsNA7c7+oUwE
 xjiHiVHOaYjJVWwkdODAwe7xNbWB6FoKptBaBi89fAyibMY/N7BZ8rad69NQTvyc
 JbKjorvEBc+qgsUEt2+ZpMogN9KHlVh3NJwlovesmucQtg2gWLKs8wrxW2bC7uAh
 2uR9GWUnhDrs6jHbjHkG3/lgB0aS0StLRxfsbchjZvCsniTDZymLmmgkA1ln17ce
 6iRg2ESjYUryPX09YFtUuQVvObtUTM+z8DzwyQuAJ4VfmdoPA4L6mpdqzPGFuKQc
 gJrLSB8VZJDvPoGjaHZ+Qdl1tHlFRw==
 =2SEf
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Some ext4 regression and bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: clean up error handling in __ext4_fill_super()
  ext4: reflect error codes from ext4_multi_mount_protect() to its callers
  ext4: fix lost error code reporting in __ext4_fill_super()
  ext4: fix unused iterator variable warnings
  ext4: fix use-after-free read in ext4_find_extent for bigalloc + inline
  ext4: fix i_disksize exceeding i_size problem in paritally written case
2023-05-01 11:00:04 -07:00
Linus Torvalds
26c009dffc 11 smb3 client fixes, mostly cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmROg4sACgkQiiy9cAdy
 T1FeNgv/S/dFaQ9RXDGp0AsO9aDUwKPMZWdZVgPtnktQF5icTI7CrYn3R2KrA6i2
 +a27pSWsefF1/RpRIGm5n0AFkEgRaClqxWIzM7VBXWtsR5oFA5GoyYzOk206qAvl
 CTvpS7Kuf091UG8NoOVqmM+AtSE8tEx4itDbh7wS9HeApoxiZKPJvblzaiCAzEeR
 mc+ehfTocUy+1UZh8xZB/epl0xHAVUr845zIkVZXE2HBQCSni/5ywPIHc3xyAQXJ
 6a5sEYi0e3wQ9457zS6POW3rMXys2ZanYlEfy6guGcfCAX6PsPt5Yl+sJtdMw08k
 XB9qJkGg111kLKncM38Ju5R1QHYCOj/tOC7gjleNhHWs/iHclMFrDrA/ZYSzibd4
 USVQpLRCjFFAwvKj/LTVmPIRw60fr3lf4n4maQGLqJCHXQkO/+Z4q/UEBqslXrot
 Y1c4+ALqJRQvMe591hCsN/uDV7S9ETy2BRePBbLyokcwji8i9PyJ+4XYONmngVyx
 OuB2KeAE
 =4iMh
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:

 - deferred close fix for an important case when cached file should be
   closed immediately

 - two fixes for missing locks

 - eight minor cleanup

* tag '6.4-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko
  smb3: move some common open context structs to smbfs_common
  smb3: make query_on_disk_id open context consistent and move to common code
  SMB3.1.1: add new tree connect ShareFlags
  cifs: missing lock when updating session status
  SMB3: Close deferred file handles in case of handle lease break
  SMB3: Add missing locks to protect deferred close file list
  cifs: Avoid a cast in add_lease_context()
  cifs: Simplify SMB2_open_init()
  cifs: Simplify SMB2_open_init()
  cifs: Simplify SMB2_open_init()
2023-05-01 10:43:44 -07:00
David Howells
db099c625b rxrpc: Fix timeout of a call that hasn't yet been granted a channel
afs_make_call() calls rxrpc_kernel_begin_call() to begin a call (which may
get stalled in the background waiting for a connection to become
available); it then calls rxrpc_kernel_set_max_life() to set the timeouts -
but that starts the call timer so the call timer might then expire before
we get a connection assigned - leading to the following oops if the call
stalled:

	BUG: kernel NULL pointer dereference, address: 0000000000000000
	...
	CPU: 1 PID: 5111 Comm: krxrpcio/0 Not tainted 6.3.0-rc7-build3+ #701
	RIP: 0010:rxrpc_alloc_txbuf+0xc0/0x157
	...
	Call Trace:
	 <TASK>
	 rxrpc_send_ACK+0x50/0x13b
	 rxrpc_input_call_event+0x16a/0x67d
	 rxrpc_io_thread+0x1b6/0x45f
	 ? _raw_spin_unlock_irqrestore+0x1f/0x35
	 ? rxrpc_input_packet+0x519/0x519
	 kthread+0xe7/0xef
	 ? kthread_complete_and_exit+0x1b/0x1b
	 ret_from_fork+0x22/0x30

Fix this by noting the timeouts in struct rxrpc_call when the call is
created.  The timer will be started when the first packet is transmitted.

It shouldn't be possible to trigger this directly from userspace through
AF_RXRPC as sendmsg() will return EBUSY if the call is in the
waiting-for-conn state if it dropped out of the wait due to a signal.

Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-01 07:43:19 +01:00
Christophe JAILLET
db2993a423 ceph: reorder fields in 'struct ceph_snapid_map'
Group some variables based on their sizes to reduce holes.
On x86_64, this shrinks the size of 'struct ceph_snapid_map' from 72 to 64
bytes.

When such a structure is allocated, because of the way memory allocation
works, when 72 bytes were requested, 96 bytes were allocated.

So, on x86_64, this change saves 32 bytes per allocation and has the
structure fit in a single cacheline.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30 12:37:28 +02:00
Xiubo Li
a5ffd7b6e9 ceph: pass ino# instead of old_dentry if it's disconnected
When exporting the kceph to NFS it may pass a DCACHE_DISCONNECTED
dentry for the link operation. Then it will parse this dentry as a
snapdir, and the mds will fail the link request as -EROFS.

MDS allow clients to pass a ino# instead of a path.

Link: https://tracker.ceph.com/issues/59515
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30 12:37:28 +02:00
Xiubo Li
aaf67de788 ceph: fix potential use-after-free bug when trimming caps
When trimming the caps and just after the 'session->s_cap_lock' is
released in ceph_iterate_session_caps() the cap maybe removed by
another thread, and when using the stale cap memory in the callbacks
it will trigger use-after-free crash.

We need to check the existence of the cap just after the 'ci->i_ceph_lock'
being acquired. And do nothing if it's already removed.

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/43272
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30 12:37:28 +02:00
Xiubo Li
7d41870d65 ceph: implement writeback livelock avoidance using page tagging
While the mapped IOs continue if we try to flush a file's buffer
we can see that the fsync() won't complete until the IOs finish.

This is analogous to Jan Kara's commit (f446daaea9 mm: implement
writeback livelock avoidance using page tagging), we will try to
avoid livelocks of writeback when some steadily creates dirty pages
in a mapping we are writing out.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30 12:37:28 +02:00
Xiubo Li
7a6c3a035a ceph: do not print the whole xattr value if it's too long
If the xattr's value size is long enough the kernel will warn and
then will fail the xfstests test case.

Just print part of the value string if it's too long.

At the same time fix the function name issue in the debug logs.

Link: https://tracker.ceph.com/issues/58404
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2023-04-30 12:37:28 +02:00
Linus Torvalds
1ae78a1451 five ksmbd server fixes, and new lock_rename_child VFS helper routine
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmRJ1QwACgkQiiy9cAdy
 T1FoMwv/dQv3nlq7CFW/p82xRGQg1rqE/bLw8iAz2kl2QMNI32mnNZ5vsnAOx5CT
 CXUsdfQTgVsdKQ0yAV0WIAopxOxf4ASmKn8Ej1wJBEREFuwqQUgFE16DALKDPhcW
 77cuPqFEGD1CfbOvM+IjSN1OlDmsu2a3qSuD1OfTxC14s3+mPZitGpSm06PMsaT6
 C17RG6dC+PJmzXOT79OGfPmP1e6103abhv7PUdy7tZdiHi3I2ESXvCUGAoeFzjG9
 i4vxM+WdfStun0sXkZysS+iiyajfbFF8cOkaXp4QG+aplDD5Z2gjV6w+n3phegiX
 N1QvTZ3C8dQEhvmy40DGyLgj2A3/dFqh5bEfA3Mtni2NqEHKG4ng2dQsmkMiNg0M
 iuCHwLUz2kgvc+zl4+HJpKDVF0HG3RHAlzRq0oNEw+lQApzfg/Q+p5ZcZUBjcoSj
 n4fxxNO4bggHmCBHZNDL7VNG81m0YEg8xnDpBPvwtLMO1yEj8RFwW+RTfIwqyOYE
 VK1dcRc1
 =DweL
 -----END PGP SIGNATURE-----

Merge tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server updates from Steve French:

 - SMB3.1.1 negotiate context fixes and cleanup

 - new lock_rename_child VFS helper

 - ksmbd fix to avoid unlink race and to use the new VFS helper to avoid
   rename race

* tag '6.4-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix racy issue from using ->d_parent and ->d_name
  ksmbd: remove unused compression negotiate ctx packing
  ksmbd: avoid duplicate negotiate ctx offset increments
  ksmbd: set NegotiateContextCount once instead of every inc
  fs: introduce lock_rename_child() helper
  ksmbd: remove internal.h include
2023-04-29 11:10:39 -07:00
Linus Torvalds
4e1c80ae5c NFSD 6.4 Release Notes
The big ticket item for this release is support for RPC-with-TLS
 [RFC 9289] has been added to the Linux NFS server. The goal is to
 provide a simple-to-deploy, low-overhead in-transit confidentiality
 and peer authentication mechanism. It can supplement NFS Kerberos
 and it can protect the use of legacy non-cryptographic user
 authentication flavors such as AUTH_SYS. The TLS Record protocol is
 handled entirely by kTLS, meaning it can use either software
 encryption or offload encryption to smart NICs.
 
 Work continues on improving NFSD's open file cache. Among the many
 clean-ups in that area is a patch to convert the rhashtable to use
 the list-hashing version of that data structure.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmRK/JMACgkQM2qzM29m
 f5cF5A/+JZFRSPlfSYt0YHzUQQSDdYn5n/IG9TwJQd62xheu083WuKRaCOYYoOhg
 06nZd6p7nuF1E0n2ZWOKSE6YkBSE6z4M6KrQlm6lCe/nmxYCR87IYfJCXuL+Yf0e
 /LdL4OTvDHzY5ec1DreERldPIUJ8GFzwChH8/z4XwbNDR7qJkF/gf8YxpFr+8K+j
 Cfyl8woZiEze/Nvxy1YtAqa7HMEpitt0aWJN55rHwTh9c3b0nmDzziYFcVqXgybJ
 /qUHfHBak66ll8RqhcQ3BMuyfszwASERbPsaZ2a5F/RaxLL5ZWfFyhgQwm+PZWT+
 J5DdSBwLEQYtKQGD41A1aorP6v/u2QelfWrl4S7/qjRpREp8Ba2IU4fYLjGb1499
 Imk68BA7NwFp87tdMi/7en1VVgina4U/S3X71aUYWe+C0g48BfTrVwq4SVbQSAo4
 1638vbZnrJbsJMr9OaaysKWfv4KZB020Ji1KVwuqmgy5F8kdfJCCQ2UR/fHuJ3DY
 R0Zrd1Ryjwr83viP+Xj0ERiW405gPdCT0RJqoA7rznRPCqT5M42tf5z65uO7iZeE
 C1udgDaoQOtioKlem6FcDXLkryf986slGA7V91lat/Jt8A5jLKQfjVe3Q+kaaqXP
 ka1DQnYelzMzILQQs39cqW5pShrH8e3tfRZ7JhdBgrpxVXz9ZZM=
 =lA2+
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd updates from Chuck Lever:
 "The big ticket item for this release is that support for RPC-with-TLS
  [RFC 9289] has been added to the Linux NFS server.

  The goal is to provide a simple-to-deploy, low-overhead in-transit
  confidentiality and peer authentication mechanism. It can supplement
  NFS Kerberos and it can protect the use of legacy non-cryptographic
  user authentication flavors such as AUTH_SYS. The TLS Record protocol
  is handled entirely by kTLS, meaning it can use either software
  encryption or offload encryption to smart NICs.

  Aside from that, work continues on improving NFSD's open file cache.
  Among the many clean-ups in that area is a patch to convert the
  rhashtable to use the list-hashing version of that data structure"

* tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits)
  NFSD: Handle new xprtsec= export option
  SUNRPC: Support TLS handshake in the server-side TCP socket code
  NFSD: Clean up xattr memory allocation flags
  NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
  SUNRPC: Clear rq_xid when receiving a new RPC Call
  SUNRPC: Recognize control messages in server-side TCP socket code
  SUNRPC: Be even lazier about releasing pages
  SUNRPC: Convert svc_xprt_release() to the release_pages() API
  SUNRPC: Relocate svc_free_res_pages()
  nfsd: simplify the delayed disposal list code
  SUNRPC: Ignore return value of ->xpo_sendto
  SUNRPC: Ensure server-side sockets have a sock->file
  NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
  sunrpc: simplify two-level sysctl registration for svcrdma_parm_table
  SUNRPC: return proper error from get_expiry()
  lockd: add some client-side tracepoints
  nfs: move nfs_fhandle_hash to common include file
  lockd: server should unlock lock if client rejects the grant
  lockd: fix races in client GRANTED_MSG wait logic
  lockd: move struct nlm_wait to lockd.h
  ...
2023-04-29 11:04:14 -07:00
Linus Torvalds
0127f25b5d NFS Client Updates for Linux 6.4
New Features:
   * Convert the readdir path to use folios
   * Convert the NFS fscache code to use netfs
 
 Bugfixes and Cleanups:
   * Always send a RECLAIM_COMPLETE after establishing a lease
   * Simplify sysctl registrations and other cleanups
   * Handle out-of-order write replies on NFS v3
   * Have sunrpc call_bind_status use standard hard/soft task semantics
   * Other minor cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmRMI04ACgkQ18tUv7Cl
 QOuCNQ//SkQm8aOM4DkYFeDIObye6xMzgtWrB25grYNG4a/DcYqb5kNcbmI5l1tE
 Tus8KMZAWSpwa0m8ALctzp+pZQWQkY/svsqqHrKIGUHBI8F0OinVCqc2MzNN75WX
 m/1wELW6ek9RBL5BoJtAPt+Qu8/jP6KD64Zot7snBeUrzreaZDcz0HM+EcQhi7X7
 qd5XS0/cA2eLEBBQcQdFpRhHvgW12BMYM/zp3/ER5H52L2iAlZunGWw+Nqs8ueOR
 D7K2+CF1sV1k6hYbLWNoaF2J6PZr5dRpc6gSq4fLP4WUKjqQwmQp8cm9iLpf6jGa
 a+Y7t8aj7vup8jVCVGWYWZA2G2gi6jWmxxWudkJwfAa1E45t1B4/C0udwlxR20OO
 XI2Bhe5YwTURgSOvOS9QTZJpQN4qfpEL0NoAmAT5fAHBQ2CXDrMlSIxPS7U6LO9q
 YqwIHcAHvYVnbD45IUh2Zjbp65mRb1VkU6WzOyK1/sNHEyYpubIWXB/yLaA3oGge
 V3xUgvlTzLVzzyQfwiRfzAD1P5/USaXE/B36c4itfCr5rJnAfsiBP3gk0o9yq18J
 3Yb6olrmc9CzeA7PN88uEus4VZHbaE9OktRFIjJ22jlLQEY4xougdE5asY1XX8F+
 OKLLLeeCrsbvrANB9XcLVsLqdMYvsd0VaCX9HtN3UP+7Lod5T10=
 =gpBC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "New Features:

   - Convert the readdir path to use folios

   - Convert the NFS fscache code to use netfs

  Bugfixes and Cleanups:

   - Always send a RECLAIM_COMPLETE after establishing a lease

   - Simplify sysctl registrations and other cleanups

   - Handle out-of-order write replies on NFS v3

   - Have sunrpc call_bind_status use standard hard/soft task semantics

   - Other minor cleanups"

* tag 'nfs-for-6.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4.2: Rework scratch handling for READ_PLUS
  NFS: Cleanup unused rpc_clnt variable
  NFS: set varaiable nfs_netfs_debug_id storage-class-specifier to static
  SUNRPC: remove the maximum number of retries in call_bind_status
  NFS: Convert readdir page array functions to use a folio
  NFS: Convert the readdir array-of-pages into an array-of-folios
  NFSv3: handle out-of-order write replies.
  NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
  NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API
  NFS: Convert buffered read paths to use netfs when fscache is enabled
  NFS: Configure support for netfs when NFS fscache is configured
  NFS: Rename readpage_async_filler to nfs_read_add_folio
  sunrpc: simplify one-level sysctl registration for debug_table
  sunrpc: move sunrpc_table and proc routines above
  sunrpc: simplify one-level sysctl registration for xs_tunables_table
  sunrpc: simplify one-level sysctl registration for xr_tunables_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  NFSv4.1: Always send a RECLAIM_COMPLETE after establishing lease
2023-04-29 10:58:44 -07:00
Linus Torvalds
1e098dec61 driver ntfs3 for linux 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmRLZc0ACgkQqbAzH4Mk
 B7YWkA/+JHsryYjInVGmYfQ4oCmq1F2cEDgKdTeths+IZ+2vbk03hUbvmLBg8YR3
 VvYZ2YqNAs9HuZxP6mnVZB9YOMIFEdV3Pyto68l9jwip4irpjVrABtxwn6udTAmx
 RXVDDLRxN53UyoWgRoOfAEJUXfmOvw7laj8T9PrsKxXKhlOKmK8mSKtq8xTt+ECo
 8cJ0Nw2d0TVY6+Ou6CCfH8CzHtQjInHnKSt/KynJ+OJHxLmHRGlbDJbyjuJU0OBr
 grXngKPmqvHTiO3Zs14gRv5tFXNMGdvRcomy/XnztD/nIC2jEODI2uUCSedEh5vf
 IpzjKwQF1qRseGHwr2U0THPHOso4IP2T79WRQEuLY8DUGJClIGGmcdfeBqNnjbey
 1GVUis/leBvQb1CMAkX5HjGXO9i6xEfuTxwHSlk+Wu8euEDQ8OWudyeQmOYiHnqS
 vKZxXY88DDjATSokSUSSb8IW63z+JEPmovsmLKLpWusvWikIKhIEnjRZUG5XhgS3
 Ux66Owt9yguKgVAPX66b9PGQx4wy3GAR6FRxG1BpDX0XooGbzGRTAn213sjkBKiV
 JLswbQTJ5LOXv8bS1srNMPhQFeqFcEgrDbC+WxmOpzTvutimYmzUTDrOPq92Xw0f
 oi/i5kCoPFmOfoRsLhIl+lft1XLZ5iYFvj0faDJ5AWUOjb73Z3s=
 =xlIh
 -----END PGP SIGNATURE-----

Merge tag 'ntfs3_for_6.4' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 updates from Konstantin Komarov:
 "New code:

   - add missed "nocase" in ntfs_show_options

   - extend information on failures/errors

   - small optimizations

  Fixes:

   - some logic errors

   - some dead code was removed

   - code is refactored and reformatted according to the new version of
     clang-format

  Code removal:

   - 'noacsrules' option.

     Currently, this option does not work properly, and its use leads to
     unstable results. If we figure out how to implement it without
     errors, we will add it later

   - writepage"

* tag 'ntfs3_for_6.4' of https://github.com/Paragon-Software-Group/linux-ntfs3: (30 commits)
  fs/ntfs3: Fix root inode checking
  fs/ntfs3: Print details about mount fails
  fs/ntfs3: Add missed "nocase" in ntfs_show_options
  fs/ntfs3: Code formatting and refactoring
  fs/ntfs3: Changed ntfs_get_acl() to use dentry
  fs/ntfs3: Remove field sbi->used.bitmap.set_tail
  fs/ntfs3: Undo critial modificatins to keep directory consistency
  fs/ntfs3: Undo endian changes
  fs/ntfs3: Optimization in ntfs_set_state()
  fs/ntfs3: Fix ntfs_create_inode()
  fs/ntfs3: Remove noacsrules
  fs/ntfs3: Use bh_read to simplify code
  fs/ntfs3: Fix a possible null-pointer dereference in ni_clear()
  fs/ntfs3: Refactoring of various minor issues
  fs/ntfs3: Restore overflow checking for attr size in mi_enum_attr
  fs/ntfs3: Check for extremely large size of $AttrDef
  fs/ntfs3: Improved checking of attribute's name length
  fs/ntfs3: Add null pointer checks
  fs/ntfs3: fix spelling mistake "attibute" -> "attribute"
  fs/ntfs3: Add length check in indx_get_root
  ...
2023-04-29 10:52:37 -07:00
Linus Torvalds
56c455b38d xfs: New code for 6.4
o Added detailed design documentation for the upcoming online repair feature
 o major update to online scrub to complete the reverse mapping cross-referencing
   infrastructure enabling us to fully validate allocated metadata against owner
   records. This is the last piece of scrub infrastructure needed before we can
   start merging online repair functionality.
 o Fixes for the ascii-ci hashing issues
 o deprecation of the ascii-ci functionality
 o on-disk format verification bug fixes
 o various random bug fixes for syzbot and other bug reports
 
 Signed-off-by: Dave Chinner <david@fromorbit.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEmJOoJ8GffZYWSjj/regpR/R1+h0FAmRMTZ8UHGRhdmlkQGZy
 b21vcmJpdC5jb20ACgkQregpR/R1+h3XtA//bZYjsYRU3hzyGLKee++5t/zbiqZB
 KWw8zuPEdEsSAPphK4DQYO7XPWetgFh8iBU39M8TM0+g5YacjzBLGADjQiEv7naN
 IxSoElQQzZbvMcUPOnuRaoopn0v7pbWIDRo3hKWaRDKCrnMGOnTvDFuC/VX0RAbn
 GzPimbuvaYJPXTnWTwsKeAuVYP4HLdTh2R1gUMjyY80Ed08hxhCzrXSvjEtuxOOy
 tDk50wJUhgx7UTgFBsXug1wXLCYwDFvAUjpsBKnmq+vSl0MpI3TdCetmSQbuvAeu
 gvkRyBMOcqcY5rlozcKPpyXwy7I0ftXOY4xpUSW8H9tAx0oVImkC69DsAjotQV0r
 r6vEtcw7LgmaS9kbA6G2Z4JfKEHuf2d/6OI4onZh25b5SWq7+qFBPo67AoFg8UQf
 bKSf3QQNVLTyWqpRf8Z3XOEBygYGsDUuxrm2AA5Aar4t4T3y5oAKFKkf4ZAlAYxH
 KViQsq0qVcoQ4k4txZgU7XQrftKyu2csqxqtKDozH7FutxscchZEwvjdQ6jnS2+L
 2Qlf6On8edfEkPKzF7/1cgxUXCXuTqakFVetChXjZ1/ZFt9LUYphvESdaolJ8Aqz
 lwEy5UrbC2oMrBDT7qESLWs3U66mPhRaaFfuLUJRyhHN3Y0tVVA2mgNzyD6oBQVy
 ffIbZ3+1QEPOaOQ=
 =lBJy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Dave Chinner:
 "This consists mainly of online scrub functionality and the design
  documentation for the upcoming online repair functionality built on
  top of the scrub code:

   - Added detailed design documentation for the upcoming online repair
     feature

   - major update to online scrub to complete the reverse mapping
     cross-referencing infrastructure enabling us to fully validate
     allocated metadata against owner records. This is the last piece of
     scrub infrastructure needed before we can start merging online
     repair functionality.

   - Fixes for the ascii-ci hashing issues

   - deprecation of the ascii-ci functionality

   - on-disk format verification bug fixes

   - various random bug fixes for syzbot and other bug reports"

* tag 'xfs-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (107 commits)
  xfs: fix livelock in delayed allocation at ENOSPC
  xfs: Extend table marker on deprecated mount options table
  xfs: fix duplicate includes
  xfs: fix BUG_ON in xfs_getbmap()
  xfs: verify buffer contents when we skip log replay
  xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done
  xfs: remove WARN when dquot cache insertion fails
  xfs: don't consider future format versions valid
  xfs: deprecate the ascii-ci feature
  xfs: test the ascii case-insensitive hash
  xfs: stabilize the dirent name transformation function used for ascii-ci dir hash computation
  xfs: cross-reference rmap records with refcount btrees
  xfs: cross-reference rmap records with inode btrees
  xfs: cross-reference rmap records with free space btrees
  xfs: cross-reference rmap records with ag btrees
  xfs: introduce bitmap type for AG blocks
  xfs: convert xbitmap to interval tree
  xfs: drop the _safe behavior from the xbitmap foreach macro
  xfs: don't load local xattr values during scrub
  xfs: remove the for_each_xbitmap_ helpers
  ...
2023-04-29 10:44:27 -07:00
Linus Torvalds
bedf149527 New code for 6.4:
* Remove an unused symbol.
  * Add tracepoints for the directio code.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZEKzuQAKCRBKO3ySh0YR
 pothAQD9sBm7//+vYXxQXPlmX09Jvxixnlwyth+PvUI2Al3mrgEA0h1ZSRhxBbxx
 QiIFXCQYckb9GTIcRd67lCp1j/Ie/g0=
 =vGbm
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "The only changes for this cycle are the addition of tracepoints to the
  iomap directio code so that Ritesh (who is working on porting ext2 to
  iomap) can observe the io flows more easily.

  Summary:

   - Remove an unused symbol

   - Add tracepoints for the directio code"

* tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Add DIO tracepoints
  iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
  fs.h: Add TRACE_IOCB_STRINGS for use in trace points
2023-04-29 10:35:48 -07:00
Steve French
9be11a6931 cifs: update internal module version number for cifs.ko
From 2.42 to 2.43

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-28 22:50:42 -05:00
Steve French
2fe187dca6 smb3: move some common open context structs to smbfs_common
create durable and create durable reconnect context and the maximal
access create context struct definitions can be put in common code in
smbfs_common

Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-28 22:50:32 -05:00
Steve French
1149c8467d smb3: make query_on_disk_id open context consistent and move to common code
cifs and ksmbd were using a slightly different version of the query_on_disk_id
struct which could be confusing. Use the ksmbd version of this struct, and
move it to common code.

Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-28 22:50:16 -05:00
Steve French
c09ba02cfa SMB3.1.1: add new tree connect ShareFlags
Also update these flag names in a few places to match the simpler
easier to understand names now used in the protocol documentation
(see MS-SMB2 section 2.2.10)

Acked-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-28 21:24:52 -05:00
Steve French
943fb67b09 cifs: missing lock when updating session status
Coverity noted a place where we were not grabbing
the ses_lock when setting (and checking) ses_status.

Addresses-Coverity: 1536833 ("Data race condition (MISSING_LOCK)")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-28 21:23:42 -05:00
Linus Torvalds
d579c468d7 tracing updates for 6.4:
- User events are finally ready!
   After lots of collaboration between various parties, we finally locked
   down on a stable interface for user events that can also work with user
   space only tracing. This is implemented by telling the kernel (or user
   space library, but that part is user space only and not part of this
   patch set), where the variable is that the application uses to know if
   something is listening to the trace. There's also an interface to tell
   the kernel about these events, which will show up in the
   /sys/kernel/tracing/events/user_events/ directory, where it can be
    enabled. When it's enabled, the kernel will update the variable, to tell
   the application to start writing to the kernel.
   See https://lwn.net/Articles/927595/
 
 - Cleaned up the direct trampolines code to simplify arm64 addition of
   direct trampolines. Direct trampolines use the ftrace interface but
   instead of jumping to the ftrace trampoline, applications (mostly BPF)
   can register their own trampoline for performance reasons.
 
 - Some updates to the fprobe infrastructure. fprobes are more efficient than
   kprobes, as it does not need to save all the registers that kprobes on
   ftrace do. More work needs to be done before the fprobes will be exposed
   as dynamic events.
 
 - More updates to references to the obsolete path of
   /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.
 
 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line
   by line instead of all at once. There's users in production kernels that
   have a large data dump that originally used printk() directly, but the
   data dump was larger than what printk() allowed as a single print.
   Using seq_buf() to do the printing fixes that.
 
 - Add /sys/kernel/tracing/touched_functions that shows all functions that
   was every traced by ftrace or a direct trampoline. This is used for
   debugging issues where a traced function could have caused a crash by
   a bpf program or live patching.
 
 - Add a "fields" option that is similar to "raw" but outputs the fields of
   the events. It's easier to read by humans.
 
 - Some minor fixes and clean ups.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZEr36xQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quZHAQCzuqnn2S8DsPd3Sy1vKIYaj0uajW5D
 Kz1oUJH4F0H7kgEA8XwXkdtfKpOXWc/ZH4LWfL7Orx2wJZJQMV9dVqEPDAE=
 =w0Z1
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing updates from Steven Rostedt:

 - User events are finally ready!

   After lots of collaboration between various parties, we finally
   locked down on a stable interface for user events that can also work
   with user space only tracing.

   This is implemented by telling the kernel (or user space library, but
   that part is user space only and not part of this patch set), where
   the variable is that the application uses to know if something is
   listening to the trace.

   There's also an interface to tell the kernel about these events,
   which will show up in the /sys/kernel/tracing/events/user_events/
   directory, where it can be enabled.

   When it's enabled, the kernel will update the variable, to tell the
   application to start writing to the kernel.

   See https://lwn.net/Articles/927595/

 - Cleaned up the direct trampolines code to simplify arm64 addition of
   direct trampolines.

   Direct trampolines use the ftrace interface but instead of jumping to
   the ftrace trampoline, applications (mostly BPF) can register their
   own trampoline for performance reasons.

 - Some updates to the fprobe infrastructure. fprobes are more efficient
   than kprobes, as it does not need to save all the registers that
   kprobes on ftrace do. More work needs to be done before the fprobes
   will be exposed as dynamic events.

 - More updates to references to the obsolete path of
   /sys/kernel/debug/tracing for the new /sys/kernel/tracing path.

 - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer
   line by line instead of all at once.

   There are users in production kernels that have a large data dump
   that originally used printk() directly, but the data dump was larger
   than what printk() allowed as a single print.

   Using seq_buf() to do the printing fixes that.

 - Add /sys/kernel/tracing/touched_functions that shows all functions
   that was every traced by ftrace or a direct trampoline. This is used
   for debugging issues where a traced function could have caused a
   crash by a bpf program or live patching.

 - Add a "fields" option that is similar to "raw" but outputs the fields
   of the events. It's easier to read by humans.

 - Some minor fixes and clean ups.

* tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits)
  ring-buffer: Sync IRQ works before buffer destruction
  tracing: Add missing spaces in trace_print_hex_seq()
  ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus
  recordmcount: Fix memory leaks in the uwrite function
  tracing/user_events: Limit max fault-in attempts
  tracing/user_events: Prevent same address and bit per process
  tracing/user_events: Ensure bit is cleared on unregister
  tracing/user_events: Ensure write index cannot be negative
  seq_buf: Add seq_buf_do_printk() helper
  tracing: Fix print_fields() for __dyn_loc/__rel_loc
  tracing/user_events: Set event filter_type from type
  ring-buffer: Clearly check null ptr returned by rb_set_head_page()
  tracing: Unbreak user events
  tracing/user_events: Use print_format_fields() for trace output
  tracing/user_events: Align structs with tabs for readability
  tracing/user_events: Limit global user_event count
  tracing/user_events: Charge event allocs to cgroups
  tracing/user_events: Update documentation for ABI
  tracing/user_events: Use write ABI in example
  tracing/user_events: Add ABI self-test
  ...
2023-04-28 15:57:53 -07:00
Anna Schumaker
fbd2a05f29 NFSv4.2: Rework scratch handling for READ_PLUS
Instead of using a tiny, static scratch buffer, we should use a kmalloc()-ed
buffer that is allocated when checking for read plus usage. This lets us
use the buffer before decoding any part of the READ_PLUS operation
instead of setting it right before segment decoding, meaning it should
be a little more robust.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-04-28 15:48:45 -04:00
Eric Van Hensbergen
21e26d5e54
fs/9p: Fix bit operation logic error
This re-introduces a fix that somehow got dropped during rebase of the
current series in for-next.  When writeback is enabled, opens
are forced to support both read and write operations but with the
logic error other flags may be dropped unintentionaly.

Reported-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2023-04-28 16:59:26 +00:00
Theodore Ts'o
d4fab7b28e ext4: clean up error handling in __ext4_fill_super()
There were two ways to return an error code; one was via setting the
'err' variable, and the second, if err was zero, was via the 'ret'
variable.  This was both confusing and fragile, and when code was
factored out of __ext4_fill_super(), some of the error codes returned
by the original code was replaced by -EINVAL, and in one case, the
error code was placed by 0, triggering a kernel null pointer
dereference.

Clean this up by removing the 'ret' variable, leaving only one way to
set the error code to be returned, and restore the errno codes that
were returned via the the mount system call as they were before we
started refactoring __ext4_fill_super().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
2023-04-28 12:56:40 -04:00
Theodore Ts'o
3b50d5018e ext4: reflect error codes from ext4_multi_mount_protect() to its callers
This will allow more fine-grained errno codes to be returned by the
mount system call.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-28 12:56:40 -04:00
Theodore Ts'o
d5e72c4e32 ext4: fix lost error code reporting in __ext4_fill_super()
When code was factored out of __ext4_fill_super() into
ext4_percpu_param_init() the error return was discarded.  This meant
that it was possible for __ext4_fill_super() to return zero,
indicating success, without the struct super getting completely filled
in, leading to a potential NULL pointer dereference.

Reported-by: syzbot+bbf0f9a213c94f283a5c@syzkaller.appspotmail.com
Fixes: 1f79467c8a ("ext4: factor out ext4_percpu_param_init() ...")
Link: https://syzkaller.appspot.com/bug?id=6dac47d5e58af770c0055f680369586ec32e144c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
2023-04-28 12:56:40 -04:00
Nathan Chancellor
856dd6c598 ext4: fix unused iterator variable warnings
When CONFIG_QUOTA is disabled, there are warnings around unused iterator
variables:

  fs/ext4/super.c: In function 'ext4_put_super':
  fs/ext4/super.c:1262:13: error: unused variable 'i' [-Werror=unused-variable]
   1262 |         int i, err;
        |             ^
  fs/ext4/super.c: In function '__ext4_fill_super':
  fs/ext4/super.c:5200:22: error: unused variable 'i' [-Werror=unused-variable]
   5200 |         unsigned int i;
        |                      ^
  cc1: all warnings being treated as errors

The kernel has updated to GNU11, allowing the variables to be declared
within the for loop.  Do so to clear up the warnings.

Fixes: dcbf87589d ("ext4: factor out ext4_flex_groups_free()")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230420-ext4-unused-variables-super-c-v1-1-138b6db6c21c@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-28 12:56:40 -04:00
Ye Bin
835659598c ext4: fix use-after-free read in ext4_find_extent for bigalloc + inline
Syzbot found the following issue:
loop0: detected capacity change from 0 to 2048
EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 without journal. Quota mode: none.
==================================================================
BUG: KASAN: use-after-free in ext4_ext_binsearch_idx fs/ext4/extents.c:768 [inline]
BUG: KASAN: use-after-free in ext4_find_extent+0x76e/0xd90 fs/ext4/extents.c:931
Read of size 4 at addr ffff888073644750 by task syz-executor420/5067

CPU: 0 PID: 5067 Comm: syz-executor420 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 ext4_ext_binsearch_idx fs/ext4/extents.c:768 [inline]
 ext4_find_extent+0x76e/0xd90 fs/ext4/extents.c:931
 ext4_clu_mapped+0x117/0x970 fs/ext4/extents.c:5809
 ext4_insert_delayed_block fs/ext4/inode.c:1696 [inline]
 ext4_da_map_blocks fs/ext4/inode.c:1806 [inline]
 ext4_da_get_block_prep+0x9e8/0x13c0 fs/ext4/inode.c:1870
 ext4_block_write_begin+0x6a8/0x2290 fs/ext4/inode.c:1098
 ext4_da_write_begin+0x539/0x760 fs/ext4/inode.c:3082
 generic_perform_write+0x2e4/0x5e0 mm/filemap.c:3772
 ext4_buffered_write_iter+0x122/0x3a0 fs/ext4/file.c:285
 ext4_file_write_iter+0x1d0/0x18f0
 call_write_iter include/linux/fs.h:2186 [inline]
 new_sync_write fs/read_write.c:491 [inline]
 vfs_write+0x7dc/0xc50 fs/read_write.c:584
 ksys_write+0x177/0x2a0 fs/read_write.c:637
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f4b7a9737b9
RSP: 002b:00007ffc5cac3668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4b7a9737b9
RDX: 00000000175d9003 RSI: 0000000020000200 RDI: 0000000000000004
RBP: 00007f4b7a933050 R08: 0000000000000000 R09: 0000000000000000
R10: 000000000000079f R11: 0000000000000246 R12: 00007f4b7a9330e0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Above issue is happens when enable bigalloc and inline data feature. As
commit 131294c35e fixed delayed allocation bug in ext4_clu_mapped for
bigalloc + inline. But it only resolved issue when has inline data, if
inline data has been converted to extent(ext4_da_convert_inline_data_to_extent)
before writepages, there is no EXT4_STATE_MAY_INLINE_DATA flag. However
i_data is still store inline data in this scene. Then will trigger UAF
when find extent.
To resolve above issue, there is need to add judge "ext4_has_inline_data(inode)"
in ext4_clu_mapped().

Fixes: 131294c35e ("ext4: fix delayed allocation bug in ext4_clu_mapped for bigalloc + inline")
Reported-by: syzbot+bf4bb7731ef73b83a3b4@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Tested-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20230406111627.1916759-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-28 12:56:35 -04:00
Linus Torvalds
22b8cc3e78 Add support for new Linear Address Masking CPU feature. This is similar
to ARM's Top Byte Ignore and allows userspace to store metadata in some
 bits of pointers without masking it out before use.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRK/WIACgkQaDWVMHDJ
 krAL+RAAw33EhsWyYVkeAtYmYBKkGvlgeSDULtfJKe5bynJBTHkGKfM6RE9MSJIt
 5fHWaConGh8HNpy0Us1sDvd/aWcWRm5h7ZcCVD+R4qrgh/vc7ULzM+elXe5jzr4W
 cyuTckF2eW6SVrYg6fH5q+6Uy/moDtrdkLRvwRBf+AYeepB8gvSSH5XixKDNiVBE
 pjNy1xXVZQokqD4tjsFelmLttyacR5OabiE/aeVNoFYf9yTwfnN8N3T6kwuOoS4l
 Lp6NA+/0ux+oBlR+Is+JJG8Mxrjvz96yJGZYdR2YP5k3bMQtHAAjuq2w+GgqZm5i
 j3/E6KQepEGaCfC+bHl68xy/kKx8ik+jMCEcBalCC25J3uxbLz41g6K3aI890wJn
 +5ZtfcmoDUk9pnUyLxR8t+UjOSBFAcRSUE+FTjUH1qEGsMPK++9a4iLXz5vYVK1+
 +YCt1u5LNJbkDxE8xVX3F5jkXh0G01SJsuUVAOqHSNfqSNmohFK8/omqhVRrRqoK
 A7cYLtnOGiUXLnvjrwSxPNOzRrG+GAwqaw8gwOTaYogETWbTY8qsSCEVl204uYwd
 m8io9rk2ZXUdDuha56xpBbPE0JHL9hJ2eKCuPkfvRgJT9YFyTh+e0UdX20k+nDjc
 ang1S350o/Y0sus6rij1qS8AuxJIjHucG0GdgpZk3KUbcxoRLhI=
 =qitk
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 LAM (Linear Address Masking) support from Dave Hansen:
 "Add support for the new Linear Address Masking CPU feature.

  This is similar to ARM's Top Byte Ignore and allows userspace to store
  metadata in some bits of pointers without masking it out before use"

* tag 'x86_mm_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/iommu/sva: Do not allow to set FORCE_TAGGED_SVA bit from outside
  x86/mm/iommu/sva: Fix error code for LAM enabling failure due to SVA
  selftests/x86/lam: Add test cases for LAM vs thread creation
  selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking
  selftests/x86/lam: Add inherit test cases for linear-address masking
  selftests/x86/lam: Add io_uring test cases for linear-address masking
  selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking
  selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking
  x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
  iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
  mm: Expose untagging mask in /proc/$PID/status
  x86/mm: Provide arch_prctl() interface for LAM
  x86/mm: Reduce untagged_addr() overhead for systems without LAM
  x86/uaccess: Provide untagged_addr() and remove tags before address check
  mm: Introduce untagged_addr_remote()
  x86/mm: Handle LAM on context switch
  x86: CPUID and CR3/CR4 flags for Linear Address Masking
  x86: Allow atomic MM_CONTEXT flags setting
  x86/mm: Rework address range check in get_user() and put_user()
2023-04-28 09:43:49 -07:00
Maxim Korotkov
3e46c89c74 writeback: fix call of incorrect macro
the variable 'history' is of type u16, it may be an error
 that the hweight32 macro was used for it
 I guess macro hweight16 should be used

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 2a81490811 ("writeback: implement foreign cgroup inode detection")
Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230119104443.3002-1-korotkov.maxim.s@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-28 10:41:32 -06:00
Naohiro Aota
631003e233 btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones
find_next_bit and find_next_zero_bit take @size as the second parameter and
@offset as the third parameter. They are specified opposite in
btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to
detect the empty zones. Fix them and (maybe) return the result a bit
faster.

Note: the naming is a bit confusing, size has two meanings here, bitmap
and our range size.

Fixes: 1cd6121f2a ("btrfs: zoned: implement zoned chunk allocator")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 17:17:25 +02:00
Zhihao Cheng
1dedde6903 ext4: fix i_disksize exceeding i_size problem in paritally written case
It is possible for i_disksize can exceed i_size, triggering a warning.

generic_perform_write
 copied = iov_iter_copy_from_user_atomic(len) // copied < len
 ext4_da_write_end
 | ext4_update_i_disksize
 |  new_i_size = pos + copied;
 |  WRITE_ONCE(EXT4_I(inode)->i_disksize, newsize) // update i_disksize
 | generic_write_end
 |  copied = block_write_end(copied, len) // copied = 0
 |   if (unlikely(copied < len))
 |    if (!PageUptodate(page))
 |     copied = 0;
 |  if (pos + copied > inode->i_size) // return false
 if (unlikely(copied == 0))
  goto again;
 if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
  status = -EFAULT;
  break;
 }

We get i_disksize greater than i_size here, which could trigger WARNING
check 'i_size_read(inode) < EXT4_I(inode)->i_disksize' while doing dio:

ext4_dio_write_iter
 iomap_dio_rw
  __iomap_dio_rw // return err, length is not aligned to 512
 ext4_handle_inode_extension
  WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize) // Oops

 WARNING: CPU: 2 PID: 2609 at fs/ext4/file.c:319
 CPU: 2 PID: 2609 Comm: aa Not tainted 6.3.0-rc2
 RIP: 0010:ext4_file_write_iter+0xbc7
 Call Trace:
  vfs_write+0x3b1
  ksys_write+0x77
  do_syscall_64+0x39

Fix it by updating 'copied' value before updating i_disksize just like
ext4_write_inline_data_end() does.

A reproducer can be found in the buganizer link below.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217209
Fixes: 64769240bd ("ext4: Add delayed allocation support in data=writeback mode")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230321013721.89818-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-28 11:07:19 -04:00
Qu Wenruo
64b5d5b285 btrfs: properly reject clear_cache and v1 cache for block-group-tree
[BUG]
With block-group-tree feature enabled, mounting it with clear_cache
would cause the following transaction abort at mount or remount:

  BTRFS info (device dm-4): force clearing of disk cache
  BTRFS info (device dm-4): using free space tree
  BTRFS info (device dm-4): auto enabling async discard
  BTRFS info (device dm-4): clearing free space tree
  BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes
  BTRFS error (device dm-4): super block corruption detected before writing it to disk
  BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
  BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction.

[CAUSE]
For block-group-tree feature, we have an artificial dependency on
free-space-tree.

This means if we detect block-group-tree without v2 cache, we consider
it a corruption and cause the problem.

For clear_cache mount option, it would temporary disable v2 cache, then
re-enable it.

But unfortunately for that temporary v2 cache disabled status, we refuse
to write a superblock with bg tree only flag, thus leads to the above
transaction abortion.

[FIX]
For now, just reject clear_cache and v1 cache mount option for block
group tree.  So now we got a graceful rejection other than a transaction
abort:

  BTRFS info (device dm-4): force clearing of disk cache
  BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature
  BTRFS error (device dm-4): open_ctree failed

CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:45 +02:00
Filipe Manana
a2cea677db btrfs: print extent buffers when sibling keys check fails
When trying to move keys from one node/leaf to another sibling node/leaf,
if the sibling keys check fails we just print an error message with the
last key of the left sibling and the first key of the right sibling.
However it's also useful to print all the keys of each sibling, as it
may provide some clues to what went wrong, which code path may be
inserting keys in an incorrect order. So just do that, print the siblings
with btrfs_print_tree(), as it works for both leaves and nodes.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:39 +02:00
Filipe Manana
9ae5afd02a btrfs: abort transaction when sibling keys check fails for leaves
If the sibling keys check fails before we move keys from one sibling
leaf to another, we are not aborting the transaction - we leave that to
some higher level caller of btrfs_search_slot() (or anything else that
uses it to insert items into a b+tree).

This means that the transaction abort will provide a stack trace that
omits the b+tree modification call chain. So change this to immediately
abort the transaction and therefore get a more useful stack trace that
shows us the call chain in the bt+tree modification code.

It's also important to immediately abort the transaction just in case
some higher level caller is not doing it, as this indicates a very
serious corruption and we should stop the possibility of doing further
damage.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:37 +02:00
Filipe Manana
611ccc58e1 btrfs: fix leak of source device allocation state after device replace
When a device replace finishes, the source device is freed by calling
btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
allocation state, tracked in the device's alloc_state io tree, is never
freed.

This is a regression recently introduced by commit f0bb5474cf ("btrfs:
remove redundant release of btrfs_device::alloc_state"), which removed a
call to extent_io_tree_release() from btrfs_free_device(), with the
rationale that btrfs_close_one_device() already releases the allocation
state from a device and btrfs_close_one_device() is always called before
a device is freed with btrfs_free_device(). However that is not true for
the device replace case, as btrfs_free_device() is called without any
previous call to btrfs_close_one_device().

The issue is trivial to reproduce, for example, by running test btrfs/027
from fstests:

  $ ./check btrfs/027
  $ rmmod btrfs
  $ dmesg
  (...)
  [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
  [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
  [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
  [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
  [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
  [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
  [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
  [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
  [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
  [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
  [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
  [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1

So fix this by adding back the call to extent_io_tree_release() at
btrfs_free_device().

Fixes: f0bb5474cf ("btrfs: remove redundant release of btrfs_device::alloc_state")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:31 +02:00
xiaoshoukui
ac868bc9d1 btrfs: fix assertion of exclop condition when starting balance
Balance as exclusive state is compatible with paused balance and device
add, which makes some things more complicated. The assertion of valid
states when starting from paused balance needs to take into account two
more states, the combinations can be hit when there are several threads
racing to start balance and device add. This won't typically happen when
the commands are started from command line.

Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE.

Concurrently adding multiple devices to the same mount point and
btrfs_exclop_finish executed finishes before assertion in
btrfs_exclop_balance, exclusive_operation will changed to
BTRFS_EXCLOP_NONE state which lead to assertion failed:

  fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD,
  in fs/btrfs/ioctl.c:456
  Call Trace:
   <TASK>
   btrfs_exclop_balance+0x13c/0x310
   ? memdup_user+0xab/0xc0
   ? PTR_ERR+0x17/0x20
   btrfs_ioctl_add_dev+0x2ee/0x320
   btrfs_ioctl+0x9d5/0x10d0
   ? btrfs_ioctl_encoded_write+0xb80/0xb80
   __x64_sys_ioctl+0x197/0x210
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED.

Concurrently adding multiple devices to the same mount point and
btrfs_exclop_balance executed finish before the latter thread execute
assertion in btrfs_exclop_balance, exclusive_operation will changed to
BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed:

  fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD ||
  fs_info->exclusive_operation == BTRFS_EXCLOP_NONE,
  fs/btrfs/ioctl.c:458
  Call Trace:
   <TASK>
   btrfs_exclop_balance+0x240/0x410
   ? memdup_user+0xab/0xc0
   ? PTR_ERR+0x17/0x20
   btrfs_ioctl_add_dev+0x2ee/0x320
   btrfs_ioctl+0x9d5/0x10d0
   ? btrfs_ioctl_encoded_write+0xb80/0xb80
   __x64_sys_ioctl+0x197/0x210
   do_syscall_64+0x3c/0xb0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

An example of the failed assertion is below, which shows that the
paused balance is also needed to be checked.

  root@syzkaller:/home/xsk# ./repro
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  [  416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0
  Failed to add device /dev/vda, errno 14
  [  416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3
  [  416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  Failed to add device /dev/vda, errno 14
  [  416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3
  Failed to add device /dev/vda, errno 14
  [  416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3
  [  416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1
  Failed to add device /dev/vda, errno 14
  [  416.637759][ T7984] assertion failed: fs_info->exclusive_operation ==
  BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation ==
  BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation ==
  BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458
  [  416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  [  416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7
  [  416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
  [  416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
  [  416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
  [  416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
  [  416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
  [  416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
  [  416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
  [  416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
  [  416.648785][ T7984] FS:  00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
  [  416.649616][ T7984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
  [  416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  416.652502][ T7984] PKRU: 55555554
  [  416.652888][ T7984] Call Trace:
  [  416.653241][ T7984]  <TASK>
  [  416.653527][ T7984]  btrfs_exclop_balance+0x240/0x410
  [  416.654036][ T7984]  ? memdup_user+0xab/0xc0
  [  416.654465][ T7984]  ? PTR_ERR+0x17/0x20
  [  416.654874][ T7984]  btrfs_ioctl_add_dev+0x2ee/0x320
  [  416.655380][ T7984]  btrfs_ioctl+0x9d5/0x10d0
  [  416.655822][ T7984]  ? btrfs_ioctl_encoded_write+0xb80/0xb80
  [  416.656400][ T7984]  __x64_sys_ioctl+0x197/0x210
  [  416.656874][ T7984]  do_syscall_64+0x3c/0xb0
  [  416.657346][ T7984]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  416.657922][ T7984] RIP: 0033:0x4546af
  [  416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [  416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af
  [  416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003
  [  416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f
  [  416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640
  [  416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000
  [  416.664703][ T7984]  </TASK>
  [  416.665040][ T7984] Modules linked in:
  [  416.665590][ T7984] ---[ end trace 0000000000000000 ]---
  [  416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
  [  416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
  [  416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
  [  416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
  [  416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
  [  416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
  [  416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
  [  416.673501][ T7984] FS:  00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
  [  416.674425][ T7984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
  [  416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:27 +02:00
Filipe Manana
6f932d4ef0 btrfs: fix btrfs_prev_leaf() to not return the same key twice
A call to btrfs_prev_leaf() may end up returning a path that points to the
same item (key) again. This happens if while btrfs_prev_leaf(), after we
release the path, a concurrent insertion happens, which moves items off
from a sibling into the front of the previous leaf, and an item with the
computed previous key does not exists.

For example, suppose we have the two following leaves:

  Leaf A

  -------------------------------------------------------------
  | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
  -------------------------------------------------------------
              slot 20             slot 21             slot 22

  Leaf B

  -------------------------------------------------------------
  | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  -------------------------------------------------------------
      slot 0             slot 1             slot 2

If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
a path pointing to leaf B and slot 0 and the following happens:

1) At btrfs_prev_leaf() we compute the previous key to search as:
   (300 96 19), which is a key that does not exists in the tree;

2) Then we call btrfs_release_path() at btrfs_prev_leaf();

3) Some other task inserts a key at leaf A, that sorts before the key at
   slot 20, for example it has an objectid of 299. In order to make room
   for the new key, the key at slot 22 is moved to the front of leaf B.
   This happens at push_leaf_right(), called from split_leaf().

   After this leaf B now looks like:

  --------------------------------------------------------------------------------
  | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
  --------------------------------------------------------------------------------
       slot 0              slot 1             slot 2             slot 3

4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
   previous key: (300 96 19). Since the key does not exists,
   btrfs_search_slot() returns 1 and with a path pointing to leaf B
   and slot 1, the item with key (300 96 20);

5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
   leaf B, the same key as before it was called, since the key at slot 0
   of leaf B (300 96 16) is less than the computed previous key, which is
   (300 96 19);

6) As a consequence btrfs_previous_item() returns a path that points again
   to the item with key (300 96 20).

For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
be functional a problem, despite not making sense to return a new path
pointing again to the same item/key. However for a caller such as
tree-log.c:log_dir_items(), this has a bad consequence, as it can result
in not logging some dir index deletions in case the directory is being
logged without holding the inode's VFS lock (logging triggered while
logging a child inode for example) - for the example scenario above, in
case the dir index keys 17, 18 and 19 were deleted in the current
transaction.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:16:30 +02:00
Linus Torvalds
33afd4b763 Mainly singleton patches all over the place. Series of note are:
- updates to scripts/gdb from Glenn Washburn
 
 - kexec cleanups from Bjorn Helgaas
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr+6wAKCRDdBJ7gKXxA
 jn4NAP4u/hj/kR2dxYehcVLuQqJspCRZZBZlAReFJyHNQO6voAEAk0NN9rtG2+/E
 r0G29CJhK+YL0W6mOs8O1yo9J1rZnAM=
 =2CUV
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "Mainly singleton patches all over the place.

  Series of note are:

   - updates to scripts/gdb from Glenn Washburn

   - kexec cleanups from Bjorn Helgaas"

* tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits)
  mailmap: add entries for Paul Mackerras
  libgcc: add forward declarations for generic library routines
  mailmap: add entry for Oleksandr
  ocfs2: reduce ioctl stack usage
  fs/proc: add Kthread flag to /proc/$pid/status
  ia64: fix an addr to taddr in huge_pte_offset()
  checkpatch: introduce proper bindings license check
  epoll: rename global epmutex
  scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry()
  scripts/gdb: create linux/vfs.py for VFS related GDB helpers
  uapi/linux/const.h: prefer ISO-friendly __typeof__
  delayacct: track delays from IRQ/SOFTIRQ
  scripts/gdb: timerlist: convert int chunks to str
  scripts/gdb: print interrupts
  scripts/gdb: raise error with reduced debugging information
  scripts/gdb: add a Radix Tree Parser
  lib/rbtree: use '+' instead of '|' for setting color.
  proc/stat: remove arch_idle_time()
  checkpatch: check for misuse of the link tags
  checkpatch: allow Closes tags with links
  ...
2023-04-27 19:57:00 -07:00
Linus Torvalds
7fa8a8ee94 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
 
 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
 
 - zsmalloc performance improvements from Sergey Senozhatsky.
 
 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.
 
 - VFS rationalizations from Christoph Hellwig:
 
   - removal of most of the callers of write_one_page().
 
   - make __filemap_get_folio()'s return value more useful
 
 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing.  Use `mount -o noswap'.
 
 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.
 
 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).
 
 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.
 
 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
   than exclusive, and has fixed a bunch of errors which were caused by its
   unintuitive meaning.
 
 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.
 
 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.
 
 - Cleanups to do_fault_around() by Lorenzo Stoakes.
 
 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.
 
 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.
 
 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.
 
 - Lorenzo has also contributed further cleanups of vma_merge().
 
 - Chaitanya Prakash provides some fixes to the mmap selftesting code.
 
 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.
 
 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.
 
 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.
 
 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.
 
 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.
 
 - Yosry Ahmed has improved the performance of memcg statistics flushing.
 
 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.
 
 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.
 
 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.
 
 - Pankaj Raghav has made page_endio() unneeded and has removed it.
 
 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.
 
 - Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
 
 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.
 
 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.
 
 - Peter Xu fixes a few issues with uffd-wp at fork() time.
 
 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
 jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
 eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
 =J+Dj
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
   switching from a user process to a kernel thread.

 - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
   Raghav.

 - zsmalloc performance improvements from Sergey Senozhatsky.

 - Yue Zhao has found and fixed some data race issues around the
   alteration of memcg userspace tunables.

 - VFS rationalizations from Christoph Hellwig:
     - removal of most of the callers of write_one_page()
     - make __filemap_get_folio()'s return value more useful

 - Luis Chamberlain has changed tmpfs so it no longer requires swap
   backing. Use `mount -o noswap'.

 - Qi Zheng has made the slab shrinkers operate locklessly, providing
   some scalability benefits.

 - Keith Busch has improved dmapool's performance, making part of its
   operations O(1) rather than O(n).

 - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
   permitting userspace to wr-protect anon memory unpopulated ptes.

 - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
   rather than exclusive, and has fixed a bunch of errors which were
   caused by its unintuitive meaning.

 - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
   which causes minor faults to install a write-protected pte.

 - Vlastimil Babka has done some maintenance work on vma_merge():
   cleanups to the kernel code and improvements to our userspace test
   harness.

 - Cleanups to do_fault_around() by Lorenzo Stoakes.

 - Mike Rapoport has moved a lot of initialization code out of various
   mm/ files and into mm/mm_init.c.

 - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
   DRM, but DRM doesn't use it any more.

 - Lorenzo has also coverted read_kcore() and vread() to use iterators
   and has thereby removed the use of bounce buffers in some cases.

 - Lorenzo has also contributed further cleanups of vma_merge().

 - Chaitanya Prakash provides some fixes to the mmap selftesting code.

 - Matthew Wilcox changes xfs and afs so they no longer take sleeping
   locks in ->map_page(), a step towards RCUification of pagefaults.

 - Suren Baghdasaryan has improved mmap_lock scalability by switching to
   per-VMA locking.

 - Frederic Weisbecker has reworked the percpu cache draining so that it
   no longer causes latency glitches on cpu isolated workloads.

 - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
   logic.

 - Liu Shixin has changed zswap's initialization so we no longer waste a
   chunk of memory if zswap is not being used.

 - Yosry Ahmed has improved the performance of memcg statistics
   flushing.

 - David Stevens has fixed several issues involving khugepaged,
   userfaultfd and shmem.

 - Christoph Hellwig has provided some cleanup work to zram's IO-related
   code paths.

 - David Hildenbrand has fixed up some issues in the selftest code's
   testing of our pte state changing.

 - Pankaj Raghav has made page_endio() unneeded and has removed it.

 - Peter Xu contributed some rationalizations of the userfaultfd
   selftests.

 - Yosry Ahmed has fixed an issue around memcg's page recalim
   accounting.

 - Chaitanya Prakash has fixed some arm-related issues in the
   selftests/mm code.

 - Longlong Xia has improved the way in which KSM handles hwpoisoned
   pages.

 - Peter Xu fixes a few issues with uffd-wp at fork() time.

 - Stefan Roesch has changed KSM so that it may now be used on a
   per-process and per-cgroup basis.

* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
  mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
  shmem: restrict noswap option to initial user namespace
  mm/khugepaged: fix conflicting mods to collapse_file()
  sparse: remove unnecessary 0 values from rc
  mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
  hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
  maple_tree: fix allocation in mas_sparse_area()
  mm: do not increment pgfault stats when page fault handler retries
  zsmalloc: allow only one active pool compaction context
  selftests/mm: add new selftests for KSM
  mm: add new KSM process and sysfs knobs
  mm: add new api to enable ksm per process
  mm: shrinkers: fix debugfs file permissions
  mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
  migrate_pages_batch: fix statistics for longterm pin retry
  userfaultfd: use helper function range_in_vma()
  lib/show_mem.c: use for_each_populated_zone() simplify code
  mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
  fs/buffer: convert create_page_buffers to folio_create_buffers
  fs/buffer: add folio_create_empty_buffers helper
  ...
2023-04-27 19:42:02 -07:00
Linus Torvalds
0835b5ee87 pstore update for v6.4-rc1
- Revert pmsg_lock back to a normal mutex (John Stultz)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmRJaPkWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtADD/42DSLBg9dAvFHUpXSuffbbhL/w
 HhbfPlsmSujWWWRE3xmEaWJrUQ+Ag6NHyHR7Euko6tBhtj1MhtPle4Di57H5sMid
 8R+7C+3XDmy5WeUF60dribiiKjtNiRIzWefsQyHn4fguaZ5SWHN+iwtvmBofWC44
 YQXaLR5lbxukZTKwiPjdJefS139/QMsKXx3mKu7IdtjjZ5yemH8iTvsQS/2nLkIS
 LWgBN2boopSVtJJslam/29JIhtT9UGoS/ooFJGkoFKXJrVY1+aiqxrYDihgH1K6b
 FoEb/+G/z9M9KxCNGOqv/h+Nl2Oa5L8hdvBy5UsUxhGUNG8/nqsjIwWjJmba9fJu
 3bJfMpsEja955Omq73UFVsgR8OTuy5z91XbR3jJk+4YQlXWgcqvoAYiM0SHX4z7W
 tB1OPCTGDaNLInYA6YHESlbiAmtk/Peizgs9n4PkOeCN26LWGV/FfjR+zorO+6xO
 NNbM1XN/Xdzp/oNwnU3TqRdI6F7v81uQfIiS0VDJoJ7jpHAVQA042l2zwihoopC2
 ErIBKUqpgfGUDxu29QEdfhdwkSfofyjfOzZ5iHYVsvxhn7oS7Xx+zxyp/mFReoIF
 bsqUsAZdCeMgye8wZZmNDlGaLsmLJB/bnt6XqNYMtSzp6ktpIkyBn/rRqhQYRrZK
 g//x5fMMz8fNZK1z0w==
 =5Jr7
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore update from Kees Cook:

 - Revert pmsg_lock back to a normal mutex (John Stultz)

* tag 'pstore-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Revert pmsg_lock back to a normal mutex
2023-04-27 17:03:40 -07:00
Linus Torvalds
888d3c9f7f sysctl-6.4-rc1
This pull request goes with only a few sysctl moves from the
 kernel/sysctl.c file, the rest of the work has been put towards
 deprecating two API calls which incur recursion and prevent us
 from simplifying the registration process / saving memory per
 move. Most of the changes have been soaking on linux-next since
 v6.3-rc3.
 
 I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
 feedback that we should see if we could *save* memory with these
 moves instead of incurring more memory. We currently incur more
 memory since when we move a syctl from kernel/sysclt.c out to its
 own file we end up having to add a new empty sysctl used to register
 it. To achieve saving memory we want to allow syctls to be passed
 without requiring the end element being empty, and just have our
 registration process rely on ARRAY_SIZE(). Without this, supporting
 both styles of sysctls would make the sysctl registration pretty
 brittle, hard to read and maintain as can be seen from Meng Tang's
 efforts to do just this [0]. Fortunately, in order to use ARRAY_SIZE()
 for all sysctl registrations also implies doing the work to deprecate
 two API calls which use recursion in order to support sysctl
 declarations with subdirectories.
 
 And so during this development cycle quite a bit of effort went into
 this deprecation effort. I've annotated the following two APIs are
 deprecated and in few kernel releases we should be good to remove them:
 
   * register_sysctl_table()
   * register_sysctl_paths()
 
 During this merge window we should be able to deprecate and unexport
 register_sysctl_paths(), we can probably do that towards the end
 of this merge window.
 
 Deprecating register_sysctl_table() will take a bit more time but
 this pull request goes with a few example of how to do this.
 
 As it turns out each of the conversions to move away from either of
 these two API calls *also* saves memory. And so long term, all these
 changes *will* prove to have saved a bit of memory on boot.
 
 The way I see it then is if remove a user of one deprecated call, it
 gives us enough savings to move one kernel/sysctl.c out from the
 generic arrays as we end up with about the same amount of bytes.
 
 Since deprecating register_sysctl_table() and register_sysctl_paths()
 does not require maintainer coordination except the final unexport
 you'll see quite a bit of these changes from other pull requests, I've
 just kept the stragglers after rc3.
 
 Most of these changes have been soaking on linux-next since around rc3.
 
 [0] https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRHAjQSHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinTzgQAI/uKHKi0VlUR1l2Psl0XbseUVueuyj3
 ZDxSJpbVUmsoDf2MlLjzB8mYE3ricnNTDbLr7qOyA6pXdM1N0mY5LQmRVRu8/ffd
 2T1hQ5pl7YnJdWP5dPhcF9Y+jnu1tjX1MW5DS4fzllwK7FnD86HuIruGq52RAPS/
 /FH+BD9eodLWWXk6A/o2GFqoWxPKQI0GLxEYWa7Hg7yt8E/3PQL9QsRzn8i6U+HW
 BrN/+G3YD1VCCzXu0UAeXnm+i1Z7CdvqNdZuSkvE3DObiZ5WpOS+/i7FrDB7zdiu
 zAbHaifHnDPtcK3w2ZodbLAAwEWD/mG4iwIjE2kgIMVYxBv7TFDBRREXAWYAevIT
 UUuZnWDQsGaWdjywrebaUycEfd6dytKyan0fTXgMFkcoWRjejhitfdM2iZDdQROg
 q453p4HqOw4vTrhy4ov4zOX7J3EFiBzpZdl+SmLqcXk+jbLVb/Q9snUWz1AFtHBl
 gHoP5bS82uVktGG3MsObjgTzYYMQjO9YGIrVuW1VP9uWs8WaoWx6M9FQJIIhtwE+
 h6wG2s7CjuFWnS0/IxWmDOn91QyUn1w7ohiz9TuvYj/5GLSBpBDGCJHsNB5T2WS1
 qbQRaZ2Kg3j9TeyWfXxdlxBx7bt3ni+J/IXDY0zom2sTpGHKl8D2g5AzmEXJDTpl
 kd7Z3gsmwhDh
 =0U0W
 -----END PGP SIGNATURE-----

Merge tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull sysctl updates from Luis Chamberlain:
 "This only does a few sysctl moves from the kernel/sysctl.c file, the
  rest of the work has been put towards deprecating two API calls which
  incur recursion and prevent us from simplifying the registration
  process / saving memory per move. Most of the changes have been
  soaking on linux-next since v6.3-rc3.

  I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
  feedback that we should see if we could *save* memory with these moves
  instead of incurring more memory. We currently incur more memory since
  when we move a syctl from kernel/sysclt.c out to its own file we end
  up having to add a new empty sysctl used to register it. To achieve
  saving memory we want to allow syctls to be passed without requiring
  the end element being empty, and just have our registration process
  rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
  would make the sysctl registration pretty brittle, hard to read and
  maintain as can be seen from Meng Tang's efforts to do just this [0].
  Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
  also implies doing the work to deprecate two API calls which use
  recursion in order to support sysctl declarations with subdirectories.

  And so during this development cycle quite a bit of effort went into
  this deprecation effort. I've annotated the following two APIs are
  deprecated and in few kernel releases we should be good to remove
  them:

   - register_sysctl_table()
   - register_sysctl_paths()

  During this merge window we should be able to deprecate and unexport
  register_sysctl_paths(), we can probably do that towards the end of
  this merge window.

  Deprecating register_sysctl_table() will take a bit more time but this
  pull request goes with a few example of how to do this.

  As it turns out each of the conversions to move away from either of
  these two API calls *also* saves memory. And so long term, all these
  changes *will* prove to have saved a bit of memory on boot.

  The way I see it then is if remove a user of one deprecated call, it
  gives us enough savings to move one kernel/sysctl.c out from the
  generic arrays as we end up with about the same amount of bytes.

  Since deprecating register_sysctl_table() and register_sysctl_paths()
  does not require maintainer coordination except the final unexport
  you'll see quite a bit of these changes from other pull requests, I've
  just kept the stragglers after rc3"

Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]

* tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
  fs: fix sysctls.c built
  mm: compaction: remove incorrect #ifdef checks
  mm: compaction: move compaction sysctl to its own file
  mm: memory-failure: Move memory failure sysctls to its own file
  arm: simplify two-level sysctl registration for ctl_isa_vars
  ia64: simplify one-level sysctl registration for kdump_ctl_table
  utsname: simplify one-level sysctl registration for uts_kern_table
  ntfs: simplfy one-level sysctl registration for ntfs_sysctls
  coda: simplify one-level sysctl registration for coda_table
  fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
  xfs: simplify two-level sysctl registration for xfs_table
  nfs: simplify two-level sysctl registration for nfs_cb_sysctls
  nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
  lockd: simplify two-level sysctl registration for nlm_sysctls
  proc_sysctl: enhance documentation
  xen: simplify sysctl registration for balloon
  md: simplify sysctl registration
  hv: simplify sysctl registration
  scsi: simplify sysctl registration with register_sysctl()
  csky: simplify alignment sysctl registration
  ...
2023-04-27 16:52:33 -07:00
Linus Torvalds
b6a7828502 modules-6.4-rc1
The summary of the changes for this pull requests is:
 
  * Song Liu's new struct module_memory replacement
  * Nick Alcock's MODULE_LICENSE() removal for non-modules
  * My cleanups and enhancements to reduce the areas where we vmalloc
    module memory for duplicates, and the respective debug code which
    proves the remaining vmalloc pressure comes from userspace.
 
 Most of the changes have been in linux-next for quite some time except
 the minor fixes I made to check if a module was already loaded
 prior to allocating the final module memory with vmalloc and the
 respective debug code it introduces to help clarify the issue. Although
 the functional change is small it is rather safe as it can only *help*
 reduce vmalloc space for duplicates and is confirmed to fix a bootup
 issue with over 400 CPUs with KASAN enabled. I don't expect stable
 kernels to pick up that fix as the cleanups would have also had to have
 been picked up. Folks on larger CPU systems with modules will want to
 just upgrade if vmalloc space has been an issue on bootup.
 
 Given the size of this request, here's some more elaborate details
 on this pull request.
 
 The functional change change in this pull request is the very first
 patch from Song Liu which replaces the struct module_layout with a new
 struct module memory. The old data structure tried to put together all
 types of supported module memory types in one data structure, the new
 one abstracts the differences in memory types in a module to allow each
 one to provide their own set of details. This paves the way in the
 future so we can deal with them in a cleaner way. If you look at changes
 they also provide a nice cleanup of how we handle these different memory
 areas in a module. This change has been in linux-next since before the
 merge window opened for v6.3 so to provide more than a full kernel cycle
 of testing. It's a good thing as quite a bit of fixes have been found
 for it.
 
 Jason Baron then made dynamic debug a first class citizen module user by
 using module notifier callbacks to allocate / remove module specific
 dynamic debug information.
 
 Nick Alcock has done quite a bit of work cross-tree to remove module
 license tags from things which cannot possibly be module at my request
 so to:
 
   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area
      is active with no clear solution in sight.
 
   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags
 
 In so far as a) is concerned, although module license tags are a no-op
 for non-modules, tools which would want create a mapping of possible
 modules can only rely on the module license tag after the commit
 8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin
 or tristate.conf").  Nick has been working on this *for years* and
 AFAICT I was the only one to suggest two alternatives to this approach
 for tooling. The complexity in one of my suggested approaches lies in
 that we'd need a possible-obj-m and a could-be-module which would check
 if the object being built is part of any kconfig build which could ever
 lead to it being part of a module, and if so define a new define
 -DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
 suggested would be to have a tristate in kconfig imply the same new
 -DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
 mapping to modules always, and I don't think that's the case today. I am
 not aware of Nick or anyone exploring either of these options. Quite
 recently Josh Poimboeuf has pointed out that live patching, kprobes and
 BPF would benefit from resolving some part of the disambiguation as
 well but for other reasons. The function granularity KASLR (fgkaslr)
 patches were mentioned but Joe Lawrence has clarified this effort has
 been dropped with no clear solution in sight [1].
 
 In the meantime removing module license tags from code which could never
 be modules is welcomed for both objectives mentioned above. Some
 developers have also welcomed these changes as it has helped clarify
 when a module was never possible and they forgot to clean this up,
 and so you'll see quite a bit of Nick's patches in other pull
 requests for this merge window. I just picked up the stragglers after
 rc3. LWN has good coverage on the motivation behind this work [2] and
 the typical cross-tree issues he ran into along the way. The only
 concrete blocker issue he ran into was that we should not remove the
 MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
 they can never be modules. Nick ended up giving up on his efforts due
 to having to do this vetting and backlash he ran into from folks who
 really did *not understand* the core of the issue nor were providing
 any alternative / guidance. I've gone through his changes and dropped
 the patches which dropped the module license tags where an SPDX
 license tag was missing, it only consisted of 11 drivers.  To see
 if a pull request deals with a file which lacks SPDX tags you
 can just use:
 
   ./scripts/spdxcheck.py -f \
 	$(git diff --name-only commid-id | xargs echo)
 
 You'll see a core module file in this pull request for the above,
 but that's not related to his changes. WE just need to add the SPDX
 license tag for the kernel/module/kmod.c file in the future but
 it demonstrates the effectiveness of the script.
 
 Most of Nick's changes were spread out through different trees,
 and I just picked up the slack after rc3 for the last kernel was out.
 Those changes have been in linux-next for over two weeks.
 
 The cleanups, debug code I added and final fix I added for modules
 were motivated by David Hildenbrand's report of boot failing on
 a systems with over 400 CPUs when KASAN was enabled due to running
 out of virtual memory space. Although the functional change only
 consists of 3 lines in the patch "module: avoid allocation if module is
 already present and ready", proving that this was the best we can
 do on the modules side took quite a bit of effort and new debug code.
 
 The initial cleanups I did on the modules side of things has been
 in linux-next since around rc3 of the last kernel, the actual final
 fix for and debug code however have only been in linux-next for about a
 week or so but I think it is worth getting that code in for this merge
 window as it does help fix / prove / evaluate the issues reported
 with larger number of CPUs. Userspace is not yet fixed as it is taking
 a bit of time for folks to understand the crux of the issue and find a
 proper resolution. Worst come to worst, I have a kludge-of-concept [3]
 of how to make kernel_read*() calls for modules unique / converge them,
 but I'm currently inclined to just see if userspace can fix this
 instead.
 
 [0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
 [1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
 [2] https://lwn.net/Articles/927569/
 [3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
 ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
 hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
 oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
 jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
 YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
 WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
 p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
 Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
 c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
 eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
 tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
 agiwdHUMVN7X
 =56WK
 -----END PGP SIGNATURE-----

Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux

Pull module updates from Luis Chamberlain:
 "The summary of the changes for this pull requests is:

   - Song Liu's new struct module_memory replacement

   - Nick Alcock's MODULE_LICENSE() removal for non-modules

   - My cleanups and enhancements to reduce the areas where we vmalloc
     module memory for duplicates, and the respective debug code which
     proves the remaining vmalloc pressure comes from userspace.

  Most of the changes have been in linux-next for quite some time except
  the minor fixes I made to check if a module was already loaded prior
  to allocating the final module memory with vmalloc and the respective
  debug code it introduces to help clarify the issue. Although the
  functional change is small it is rather safe as it can only *help*
  reduce vmalloc space for duplicates and is confirmed to fix a bootup
  issue with over 400 CPUs with KASAN enabled. I don't expect stable
  kernels to pick up that fix as the cleanups would have also had to
  have been picked up. Folks on larger CPU systems with modules will
  want to just upgrade if vmalloc space has been an issue on bootup.

  Given the size of this request, here's some more elaborate details:

  The functional change change in this pull request is the very first
  patch from Song Liu which replaces the 'struct module_layout' with a
  new 'struct module_memory'. The old data structure tried to put
  together all types of supported module memory types in one data
  structure, the new one abstracts the differences in memory types in a
  module to allow each one to provide their own set of details. This
  paves the way in the future so we can deal with them in a cleaner way.
  If you look at changes they also provide a nice cleanup of how we
  handle these different memory areas in a module. This change has been
  in linux-next since before the merge window opened for v6.3 so to
  provide more than a full kernel cycle of testing. It's a good thing as
  quite a bit of fixes have been found for it.

  Jason Baron then made dynamic debug a first class citizen module user
  by using module notifier callbacks to allocate / remove module
  specific dynamic debug information.

  Nick Alcock has done quite a bit of work cross-tree to remove module
  license tags from things which cannot possibly be module at my request
  so to:

   a) help him with his longer term tooling goals which require a
      deterministic evaluation if a piece a symbol code could ever be
      part of a module or not. But quite recently it is has been made
      clear that tooling is not the only one that would benefit.
      Disambiguating symbols also helps efforts such as live patching,
      kprobes and BPF, but for other reasons and R&D on this area is
      active with no clear solution in sight.

   b) help us inch closer to the now generally accepted long term goal
      of automating all the MODULE_LICENSE() tags from SPDX license tags

  In so far as a) is concerned, although module license tags are a no-op
  for non-modules, tools which would want create a mapping of possible
  modules can only rely on the module license tag after the commit
  8b41fc4454 ("kbuild: create modules.builtin without
  Makefile.modbuiltin or tristate.conf").

  Nick has been working on this *for years* and AFAICT I was the only
  one to suggest two alternatives to this approach for tooling. The
  complexity in one of my suggested approaches lies in that we'd need a
  possible-obj-m and a could-be-module which would check if the object
  being built is part of any kconfig build which could ever lead to it
  being part of a module, and if so define a new define
  -DPOSSIBLE_MODULE [0].

  A more obvious yet theoretical approach I've suggested would be to
  have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
  well but that means getting kconfig symbol names mapping to modules
  always, and I don't think that's the case today. I am not aware of
  Nick or anyone exploring either of these options. Quite recently Josh
  Poimboeuf has pointed out that live patching, kprobes and BPF would
  benefit from resolving some part of the disambiguation as well but for
  other reasons. The function granularity KASLR (fgkaslr) patches were
  mentioned but Joe Lawrence has clarified this effort has been dropped
  with no clear solution in sight [1].

  In the meantime removing module license tags from code which could
  never be modules is welcomed for both objectives mentioned above. Some
  developers have also welcomed these changes as it has helped clarify
  when a module was never possible and they forgot to clean this up, and
  so you'll see quite a bit of Nick's patches in other pull requests for
  this merge window. I just picked up the stragglers after rc3. LWN has
  good coverage on the motivation behind this work [2] and the typical
  cross-tree issues he ran into along the way. The only concrete blocker
  issue he ran into was that we should not remove the MODULE_LICENSE()
  tags from files which have no SPDX tags yet, even if they can never be
  modules. Nick ended up giving up on his efforts due to having to do
  this vetting and backlash he ran into from folks who really did *not
  understand* the core of the issue nor were providing any alternative /
  guidance. I've gone through his changes and dropped the patches which
  dropped the module license tags where an SPDX license tag was missing,
  it only consisted of 11 drivers. To see if a pull request deals with a
  file which lacks SPDX tags you can just use:

    ./scripts/spdxcheck.py -f \
	$(git diff --name-only commid-id | xargs echo)

  You'll see a core module file in this pull request for the above, but
  that's not related to his changes. WE just need to add the SPDX
  license tag for the kernel/module/kmod.c file in the future but it
  demonstrates the effectiveness of the script.

  Most of Nick's changes were spread out through different trees, and I
  just picked up the slack after rc3 for the last kernel was out. Those
  changes have been in linux-next for over two weeks.

  The cleanups, debug code I added and final fix I added for modules
  were motivated by David Hildenbrand's report of boot failing on a
  systems with over 400 CPUs when KASAN was enabled due to running out
  of virtual memory space. Although the functional change only consists
  of 3 lines in the patch "module: avoid allocation if module is already
  present and ready", proving that this was the best we can do on the
  modules side took quite a bit of effort and new debug code.

  The initial cleanups I did on the modules side of things has been in
  linux-next since around rc3 of the last kernel, the actual final fix
  for and debug code however have only been in linux-next for about a
  week or so but I think it is worth getting that code in for this merge
  window as it does help fix / prove / evaluate the issues reported with
  larger number of CPUs. Userspace is not yet fixed as it is taking a
  bit of time for folks to understand the crux of the issue and find a
  proper resolution. Worst come to worst, I have a kludge-of-concept [3]
  of how to make kernel_read*() calls for modules unique / converge
  them, but I'm currently inclined to just see if userspace can fix this
  instead"

Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]

* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
  module: add debugging auto-load duplicate module support
  module: stats: fix invalid_mod_bytes typo
  module: remove use of uninitialized variable len
  module: fix building stats for 32-bit targets
  module: stats: include uapi/linux/module.h
  module: avoid allocation if module is already present and ready
  module: add debug stats to help identify memory pressure
  module: extract patient module check into helper
  modules/kmod: replace implementation with a semaphore
  Change DEFINE_SEMAPHORE() to take a number argument
  module: fix kmemleak annotations for non init ELF sections
  module: Ignore L0 and rename is_arm_mapping_symbol()
  module: Move is_arm_mapping_symbol() to module_symbol.h
  module: Sync code of is_arm_mapping_symbol()
  scripts/gdb: use mem instead of core_layout to get the module address
  interconnect: remove module-related code
  interconnect: remove MODULE_LICENSE in non-modules
  zswap: remove MODULE_LICENSE in non-modules
  zpool: remove MODULE_LICENSE in non-modules
  x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
  ...
2023-04-27 16:36:55 -07:00
Chuck Lever
9280c57743 NFSD: Handle new xprtsec= export option
Enable administrators to require clients to use transport layer
security when accessing particular exports.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Chuck Lever
22b620ec0b NFSD: Clean up xattr memory allocation flags
Tetsuo Handa points out:
> Since GFP_KERNEL is "GFP_NOFS | __GFP_FS", usage like
> "GFP_KERNEL | GFP_NOFS" does not make sense.

The original intent was to hold the inode lock while estimating
the buffer requirements for the requested information. Frank van
der Linden, the author of NFSD's xattr code, says:

> ... you need inode_lock to get an atomic view of an xattr. Since
> both nfsd_getxattr and nfsd_listxattr to the standard trick of
> querying the xattr length with a NULL buf argument (just getting
> the length back), allocating the right buffer size, and then
> querying again, they need to hold the inode lock to avoid having
> the xattr changed from under them while doing that.
>
> From that then flows the requirement that GFP_FS could cause
> problems while holding i_rwsem, so I added GFP_NOFS.

However, Dave Chinner states:
> You can do GFP_KERNEL allocations holding the i_rwsem just fine.
> All that it requires is the caller holds a reference to the
> inode ...

Since these code paths acquire a dentry, they do indeed hold a
reference. It is therefore safe to use GFP_KERNEL for these memory
allocations. In particular, that's what this code is already doing;
but now the C source code looks sane too.

At a later time we can revisit in order to remove the inode lock in
favor of simply retrying if the estimated buffer size is too small.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Dai Ngo
147abcacee NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop
The following request sequence to the same file causes the NFS client and
server getting into an infinite loop with COMMIT and NFS4ERR_DELAY:

OPEN
REMOVE
WRITE
COMMIT

Problem reported by recall11, recall12, recall14, recall20, recall22,
recall40, recall42, recall48, recall50 of nfstest suite.

This patch restores the handling of race condition in nfsd_file_do_acquire
with unlink to that prior of the regression.

Fixes: ac3a2585f0 ("nfsd: rework refcounting in filecache")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-27 18:49:24 -04:00
Linus Torvalds
556eb8b791 Driver core changes for 6.4-rc1
Here is the large set of driver core changes for 6.4-rc1.
 
 Once again, a busy development cycle, with lots of changes happening in
 the driver core in the quest to be able to move "struct bus" and "struct
 class" into read-only memory, a task now complete with these changes.
 
 This will make the future rust interactions with the driver core more
 "provably correct" as well as providing more obvious lifetime rules for
 all busses and classes in the kernel.
 
 The changes required for this did touch many individual classes and
 busses as many callbacks were changed to take const * parameters
 instead.  All of these changes have been submitted to the various
 subsystem maintainers, giving them plenty of time to review, and most of
 them actually did so.
 
 Other than those changes, included in here are a small set of other
 things:
   - kobject logging improvements
   - cacheinfo improvements and updates
   - obligatory fw_devlink updates and fixes
   - documentation updates
   - device property cleanups and const * changes
   - firwmare loader dependency fixes.
 
 All of these have been in linux-next for a while with no reported
 problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
 LEGadNS38k5fs+73UaxV
 =7K4B
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the large set of driver core changes for 6.4-rc1.

  Once again, a busy development cycle, with lots of changes happening
  in the driver core in the quest to be able to move "struct bus" and
  "struct class" into read-only memory, a task now complete with these
  changes.

  This will make the future rust interactions with the driver core more
  "provably correct" as well as providing more obvious lifetime rules
  for all busses and classes in the kernel.

  The changes required for this did touch many individual classes and
  busses as many callbacks were changed to take const * parameters
  instead. All of these changes have been submitted to the various
  subsystem maintainers, giving them plenty of time to review, and most
  of them actually did so.

  Other than those changes, included in here are a small set of other
  things:

   - kobject logging improvements

   - cacheinfo improvements and updates

   - obligatory fw_devlink updates and fixes

   - documentation updates

   - device property cleanups and const * changes

   - firwmare loader dependency fixes.

  All of these have been in linux-next for a while with no reported
  problems"

* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
  device property: make device_property functions take const device *
  driver core: update comments in device_rename()
  driver core: Don't require dynamic_debug for initcall_debug probe timing
  firmware_loader: rework crypto dependencies
  firmware_loader: Strip off \n from customized path
  zram: fix up permission for the hot_add sysfs file
  cacheinfo: Add use_arch[|_cache]_info field/function
  arch_topology: Remove early cacheinfo error message if -ENOENT
  cacheinfo: Check cache properties are present in DT
  cacheinfo: Check sib_leaf in cache_leaves_are_shared()
  cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
  cacheinfo: Add arm64 early level initializer implementation
  cacheinfo: Add arch specific early level initializer
  tty: make tty_class a static const structure
  driver core: class: remove struct class_interface * from callbacks
  driver core: class: mark the struct class in struct class_interface constant
  driver core: class: make class_register() take a const *
  driver core: class: mark class_release() as taking a const *
  driver core: remove incorrect comment for device_create*
  MIPS: vpe-cmp: remove module owner pointer from struct class usage.
  ...
2023-04-27 11:53:57 -07:00
Bharath SM
d906be3fa5 SMB3: Close deferred file handles in case of handle lease break
We should not cache deferred file handles if we dont have
handle lease on a file. And we should immediately close all
deferred handles in case of handle lease break.

Fixes: 9e31678fb4 ("SMB3: fix lease break timeout when multiple deferred close handles for the same file.")
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-27 11:03:33 -05:00
Bharath SM
ab9ddc87a9 SMB3: Add missing locks to protect deferred close file list
cifs_del_deferred_close function has a critical section which modifies
the deferred close file list. We must acquire deferred_lock before
calling cifs_del_deferred_close function.

Fixes: ca08d0eac0 ("cifs: Fix memory leak on the deferred close")
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Acked-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-27 11:00:10 -05:00
Linus Torvalds
6e98b09da9 Networking changes for 6.4.
Core
 ----
 
  - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
    default value allows for better BIG TCP performances.
 
  - Reduce compound page head access for zero-copy data transfers.
 
  - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when possible.
 
  - Threaded NAPI improvements, adding defer skb free support and unneeded
    softirq avoidance.
 
  - Address dst_entry reference count scalability issues, via false
    sharing avoidance and optimize refcount tracking.
 
  - Add lockless accesses annotation to sk_err[_soft].
 
  - Optimize again the skb struct layout.
 
  - Extends the skb drop reasons to make it usable by multiple
    subsystems.
 
  - Better const qualifier awareness for socket casts.
 
 BPF
 ---
 
  - Add skb and XDP typed dynptrs which allow BPF programs for more
    ergonomic and less brittle iteration through data and variable-sized
    accesses.
 
  - Add a new BPF netfilter program type and minimal support to hook
    BPF programs to netfilter hooks such as prerouting or forward.
 
  - Add more precise memory usage reporting for all BPF map types.
 
  - Adds support for using {FOU,GUE} encap with an ipip device operating
    in collect_md mode and add a set of BPF kfuncs for controlling encap
    params.
 
  - Allow BPF programs to detect at load time whether a particular kfunc
    exists or not, and also add support for this in light skeleton.
 
  - Bigger batch of BPF verifier improvements to prepare for upcoming BPF
    open-coded iterators allowing for less restrictive looping capabilities.
 
  - Rework RCU enforcement in the verifier, add kptr_rcu and enforce BPF
    programs to NULL-check before passing such pointers into kfunc.
 
  - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and in
    local storage maps.
 
  - Enable RCU semantics for task BPF kptrs and allow referenced kptr
    tasks to be stored in BPF maps.
 
  - Add support for refcounted local kptrs to the verifier for allowing
    shared ownership, useful for adding a node to both the BPF list and
    rbtree.
 
  - Add BPF verifier support for ST instructions in convert_ctx_access()
    which will help new -mcpu=v4 clang flag to start emitting them.
 
  - Add ARM32 USDT support to libbpf.
 
  - Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations.
 
 Protocols
 ---------
 
  - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
    indicates the provenance of the IP address.
 
  - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition.
 
  - Add the handshake upcall mechanism, allowing the user-space
    to implement generic TLS handshake on kernel's behalf.
 
  - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
    resilience to nodes failures.
 
  - SCTP: add support for Fair Capacity and Weighted Fair Queueing
    schedulers.
 
  - MPTCP: delay first subflow allocation up to its first usage. This
    will allow for later better LSM interaction.
 
  - xfrm: Remove inner/outer modes from input/output path. These are
    not needed anymore.
 
  - WiFi:
    - reduced neighbor report (RNR) handling for AP mode
    - HW timestamping support
    - support for randomized auth/deauth TA for PASN privacy
    - per-link debugfs for multi-link
    - TC offload support for mac80211 drivers
    - mac80211 mesh fast-xmit and fast-rx support
    - enable Wi-Fi 7 (EHT) mesh support
 
 Netfilter
 ---------
 
  - Add nf_tables 'brouting' support, to force a packet to be routed
    instead of being bridged.
 
  - Update bridge netfilter and ovs conntrack helpers to handle
    IPv6 Jumbo packets properly, i.e. fetch the packet length
    from hop-by-hop extension header. This is needed for BIT TCP
    support.
 
  - The iptables 32bit compat interface isn't compiled in by default
    anymore.
 
  - Move ip(6)tables builtin icmp matches to the udptcp one.
    This has the advantage that icmp/icmpv6 match doesn't load the
    iptables/ip6tables modules anymore when iptables-nft is used.
 
  - Extended netlink error report for netdevice in flowtables and
    netdev/chains. Allow for incrementally add/delete devices to netdev
    basechain. Allow to create netdev chain without device.
 
 Driver API
 ----------
 
  - Remove redundant Device Control Error Reporting Enable, as PCI core
    has already error reporting enabled at enumeration time.
 
  - Move Multicast DB netlink handlers to core, allowing devices other
    then bridge to use them.
 
  - Allow the page_pool to directly recycle the pages from safely
    localized NAPI.
 
  - Implement lockless TX queue stop/wake combo macros, allowing for
    further code de-duplication and sanitization.
 
  - Add YNL support for user headers and struct attrs.
 
  - Add partial YNL specification for devlink.
 
  - Add partial YNL specification for ethtool.
 
  - Add tc-mqprio and tc-taprio support for preemptible traffic classes.
 
  - Add tx push buf len param to ethtool, specifies the maximum number
    of bytes of a transmitted packet a driver can push directly to the
    underlying device.
 
  - Add basic LED support for switch/phy.
 
  - Add NAPI documentation, stop relaying on external links.
 
  - Convert dsa_master_ioctl() to netdev notifier. This is a preparatory
    work to make the hardware timestamping layer selectable by user
    space.
 
  - Add transceiver support and improve the error messages for CAN-FD
    controllers.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - AMD/Pensando core device support
    - MediaTek MT7981 SoC
    - MediaTek MT7988 SoC
    - Broadcom BCM53134 embedded switch
    - Texas Instruments CPSW9G ethernet switch
    - Qualcomm EMAC3 DWMAC ethernet
    - StarFive JH7110 SoC
    - NXP CBTX ethernet PHY
 
  - WiFi:
    - Apple M1 Pro/Max devices
    - RealTek rtl8710bu/rtl8188gu
    - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
 
  - Bluetooth:
    - Realtek RTL8821CS, RTL8851B, RTL8852BS
    - Mediatek MT7663, MT7922
    - NXP w8997
    - Actions Semi ATS2851
    - QTI WCN6855
    - Marvell 88W8997
 
  - Can:
    - STMicroelectronics bxcan stm32f429
 
 Drivers
 -------
  - Ethernet NICs:
    - Intel (1G, icg):
      - add tracking and reporting of QBV config errors.
      - add support for configuring max SDU for each Tx queue.
    - Intel (100G, ice):
      - refactor mailbox overflow detection to support Scalable IOV
      - GNSS interface optimization
    - Intel (i40e):
      - support XDP multi-buffer
    - nVidia/Mellanox:
      - add the support for linux bridge multicast offload
      - enable TC offload for egress and engress MACVLAN over bond
      - add support for VxLAN GBP encap/decap flows offload
      - extend packet offload to fully support libreswan
      - support tunnel mode in mlx5 IPsec packet offload
      - extend XDP multi-buffer support
      - support MACsec VLAN offload
      - add support for dynamic msix vectors allocation
      - drop RX page_cache and fully use page_pool
      - implement thermal zone to report NIC temperature
    - Netronome/Corigine:
      - add support for multi-zone conntrack offload
    - Solarflare/Xilinx:
      - support offloading TC VLAN push/pop actions to the MAE
      - support TC decap rules
      - support unicast PTP
 
  - Other NICs:
    - Broadcom (bnxt): enforce software based freq adjustments only
 		on shared PHC NIC
    - RealTek (r8169): refactor to addess ASPM issues during NAPI poll.
    - Micrel (lan8841): add support for PTP_PF_PEROUT
    - Cadence (macb): enable PTP unicast
    - Engleder (tsnep): add XDP socket zero-copy support
    - virtio-net: implement exact header length guest feature
    - veth: add page_pool support for page recycling
    - vxlan: add MDB data path support
    - gve: add XDP support for GQI-QPL format
    - geneve: accept every ethertype
    - macvlan: allow some packets to bypass broadcast queue
    - mana: add support for jumbo frame
 
  - Ethernet high-speed switches:
    - Microchip (sparx5): Add support for TC flower templates.
 
  - Ethernet embedded switches:
    - Broadcom (b54):
      - configure 6318 and 63268 RGMII ports
    - Marvell (mv88e6xxx):
      - faster C45 bus scan
    - Microchip:
      - lan966x:
        - add support for IS1 VCAP
        - better TX/RX from/to CPU performances
      - ksz9477: add ETS Qdisc support
      - ksz8: enhance static MAC table operations and error handling
      - sama7g5: add PTP capability
    - NXP (ocelot):
      - add support for external ports
      - add support for preemptible traffic classes
    - Texas Instruments:
      - add CPSWxG SGMII support for J7200 and J721E
 
  - Intel WiFi (iwlwifi):
    - preparation for Wi-Fi 7 EHT and multi-link support
    - EHT (Wi-Fi 7) sniffer support
    - hardware timestamping support for some devices/firwmares
    - TX beacon protection on newer hardware
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - MU-MIMO parameters support
    - ack signal support for management packets
 
  - RealTek WiFi (rtw88):
    - SDIO bus support
    - better support for some SDIO devices
      (e.g. MAC address from efuse)
 
  - RealTek WiFi (rtw89):
    - HW scan support for 8852b
    - better support for 6 GHz scanning
    - support for various newer firmware APIs
    - framework firmware backwards compatibility
 
  - MediaTek WiFi (mt76):
    - P2P support
    - mesh A-MSDU support
    - EHT (Wi-Fi 7) support
    - coredump support
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRI/mUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkgO0QAJGxpuN67YgYV0BIM+/atWKEEexJYG7B
 9MMpU4jMO3EW/pUS5t7VRsBLUybLYVPmqCZoHodObDfnu59jiPOegb6SikJv/ZwJ
 Zw62PVk5MvDnQjlu4e6kDcGwkplteN08TlgI+a49BUTedpdFitrxHAYGW8f2fRO6
 cK2XSld+ZucMoym5vRwf8yWS1BwdxnslPMxDJ+/8ZbWBZv44qAnG2vMB/kIx7ObC
 Vel/4m6MzTwVsLYBsRvcwMVbNNlZ9GuhztlTzEbfGA4ZhTadIAMgb5VTWXB84Ws7
 Aic5wTdli+q+x6/2cxhbyeoVuB9HHObYmLBAciGg4GNljP5rnQBY3X3+KVZ/x9TI
 HQB7CmhxmAZVrO9pLARFV+ECrMTH2/dy3NyrZ7uYQ3WPOXJi8hJZjOTO/eeEGL7C
 eTjdz0dZBWIBK2gON/6s4nExXVQUTEF2ZsPi52jTTClKjfe5pz/ddeFQIWaY1DTm
 pInEiWPAvd28JyiFmhFNHsuIBCjX/Zqe2JuMfMBeBibDAC09o/OGdKJYUI15AiRf
 F46Pdb7use/puqfrYW44kSAfaPYoBiE+hj1RdeQfen35xD9HVE4vdnLNeuhRlFF9
 aQfyIRHYQofkumRDr5f8JEY66cl9NiKQ4IVW1xxQfYDNdC6wQqREPG1md7rJVMrJ
 vP7ugFnttneg
 =ITVa
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
     default value allows for better BIG TCP performances

   - Reduce compound page head access for zero-copy data transfers

   - RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
     possible

   - Threaded NAPI improvements, adding defer skb free support and
     unneeded softirq avoidance

   - Address dst_entry reference count scalability issues, via false
     sharing avoidance and optimize refcount tracking

   - Add lockless accesses annotation to sk_err[_soft]

   - Optimize again the skb struct layout

   - Extends the skb drop reasons to make it usable by multiple
     subsystems

   - Better const qualifier awareness for socket casts

  BPF:

   - Add skb and XDP typed dynptrs which allow BPF programs for more
     ergonomic and less brittle iteration through data and
     variable-sized accesses

   - Add a new BPF netfilter program type and minimal support to hook
     BPF programs to netfilter hooks such as prerouting or forward

   - Add more precise memory usage reporting for all BPF map types

   - Adds support for using {FOU,GUE} encap with an ipip device
     operating in collect_md mode and add a set of BPF kfuncs for
     controlling encap params

   - Allow BPF programs to detect at load time whether a particular
     kfunc exists or not, and also add support for this in light
     skeleton

   - Bigger batch of BPF verifier improvements to prepare for upcoming
     BPF open-coded iterators allowing for less restrictive looping
     capabilities

   - Rework RCU enforcement in the verifier, add kptr_rcu and enforce
     BPF programs to NULL-check before passing such pointers into kfunc

   - Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
     in local storage maps

   - Enable RCU semantics for task BPF kptrs and allow referenced kptr
     tasks to be stored in BPF maps

   - Add support for refcounted local kptrs to the verifier for allowing
     shared ownership, useful for adding a node to both the BPF list and
     rbtree

   - Add BPF verifier support for ST instructions in
     convert_ctx_access() which will help new -mcpu=v4 clang flag to
     start emitting them

   - Add ARM32 USDT support to libbpf

   - Improve bpftool's visual program dump which produces the control
     flow graph in a DOT format by adding C source inline annotations

  Protocols:

   - IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
     indicates the provenance of the IP address

   - IPv6: optimize route lookup, dropping unneeded R/W lock acquisition

   - Add the handshake upcall mechanism, allowing the user-space to
     implement generic TLS handshake on kernel's behalf

   - Bridge: support per-{Port, VLAN} neighbor suppression, increasing
     resilience to nodes failures

   - SCTP: add support for Fair Capacity and Weighted Fair Queueing
     schedulers

   - MPTCP: delay first subflow allocation up to its first usage. This
     will allow for later better LSM interaction

   - xfrm: Remove inner/outer modes from input/output path. These are
     not needed anymore

   - WiFi:
      - reduced neighbor report (RNR) handling for AP mode
      - HW timestamping support
      - support for randomized auth/deauth TA for PASN privacy
      - per-link debugfs for multi-link
      - TC offload support for mac80211 drivers
      - mac80211 mesh fast-xmit and fast-rx support
      - enable Wi-Fi 7 (EHT) mesh support

  Netfilter:

   - Add nf_tables 'brouting' support, to force a packet to be routed
     instead of being bridged

   - Update bridge netfilter and ovs conntrack helpers to handle IPv6
     Jumbo packets properly, i.e. fetch the packet length from
     hop-by-hop extension header. This is needed for BIT TCP support

   - The iptables 32bit compat interface isn't compiled in by default
     anymore

   - Move ip(6)tables builtin icmp matches to the udptcp one. This has
     the advantage that icmp/icmpv6 match doesn't load the
     iptables/ip6tables modules anymore when iptables-nft is used

   - Extended netlink error report for netdevice in flowtables and
     netdev/chains. Allow for incrementally add/delete devices to netdev
     basechain. Allow to create netdev chain without device

  Driver API:

   - Remove redundant Device Control Error Reporting Enable, as PCI core
     has already error reporting enabled at enumeration time

   - Move Multicast DB netlink handlers to core, allowing devices other
     then bridge to use them

   - Allow the page_pool to directly recycle the pages from safely
     localized NAPI

   - Implement lockless TX queue stop/wake combo macros, allowing for
     further code de-duplication and sanitization

   - Add YNL support for user headers and struct attrs

   - Add partial YNL specification for devlink

   - Add partial YNL specification for ethtool

   - Add tc-mqprio and tc-taprio support for preemptible traffic classes

   - Add tx push buf len param to ethtool, specifies the maximum number
     of bytes of a transmitted packet a driver can push directly to the
     underlying device

   - Add basic LED support for switch/phy

   - Add NAPI documentation, stop relaying on external links

   - Convert dsa_master_ioctl() to netdev notifier. This is a
     preparatory work to make the hardware timestamping layer selectable
     by user space

   - Add transceiver support and improve the error messages for CAN-FD
     controllers

  New hardware / drivers:

   - Ethernet:
      - AMD/Pensando core device support
      - MediaTek MT7981 SoC
      - MediaTek MT7988 SoC
      - Broadcom BCM53134 embedded switch
      - Texas Instruments CPSW9G ethernet switch
      - Qualcomm EMAC3 DWMAC ethernet
      - StarFive JH7110 SoC
      - NXP CBTX ethernet PHY

   - WiFi:
      - Apple M1 Pro/Max devices
      - RealTek rtl8710bu/rtl8188gu
      - RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset

   - Bluetooth:
      - Realtek RTL8821CS, RTL8851B, RTL8852BS
      - Mediatek MT7663, MT7922
      - NXP w8997
      - Actions Semi ATS2851
      - QTI WCN6855
      - Marvell 88W8997

   - Can:
      - STMicroelectronics bxcan stm32f429

  Drivers:

   - Ethernet NICs:
      - Intel (1G, icg):
         - add tracking and reporting of QBV config errors
         - add support for configuring max SDU for each Tx queue
      - Intel (100G, ice):
         - refactor mailbox overflow detection to support Scalable IOV
         - GNSS interface optimization
      - Intel (i40e):
         - support XDP multi-buffer
      - nVidia/Mellanox:
         - add the support for linux bridge multicast offload
         - enable TC offload for egress and engress MACVLAN over bond
         - add support for VxLAN GBP encap/decap flows offload
         - extend packet offload to fully support libreswan
         - support tunnel mode in mlx5 IPsec packet offload
         - extend XDP multi-buffer support
         - support MACsec VLAN offload
         - add support for dynamic msix vectors allocation
         - drop RX page_cache and fully use page_pool
         - implement thermal zone to report NIC temperature
      - Netronome/Corigine:
         - add support for multi-zone conntrack offload
      - Solarflare/Xilinx:
         - support offloading TC VLAN push/pop actions to the MAE
         - support TC decap rules
         - support unicast PTP

   - Other NICs:
      - Broadcom (bnxt): enforce software based freq adjustments only on
        shared PHC NIC
      - RealTek (r8169): refactor to addess ASPM issues during NAPI poll
      - Micrel (lan8841): add support for PTP_PF_PEROUT
      - Cadence (macb): enable PTP unicast
      - Engleder (tsnep): add XDP socket zero-copy support
      - virtio-net: implement exact header length guest feature
      - veth: add page_pool support for page recycling
      - vxlan: add MDB data path support
      - gve: add XDP support for GQI-QPL format
      - geneve: accept every ethertype
      - macvlan: allow some packets to bypass broadcast queue
      - mana: add support for jumbo frame

   - Ethernet high-speed switches:
      - Microchip (sparx5): Add support for TC flower templates

   - Ethernet embedded switches:
      - Broadcom (b54):
         - configure 6318 and 63268 RGMII ports
      - Marvell (mv88e6xxx):
         - faster C45 bus scan
      - Microchip:
         - lan966x:
            - add support for IS1 VCAP
            - better TX/RX from/to CPU performances
         - ksz9477: add ETS Qdisc support
         - ksz8: enhance static MAC table operations and error handling
         - sama7g5: add PTP capability
      - NXP (ocelot):
         - add support for external ports
         - add support for preemptible traffic classes
      - Texas Instruments:
         - add CPSWxG SGMII support for J7200 and J721E

   - Intel WiFi (iwlwifi):
      - preparation for Wi-Fi 7 EHT and multi-link support
      - EHT (Wi-Fi 7) sniffer support
      - hardware timestamping support for some devices/firwmares
      - TX beacon protection on newer hardware

   - Qualcomm 802.11ax WiFi (ath11k):
      - MU-MIMO parameters support
      - ack signal support for management packets

   - RealTek WiFi (rtw88):
      - SDIO bus support
      - better support for some SDIO devices (e.g. MAC address from
        efuse)

   - RealTek WiFi (rtw89):
      - HW scan support for 8852b
      - better support for 6 GHz scanning
      - support for various newer firmware APIs
      - framework firmware backwards compatibility

   - MediaTek WiFi (mt76):
      - P2P support
      - mesh A-MSDU support
      - EHT (Wi-Fi 7) support
      - coredump support"

* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
  net: phy: hide the PHYLIB_LEDS knob
  net: phy: marvell-88x2222: remove unnecessary (void*) conversions
  tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
  net: amd: Fix link leak when verifying config failed
  net: phy: marvell: Fix inconsistent indenting in led_blink_set
  lan966x: Don't use xdp_frame when action is XDP_TX
  tsnep: Add XDP socket zero-copy TX support
  tsnep: Add XDP socket zero-copy RX support
  tsnep: Move skb receive action to separate function
  tsnep: Add functions for queue enable/disable
  tsnep: Rework TX/RX queue initialization
  tsnep: Replace modulo operation with mask
  net: phy: dp83867: Add led_brightness_set support
  net: phy: Fix reading LED reg property
  drivers: nfc: nfcsim: remove return value check of `dev_dir`
  net: phy: dp83867: Remove unnecessary (void*) conversions
  net: ethtool: coalesce: try to make user settings stick twice
  net: mana: Check if netdev/napi_alloc_frag returns single page
  net: mana: Rename mana_refill_rxoob and remove some empty lines
  net: veth: add page_pool stats
  ...
2023-04-26 16:07:23 -07:00
Dave Chinner
9419092fb2 xfs: fix livelock in delayed allocation at ENOSPC
On a filesystem with a non-zero stripe unit and a large sequential
write, delayed allocation will set a minimum allocation length of
the stripe unit. If allocation fails because there are no extents
long enough for an aligned minlen allocation, it is supposed to
fall back to unaligned allocation which allows single block extents
to be allocated.

When the allocator code was rewritting in the 6.3 cycle, this
fallback was broken - the old code used args->fsbno as the both the
allocation target and the allocation result, the new code passes the
target as a separate parameter. The conversion didn't handle the
aligned->unaligned fallback path correctly - it reset args->fsbno to
the target fsbno on failure which broke allocation failure detection
in the high level code and so it never fell back to unaligned
allocations.

This resulted in a loop in writeback trying to allocate an aligned
block, getting a false positive success, trying to insert the result
in the BMBT. This did nothing because the extent already was in the
BMBT (merge results in an unchanged extent) and so it returned the
prior extent to the conversion code as the current iomap.

Because the iomap returned didn't cover the offset we tried to map,
xfs_convert_blocks() then retries the allocation, which fails in the
same way and now we have a livelock.

Reported-and-tested-by: Brian Foster <bfoster@redhat.com>
Fixes: 8584332709 ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-04-27 09:02:11 +10:00
Linus Torvalds
5b9a7bb72f for-6.4/io_uring-2023-04-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4
 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj
 wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5
 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50
 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8
 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH
 vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB
 iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo
 gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI
 MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY
 BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX
 +YERd/ndAw==
 =Erfk
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Cleanup of the io-wq per-node mapping, notably getting rid of it so
   we just have a single io_wq entry per ring (Breno)

 - Followup to the above, move accounting to io_wq as well and
   completely drop struct io_wqe (Gabriel)

 - Enable KASAN for the internal io_uring caches (Breno)

 - Add support for multishot timeouts. Some applications use timeouts to
   wake someone waiting on completion entries, and this makes it a bit
   easier to just have a recurring timer rather than needing to rearm it
   every time (David)

 - Support archs that have shared cache coloring between userspace and
   the kernel, and hence have strict address requirements for mmap'ing
   the ring into userspace. This should only be parisc/hppa. (Helge, me)

 - XFS has supported O_DIRECT writes without needing to lock the inode
   exclusively for a long time, and ext4 now supports it as well. This
   is true for the common cases of not extending the file size. Flag the
   fs as having that feature, and utilize that to avoid serializing
   those writes in io_uring (me)

 - Enable completion batching for uring commands (me)

 - Revert patch adding io_uring restriction to what can be GUP mapped or
   not. This does not belong in io_uring, as io_uring isn't really
   special in this regard. Since this is also getting in the way of
   cleanups and improvements to the GUP code, get rid of if (me)

 - A few series greatly reducing the complexity of registered resources,
   like buffers or files. Not only does this clean up the code a lot,
   the simplified code is also a LOT more efficient (Pavel)

 - Series optimizing how we wait for events and run task_work related to
   it (Pavel)

 - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel)

 - Misc cleanups and improvements (Pavel, me)

* tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits)
  Revert "io_uring/rsrc: disallow multi-source reg buffers"
  io_uring: add support for multishot timeouts
  io_uring/rsrc: disassociate nodes and rsrc_data
  io_uring/rsrc: devirtualise rsrc put callbacks
  io_uring/rsrc: pass node to io_rsrc_put_work()
  io_uring/rsrc: inline io_rsrc_put_work()
  io_uring/rsrc: add empty flag in rsrc_node
  io_uring/rsrc: merge nodes and io_rsrc_put
  io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal
  io_uring/rsrc: remove unused io_rsrc_node::llist
  io_uring/rsrc: refactor io_queue_rsrc_removal
  io_uring/rsrc: simplify single file node switching
  io_uring/rsrc: clean up __io_sqe_buffers_update()
  io_uring/rsrc: inline switch_start fast path
  io_uring/rsrc: remove rsrc_data refs
  io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce
  io_uring/rsrc: use wq for quiescing
  io_uring/rsrc: refactor io_rsrc_ref_quiesce
  io_uring/rsrc: remove io_rsrc_node::done
  io_uring/rsrc: use nospec'ed indexes
  ...
2023-04-26 12:40:31 -07:00
Linus Torvalds
5c7ecada25 f2fs update for 6.4-rc1
In this round, we've mainly modified to support non-power-of-two zone size,
 which is not required for f2fs by design. In order to avoid arch dependency,
 we refactored the messy rb_entry structure shared across different extent_cache.
 In addition to the improvement, we've also fixed several subtle bugs and
 error cases.
 
 Enhancement:
 - support non-power-of-two zone size for zoned device
 - remove sharing the rb_entry structure in extent cache
 - refactor f2fs_gc to call checkpoint in urgent condition
 - support iopoll
 
 Bug fix:
 - fix potential corruption when moving a directory
 - fix to avoid use-after-free for cached IPU bio
 - fix the folio private usage
 - avoid kernel warnings or panics in the cp_error case
 - fix to recover quota data correctly
 - fix some bugs in atomic operations
 - fix system crash due to lack of free space in LFS
 - fix null pointer panic in tracepoint in __replace_atomic_write_block
 - fix iostat lock protection
 - fix scheduling while atomic in decompression path
 - preserve direct write semantics when buffering is forced
 - fix to call f2fs_wait_on_page_writeback() in f2fs_write_raw_pages()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmRIHMMACgkQQBSofoJI
 UNI3Mw//eQvxUXaWtCjTJQtXPotaah6ZcvnMMtfl6Cf0Z8Sq4L9q4yQMA16MXbLU
 zz3cexKXIHTzqWfqFLunaj6cmH/THAY3L3fTkFhE+dx1H2IaFprGLW3H8hW/58tr
 j9365RPVY2d/3agB1KikTj6FQ5OTGibkZagjsC28VmQ30VLIm+4jnHdIoX92UP+k
 87JQ/fbG2XAiHX/ifcVuMXY3++db9jaZahsmhdJ1LNTZzztO241RzrNoBsLcSwSZ
 DkPgJXARQzFNDRfveRXSbV3ygR9C62pNITtSGC86ZRLyoAmko9se+nMEFH7YEkUy
 Rhf0Qzq2Gy6ThiVo8ZjuLvNycF0oj3OefX1PQLT6vzkv3Sv4Yij48bN1HqPdYsKH
 3hPZd2V7A3o2LCJPPPNjZ/6nuKhrX+kU33FjUrxiYqz7Lt74j70vVEHQ7vSCGkrQ
 YpQYVXFr1hdejdemCpwgdvcEegNlV0GfqCG5KL1f7jJiGHfvxZnOEJ3x9dCQFTIE
 xVoWTzw9pbmBkTudrFNVRlX2RSQYSvgLFwUhQ3WE0qNu0mUMP+4E+50iKHYraJ7R
 W1TajZ+ttUJAnZ076vGGEOxabefEdtReOtdstohcJlDaGm5sI9I9CXQRvY4ZSymW
 l7ZHY/b+/IzP+/fLEX7DgTnWip37H14FImvjYRGpSEzc6sXiOUU=
 =qHTl
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs update from Jaegeuk Kim:
 "In this round, we've mainly modified to support non-power-of-two zone
  size, which is not required for f2fs by design. In order to avoid arch
  dependency, we refactored the messy rb_entry structure shared across
  different extent_cache. In addition to the improvement, we've also
  fixed several subtle bugs and error cases.

  Enhancements:
   - support non-power-of-two zone size for zoned device
   - remove sharing the rb_entry structure in extent cache
   - refactor f2fs_gc to call checkpoint in urgent condition
   - support iopoll

  Bug fixes:
   - fix potential corruption when moving a directory
   - fix to avoid use-after-free for cached IPU bio
   - fix the folio private usage
   - avoid kernel warnings or panics in the cp_error case
   - fix to recover quota data correctly
   - fix some bugs in atomic operations
   - fix system crash due to lack of free space in LFS
   - fix null pointer panic in tracepoint in __replace_atomic_write_block
   - fix iostat lock protection
   - fix scheduling while atomic in decompression path
   - preserve direct write semantics when buffering is forced
   - fix to call f2fs_wait_on_page_writeback() in f2fs_write_raw_pages()"

* tag 'f2fs-for-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
  f2fs: remove unnessary comment in __may_age_extent_tree
  f2fs: allocate node blocks for atomic write block replacement
  f2fs: use cow inode data when updating atomic write
  f2fs: remove power-of-two limitation of zoned device
  f2fs: allocate trace path buffer from names_cache
  f2fs: add has_enough_free_secs()
  f2fs: relax sanity check if checkpoint is corrupted
  f2fs: refactor f2fs_gc to call checkpoint in urgent condition
  f2fs: remove folio_detach_private() in .invalidate_folio and .release_folio
  f2fs: remove bulk remove_proc_entry() and unnecessary kobject_del()
  f2fs: support iopoll method
  f2fs: remove batched_trim_sections node description
  f2fs: fix to check return value of inc_valid_block_count()
  f2fs: fix to check return value of f2fs_do_truncate_blocks()
  f2fs: fix passing relative address when discard zones
  f2fs: fix potential corruption when moving a directory
  f2fs: add radix_tree_preload_end in error case
  f2fs: fix to recover quota data correctly
  f2fs: fix to check readonly condition correctly
  docs: f2fs: Correct instruction to disable checkpoint
  ...
2023-04-26 09:42:10 -07:00
Linus Torvalds
fbfaf03eba dlm for 6.4
Remove some unused features (related to lock timeouts) that have been
 previously scheduled for removal.
 
 Fix a bug where the pending callback flag would be incorrectly cleared,
 which could potentially result in missing a completion callback.
 
 Use an unbound workqueue for dlm socket handling so that socket
 operations can be processed with less delay.
 
 Fix possible lockspace join connection errors with large clusters (e.g.
 over 16 nodes) caused by a small socket backlog setting.
 
 Use atomic bit ops for internal flags to help avoid mistakes copying
 flag values from messages.
 
 Fix recently introduced bug where memory for lvb data could be
 unnecessarily allocated for a lock.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJkR/JxAAoJEDgbc8f8gGmq2p0P/j3OcxdOXkx9LVpKdjuAhK/M
 A4Q46J9Z/Ytl4vT7x6i4x2fZXk8cTOkQZ3TkYMGbIc8LV4FZpO4/pIP+GnSDkmwY
 1/AJgauskoewPuCNe+SUCawFMKPhUCXKOXNhyfAHlBS+NS8iIJhVKOP7mOMUawKs
 4WiI6fEFBig2z56LHgnAIdzx1UGJ+AE1XlwEca1jVvtMYW03HX9h5Tt1BP1NeAKx
 oPrYQaUsctHlAI+fo0PnjimGcjXa2av9YNnTHpqVfYw6ntSdriEXQVosX/tcRV8E
 dVXmQP2Ox20mNRu9vHkAK8x2LQLj7HuwmohdUvvDFp1cIEglbiRdF6IolyhD2Up1
 ETeJ5EU/6ktJOHrOewUCTcJoLrAcwmklgj9fzvW7fPN4zNsEUCYgoenuSK8bW+la
 jDbO26Rh+cMg2pC6Rz8z8azadwTQlFXDjvTDOnmb2jr+tfIEWj/VTWCTQ4Un3EEg
 4xJ1dCvlLCRkYoDN3QTF9TxapSTohtjtQl85yrLULhdUWhQuRQd93rHeNgwDf8Ys
 NUzuL+IR+SEfvVyRezYfdyUBT26ld8pGVzJqlKvXUXTPBypnw9aoGAYhXx3A8fYo
 QLIz7lc0/pvkWaQ9zNyedf8LDIYCbZuTcMpmTEgLy+q6BTJWS2ZJT5oeNvGvnRw5
 Zc7iUmGNQ5XKuVEhtTFA
 =900J
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:

 - Remove some unused features (related to lock timeouts) that have been
   previously scheduled for removal

 - Fix a bug where the pending callback flag would be incorrectly
   cleared, which could potentially result in missing a completion
   callback

 - Use an unbound workqueue for dlm socket handling so that socket
   operations can be processed with less delay

 - Fix possible lockspace join connection errors with large clusters
   (e.g. over 16 nodes) caused by a small socket backlog setting

 - Use atomic bit ops for internal flags to help avoid mistakes copying
   flag values from messages

 - Fix recently introduced bug where memory for lvb data could be
   unnecessarily allocated for a lock

* tag 'dlm-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: stop unnecessarily filling zero ms_extra bytes
  fs: dlm: switch lkb_sbflags to atomic ops
  fs: dlm: rsb hash table flag value to atomic ops
  fs: dlm: move internal flags to atomic ops
  fs: dlm: change dflags to use atomic bits
  fs: dlm: store lkb distributed flags into own value
  fs: dlm: remove DLM_IFL_LOCAL_MS flag
  fs: dlm: rename stub to local message flag
  fs: dlm: remove deprecated code parts
  DLM: increase socket backlog to avoid hangs with 16 nodes
  fs: dlm: add unbound flag to dlm_io workqueue
  fs: dlm: fix DLM_IFL_CB_PENDING gets overwritten
2023-04-26 09:36:55 -07:00
Linus Torvalds
e0fcc9c68d gfs2 fixes
- Fix revoke processing at unmount and on read-only remount.
 
 - Refuse reading in inodes with an impossible indirect block height.
 
 - Various minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmRH00EUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpfSw//a0IG/6tVfPwlCPd1qPKN/JbxHQVa
 3JtRJXr5+RJ3vIKmJb+AFfEiPHRKgCk28d9hMbW8j3ieKSy/dPJe8ARIZmu8phpK
 dOSiue87v4Pz6Tz2qpb6KgzUENohsqjTy0mEAEWxxZb7mWxUqj7t057hiEp/9NX6
 jXcfoNInMxnf2HX3d11LoHXYOX37xvEjyhbZ4OmPn+xzmYQcfhuO96CTrIOe7lKe
 U8dbFzDwtoUxYljb16do0OLva/YxaNdH33yCzrixVi9ki1aCIH6A2paznlUCzgF1
 wzOhzV3kbz5e4flov34X1Jnd5RVkBmzMMM5bkcJzNg6JOSGL0bLg0PGYiXASAfje
 KmsNVvq0yMWLLQIr2gMcEVI1IVDxkNKJruizbFfUs/SWHI55eDUZtI5jjA0v0cCw
 5HRWialgCNdC9Wd/4HW+DfpztwxR5+j52TCtqgyRGQcGvkNtwGMeGgh9/sreuHhr
 7Lod7LpBVmEzlXD97VIcJUcU2vhXtGA25pKXFHCilXh/MdA2UpmBWkVUkNVS/awr
 ZTzoi/WkUIWLLVHxHruLOoJBWit8Mhnuxh1rlY0WD3wDEbnwCZKX8ziTYIsVvkWc
 d/xjjafrolC6g1m/RoW3rXO/MLXaE0rRy4nb3MLhwsqaAQyZE6puwo40Kj/xNQqf
 68kLauI2QFNpFm8=
 =l1E9
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Andreas Gruenbacher:

 - Fix revoke processing at unmount and on read-only remount

 - Refuse reading in inodes with an impossible indirect block height

 - Various minor cleanups

* tag 'gfs2-v6.3-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: gfs2_ail_empty_gl no log flush on error
  gfs2: Issue message when revokes cannot be written
  gfs2: Perform second log flush in gfs2_make_fs_ro
  gfs2: return errors from gfs2_ail_empty_gl
  gfs2: Move variable assignment behind a null pointer check in inode_go_dump
  gfs2: Use gfs2_holder_initialized for jindex
  gfs2: Eliminate gfs2_trim_blocks
  gfs2: Fix inode height consistency check
  gfs2: Remove ghs[] from gfs2_unlink
  gfs2: Remove ghs[] from gfs2_link
  gfs2: Remove duplicate i_nlink check from gfs2_link()
2023-04-26 09:28:15 -07:00
Linus Torvalds
85d7ab2463 for-6.4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmRHC3gACgkQxWXV+ddt
 WDvI/A//ZzREEE0wNexbuidoTacDVXVJ6LBb2K1eP+HUKfsmd6GYWQDJ9x/ExpKb
 T1ehLibCYWLeYxEREFbjXI3x9G8mrvLzvzsqXs/MzJPkmEF1igPddFztidBwvLQH
 ey/Bh+cra2bpVhRhkX0Cf09/q/YWp17/d14ZxxW60PMfyhx8RWXejXhHkulOPVv8
 +3FL8E0kc2Zjx9ioUwOy/i18LR6YzsCNVXoHzUZuWyWM4A7NG2TZR6FhuLSjlWSZ
 3RAnROwr+8i5nR0xchcyYaVMO2LMbqH6mBtHnXCtxCr+4pFrfrvKym+CQco/Xriz
 v1y/xDc23XeYXLCVhb0beJ6uRcjaM9+gvDF1oVBSJEv6V7sQr/tEGo/8QRehfEfT
 FTro7Lf89R1GOa1IBSkv/T5S25d9LlIID3/g7PbcUBtXNKvLAjDAGTH9bzL4HS5x
 /MKwN80GvaGs1KyEfUndbVPIpAwNFDYZPHM7nw1x+JTkIBcHgfjRyAMAC9jrJd0D
 730W04c+0nXZtQGtKKsxc3U8y4ewzSJAKx9t7Vgo7+1P6dSRnzvJee3x/5kXV9Yn
 MhxxzYDfIN9EcWbASdSm11gY5WZdG3an609pO7nc1T2K4Tuo0SPs4xOR7c3xuZrY
 MN5z3QFWyI2ustUuTG+nsd5J81j76DEmj5ymWQfG3SBplTneDM0=
 =Jt7p
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Mostly core changes and cleanups, some notable fixes and two
  performance improvements in directory logging.

  The IO path cleanups are removing or refactoring old code, scrub main
  loop has been completely rewritten also refactoring old code.

  There are some changes to non-btrfs code, mostly trivial, the cgroup
  punt bio logic is only moved from generic code.

  Performance improvements:

   - improve logging changes in a directory during one transaction,
     avoid iterating over items and reduce lock contention (fsync time
     4x lower)

   - when logging directory entries during one transaction, reduce
     locking of subvolume trees by checking tree-log instead
     (improvement in throughput and latency for concurrent access to a
     subvolume)

  Notable fixes:

   - dev-replace:
      - properly honor read mode when requested to avoid reading from
        source device
      - target device won't be used for eventual read repair, this is
        unreliable for NODATASUM files
      - when there are unpaired (and unrepairable) metadata during
        replace, exit early with error and don't try to finish whole
        operation

   - scrub ioctl properly rejects unknown flags

   - fix global block reserve calculations

   - fix partial direct io write when there's a page fault in the
     middle, iomap will try to continue with partial request but the
     btrfs part did not match that, this can lead to zeros written
     instead of data

  Core changes:

   - io path:
      - continued cleanups and refactoring around bio handling
      - extent io submit path simplifications and cleanups
      - flush write path simplifications and cleanups
      - rework logic of passing sync mode of bio, with further cleanups

   - rewrite scrub code flow, restructure how the stripes are enumerated
     and verified in a more unified way

   - allow to set lower threshold for block group reclaim in debug mode
     to aid zoned mode testing

   - remove obsolete time-based delayed ref throttling logic when
     truncating items

   - DREW locks are not using percpu variables anymore

   - more warning fixes (-Wmaybe-uninitialized)

   - u64 division simplifications

   - error handling improvements

  Non-btrfs code changes:

   - push cgroup punt bio logic to btrfs code (there was no other user
     of that), the functionality can be now selected separately by
     BLK_CGROUP_PUNT_BIO

   - crc32c_impl removed after removing last uses in btrfs code

   - add btrfs_assertfail() to objtool table"

* tag 'for-6.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
  btrfs: mark btrfs_assertfail() __noreturn
  btrfs: fix uninitialized variable warnings
  btrfs: use log root when iterating over index keys when logging directory
  btrfs: avoid iterating over all indexes when logging directory
  btrfs: dev-replace: error out if we have unrepaired metadata error during
  btrfs: remove pointless loop at btrfs_get_next_valid_item()
  btrfs: scrub: reject unsupported scrub flags
  btrfs: reinterpret async discard iops_limit=0 as no delay
  btrfs: set default discard iops_limit to 1000
  btrfs: remove unused raid56 functions which were dedicated for scrub
  btrfs: scrub: remove scrub_bio structure
  btrfs: scrub: remove scrub_block and scrub_sector structures
  btrfs: scrub: remove the old scrub recheck code
  btrfs: scrub: remove the old writeback infrastructure
  btrfs: scrub: remove scrub_parity structure
  btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub
  btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure
  btrfs: scrub: introduce helper to queue a stripe for scrub
  btrfs: scrub: introduce error reporting functionality for scrub_stripe
  btrfs: scrub: introduce a writeback helper for scrub_stripe
  ...
2023-04-26 09:13:44 -07:00
Linus Torvalds
94fc079266 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmRHq24ACgkQnJ2qBz9k
 QNkowggAuyZ313GrAUNDH2Cvu3PE804LpW6k27VxtIxmgtMSZvuOED0fC99Xt+3y
 SnQrJKfw4HVKIDT2sx3ECaiKa+v1UBMG+KmaZ5u5qLjxfjY8YHy6TbHMwY/NV2Aj
 jGgJYN+5E9AJqStzuu4C9fUzgOgvi4IzgNX4hp0WbESVk6jbDqW2KGNAv6IRM9ku
 qDg54aL8W50uDMgUJysqe180SlDgeYGPBg1OfA7xrzMaBpEK3z5+v52ZlpGNguDP
 MZMGwZgFLCP2R5Z9+zqhyK9UuIHUDNE05y20+Hsm1AxlI0iVVt+CahQ9T/6P9OBT
 QQ+vnepQR6ORtFBlPCVgyiTQgsO3Tw==
 =+MkS
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2, reiserfs, udf, and quota updates from Jan Kara:
 "A couple of small fixes and cleanups for ext2, udf, reiserfs, and
  quota.

  The biggest change is making CONFIG_PRINT_QUOTA_WARNING depend on
  BROKEN with an outlook for removing it completely in an year or so"

* tag 'fs_for_v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: mark PRINT_QUOTA_WARNING as BROKEN
  quota: update Kconfig comment
  reiserfs: remove unused iter variable
  quota: Use register_sysctl_init() for registering fs_dqstats_table
  reiserfs: remove unused sched_count variable
  ext2: remove redundant assignment to pointer end
  quota: make dquot_set_dqinfo return errors from ->write_info
  quota: fixup *_write_file_info() to return proper error code
  quota: simplify two-level sysctl registration for fs_dqstats_table
  udf: use wrapper i_blocksize() in udf_discard_prealloc()
  udf: Use folios in udf_adinicb_writepage()
  ext2: Check block size validity during mount
  ext2: Correct maximum ext2 filesystem block size
2023-04-26 09:07:46 -07:00
Linus Torvalds
0cfcde1faf There are a number of major cleanups in ext4 this cycle:
* The data=journal writepath has been significantly cleaned up and
   simplified, and reduces a large number of data=journal special cases
   by Jan Kara.
 
 * Ojaswin Muhoo has replaced linked list used to track extents that
   have been used for inode preallocation with a red-black tree in the
   multi-block allocator.  This improves performance for workloads
   which do a large number of random allocating writes.
 
 * Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
   multi-block allocator.
 
 * Matthew wilcox has converted the code paths for reading and writing
   ext4 pages to use folios.
 
 * Jason Yan has continued to factor out ext4_fill_super() into smaller
   functions for improve ease of maintenance and comprehension.
 
 * Josh Triplett has created an uapi header for ext4 userspace API's.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmRHS3IACgkQ8vlZVpUN
 gaNN7AgAnFiWfk4UqKpBsUL5iQKJgf2K4tjlNXgPd6ghNns0IdFEyeWSHhr6KLv/
 SQeoMMyiWaUcTvZs9DokD8U/9M1ELPUiE9W5c9GxJjM86SXp8BlLYSZTiRoNHzGJ
 noQpvikj4qTRviK0rA3q5ICTP2eh1ECHMFJy2wcsZQgwnBelUejQHsTGtOwSvFWF
 8wMdfuVtAFDZJjzOxzVKfHP22R5HVRWlAU7P1d97qKjBj4Se3+QchI+zdcIrmU9A
 tTmCXj57NpTDyLjS9dIDmLygtTv93lOzOmZS8glw0BFonPcd3ObI4RHVxR+V9xu1
 lN13YYgBrK6yfApn9L5XL/31PuLfbg==
 =VLBx
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "There are a number of major cleanups in ext4 this cycle:

   - The data=journal writepath has been significantly cleaned up and
     simplified, and reduces a large number of data=journal special
     cases by Jan Kara.

   - Ojaswin Muhoo has replaced linked list used to track extents that
     have been used for inode preallocation with a red-black tree in the
     multi-block allocator. This improves performance for workloads
     which do a large number of random allocating writes.

   - Thanks to Kemeng Shi for a lot of cleanup and bug fixes in the
     multi-block allocator.

   - Matthew wilcox has converted the code paths for reading and writing
     ext4 pages to use folios.

   - Jason Yan has continued to factor out ext4_fill_super() into
     smaller functions for improve ease of maintenance and
     comprehension.

   - Josh Triplett has created an uapi header for ext4 userspace API's"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (105 commits)
  ext4: Add a uapi header for ext4 userspace APIs
  ext4: remove useless conditional branch code
  ext4: remove unneeded check of nr_to_submit
  ext4: move dax and encrypt checking into ext4_check_feature_compatibility()
  ext4: factor out ext4_block_group_meta_init()
  ext4: move s_reserved_gdt_blocks and addressable checking into ext4_check_geometry()
  ext4: rename two functions with 'check'
  ext4: factor out ext4_flex_groups_free()
  ext4: use ext4_group_desc_free() in ext4_put_super() to save some duplicated code
  ext4: factor out ext4_percpu_param_init() and ext4_percpu_param_destroy()
  ext4: factor out ext4_hash_info_init()
  Revert "ext4: Fix warnings when freezing filesystem with journaled data"
  ext4: Update comment in mpage_prepare_extent_to_map()
  ext4: Simplify handling of journalled data in ext4_bmap()
  ext4: Drop special handling of journalled data from ext4_quota_on()
  ext4: Drop special handling of journalled data from ext4_evict_inode()
  ext4: Fix special handling of journalled data from extent zeroing
  ext4: Drop special handling of journalled data from extent shifting operations
  ext4: Drop special handling of journalled data from ext4_sync_file()
  ext4: Commit transaction before writing back pages in data=journal mode
  ...
2023-04-26 08:57:41 -07:00
Linus Torvalds
c3558a6b2a fsverity updates for 6.4
Several cleanups and fixes for fs/verity/, including a couple minor
 fixes to the changes in 6.3 that added support for Merkle tree block
 sizes less than the page size.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZEdweBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8EDAP91ondqa2EZ9C87NsBby+lPdo+wkekh
 VQ4EMxrg1yhUXQD/fvOp+NeSpnqW8BG3Q78KI7Rgvrpr6CtBF4Lfg4jsWQo=
 =EEFA
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux

Pull fsverity updates from Eric Biggers:
 "Several cleanups and fixes for fs/verity/, including a couple minor
  fixes to the changes in 6.3 that added support for Merkle tree block
  sizes less than the page size"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
  fsverity: reject FS_IOC_ENABLE_VERITY on mode 3 fds
  fsverity: explicitly check for buffer overflow in build_merkle_tree()
  fsverity: use WARN_ON_ONCE instead of WARN_ON
  fs-verity: simplify sysctls with register_sysctl()
  fs/buffer.c: use b_folio for fsverity work
2023-04-26 08:51:51 -07:00
Linus Torvalds
dbe0e78d0e fscrypt updates for 6.4
A few cleanups for fs/crypto/, and another patch to prepare for the
 upcoming CephFS encryption support.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZEdv1xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/jyAPwNyGlOQGuiFwsu8Mjtppkx4NUHK83K
 gp0N/aT2So6C4wD+IlqDVXiTinnmKOoMs/orBsML4Ub/8etuZZDGu4FipgY=
 =X+dx
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux

Pull fscrypt updates from Eric Biggers:
 "A few cleanups for fs/crypto/, and another patch to prepare for the
  upcoming CephFS encryption support"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
  fscrypt: optimize fscrypt_initialize()
  fscrypt: use WARN_ON_ONCE instead of WARN_ON
  fscrypt: new helper function - fscrypt_prepare_lookup_partial()
  fs/buffer.c: use b_folio for fscrypt work
2023-04-26 08:37:51 -07:00
Jeff Layton
92e4a6733f nfsd: simplify the delayed disposal list code
When queueing a dispose list to the appropriate "freeme" lists, it
pointlessly queues the objects one at a time to an intermediate list.

Remove a few helpers and just open code a list_move to make it more
clear and efficient. Better document the resulting functions with
kerneldoc comments.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:01 -04:00
Chuck Lever
0f5162480b NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor()
There have been several bugs over the years where the NFSD splice
actor has attempted to write outside the rq_pages array.

This is a "should never happen" condition, but if for some reason
the pipe splice actor should attempt to walk past the end of
rq_pages, it needs to terminate the READ operation to prevent
corruption of the pointer addresses in the fields just beyond the
array.

A server crash is thus prevented. Since the code is not behaving,
the READ operation returns -EIO to the client. None of the READ
payload data can be trusted if the splice actor isn't operating as
expected.

Suggested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2023-04-26 09:05:01 -04:00
NeilBrown
cf64b9bce9 SUNRPC: return proper error from get_expiry()
The get_expiry() function currently returns a timestamp, and uses the
special return value of 0 to indicate an error.

Unfortunately this causes a problem when 0 is the correct return value.

On a system with no RTC it is possible that the boot time will be seen
to be "3".  When exportfs probes to see if a particular filesystem
supports NFS export it tries to cache information with an expiry time of
"3".  The intention is for this to be "long in the past".  Even with no
RTC it will not be far in the future (at most a second or two) so this
is harmless.
But if the boot time happens to have been calculated to be "3", then
get_expiry will fail incorrectly as it converts the number to "seconds
since bootime" - 0.

To avoid this problem we change get_expiry() to report the error quite
separately from the expiry time.  The error is now the return value.
The expiry time is reported through a by-reference parameter.

Reported-by: Jerry Zhang <jerry@skydio.com>
Tested-by: Jerry Zhang <jerry@skydio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
2f90e18ffe lockd: add some client-side tracepoints
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
e59fb6749e nfs: move nfs_fhandle_hash to common include file
lockd needs to be able to hash filehandles for tracepoints. Move the
nfs_fhandle_hash() helper to a common nfs include file.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
244cc19196 lockd: server should unlock lock if client rejects the grant
Currently lockd just dequeues the block and ignores it if the client
sends a GRANT_RES with a status of nlm_lck_denied. That status is an
indicator that the client has rejected the lock, so the right thing to
do is to unlock the lock we were trying to grant.

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
2005f5b9c3 lockd: fix races in client GRANTED_MSG wait logic
After the wait for a grant is done (for whatever reason), nlmclnt_block
updates the status of the nlm_rqst with the status of the block. At the
point it does this, however, the block is still queued its status could
change at any time.

This is particularly a problem when the waiting task is signaled during
the wait. We can end up giving up on the lock just before the GRANTED_MSG
callback comes in, and accept it even though the lock request gets back
an error, leaving a dangling lock on the server.

Since the nlm_wait never lives beyond the end of nlmclnt_lock, put it on
the stack and add functions to allow us to enqueue and dequeue the
block. Enqueue it just before the lock/wait loop, and dequeue it
just after we exit the loop instead of waiting until the end of
the function. Also, scrape the status at the time that we dequeue it to
ensure that it's final.

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
f0aa4852e6 lockd: move struct nlm_wait to lockd.h
The next patch needs struct nlm_wait in fs/lockd/clntproc.c, so move
the definition to a shared header file. As an added clean-up, drop
the unused b_reclaim field.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:05:00 -04:00
Jeff Layton
bfca7a6f0c lockd: purge resources held on behalf of nlm clients when shutting down
It's easily possible for the server to have an outstanding lock when we
go to shut down. When that happens, we often get a warning like this in
the kernel log:

    lockd: couldn't shutdown host module for net f0000000!

This is because the shutdown procedures skip removing any hosts that
still have outstanding resources (locks). Eventually, things seem to get
cleaned up anyway, but the log message is unsettling, and server
shutdown doesn't seem to be working the way it was intended.

Ensure that we tear down any resources held on behalf of a client when
tearing one down for server shutdown.

Reported-by: Yongcheng Yang <yoyang@redhat.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2063818
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Chuck Lever
c4c649ab41 NFSD: Convert filecache to rhltable
While we were converting the nfs4_file hashtable to use the kernel's
resizable hashtable data structure, Neil Brown observed that the
list variant (rhltable) would be better for managing nfsd_file items
as well. The nfsd_file hash table will contain multiple entries for
the same inode -- these should be kept together on a list. And, it
could be possible for exotic or malicious client behavior to cause
the hash table to resize itself on every insertion.

A nice simplification is that rhltable_lookup() can return a list
that contains only nfsd_file items that match a given inode, which
enables us to eliminate specialized hash table helper functions and
use the default functions provided by the rhashtable implementation).

Since we are now storing nfsd_file items for the same inode on a
single list, that effectively reduces the number of hash entries
that have to be tracked in the hash table. The mininum bucket count
is therefore lowered.

Light testing with fstests generic/531 show no regressions.

Suggested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton
dcb779fcd4 nfsd: allow reaping files still under writeback
On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.

The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.

Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton
972cc0e092 nfsd: update comment over __nfsd_file_cache_purge
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton
b2ff1bd71d nfsd: don't take/put an extra reference when putting a file
The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:59 -04:00
Jeff Layton
b680cb9b73 nfsd: add some comments to nfsd_file_do_acquire
David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton
c6593366c0 nfsd: don't kill nfsd_files because of lease break error
An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton
d69b8dbfd0 nfsd: simplify test_bit return in NFSD_FILE_KEY_FULL comparator
test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton
6c31e4c988 nfsd: NFSD_FILE_KEY_INODE only needs to find GC'ed entries
Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.

Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Jeff Layton
b8bea9f6cd nfsd: don't open-code clear_and_wake_up_bit
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-04-26 09:04:58 -04:00
Paolo Abeni
c248b27cfc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-04-26 10:17:46 +02:00