mirror_ubuntu-kernels

mirror of https://git.proxmox.com/git/mirror_ubuntu-kernels.git synced 2025-11-26 22:40:39 +00:00

Author	SHA1	Message	Date
Darrick J. Wong	a24cae8fc1	xfs: reset rootdir extent size hint after growfsrt If growfsrt is run on a filesystem that doesn't have a rt volume, it's possible to change the rt extent size. If the root directory was previously set up with an inherited extent size hint and rtinherit, it's possible that the hint is no longer a multiple of the rt extent size. Although the verifiers don't complain about this, xfs_repair will, so if we detect this situation, log the root directory to clean it up. This is still racy, but it's better than nothing. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
Darrick J. Wong	16e1fbdce9	xfs: take m_growlock when running growfsrt Take the grow lock when we're expanding the realtime volume, like we do for the other growfs calls. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
Zizhi Wo	ca6448aed4	xfs: Fix missing interval for missing_owner in xfs fsmap In the fsmap query of xfs, an interval can go missing: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... BUG: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt [root@fedora ~]# Normally, we should get [104, 107), but we get nothing. The problem is caused by shifting. The query in the failing scenario is for the missing_owner interval (e.g. free space in the rmapbt, or unknown space in the bnobt), which is obtained by subtraction (a gap). In this scenario, the interval is obtained via info->last. However, rec_daddr is calculated from the start_block recorded in key[1], which is converted by calling XFS_BB_TO_FSBT. If rec_daddr then does not exceed info->next_daddr, i.e. key[1].fmr_physical >> (mp)->m_blkbb_log <= info->next_daddr, no records are displayed. In the above example, 104 >> (mp)->m_blkbb_log and 107 >> (mp)->m_blkbb_log yield the same block number (13 with 4 KiB blocks), so the interval shrinks to zero length and the gap is ignored: before the shift, the query runs from sector 104 (start) to sector 107 (end); after the shift, start and end collapse onto the same block. Resolve this issue by introducing the "end_daddr" field in xfs_getfsmap_info. It records key[1].fmr_physical + key[1].fmr_length at sector granularity. If the current query is the last one, rec_daddr is set to end_daddr, which prevents intervals from going missing because of the shift. 
We only need to focus on the last query, because XFS is internally aligned to the disk block size, which is a power of two and at least 512 bytes, so shifting causes no problem in the earlier queries. After applying this patch, the above problem has been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [104..106]: free space 0 (104..106) 3 Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: limit the range of end_addr correctly] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:14 +05:30
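To make the shifting arithmetic in this fsmap fix concrete, here is a minimal C sketch. It assumes 4 KiB blocks over 512-byte sectors (a blkbb_log of 3), and the helpers `sectors_to_blocks()`, `blocks_to_sectors()` and `final_gap_sectors()` are hypothetical stand-ins rather than XFS functions. It only models how rounding the final key through block units can collapse a sub-block gap to nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not XFS code. daddrs are 512-byte sectors; this
 * sketch assumes 4 KiB filesystem blocks, i.e. 8 sectors per block. */
#define SKETCH_BLKBB_LOG 3

/* Sector -> block, rounding down (the role XFS_BB_TO_FSBT plays). */
static uint64_t sectors_to_blocks(uint64_t daddr)
{
	return daddr >> SKETCH_BLKBB_LOG;
}

/* Block -> first sector of that block. */
static uint64_t blocks_to_sectors(uint64_t fsb)
{
	return fsb << SKETCH_BLKBB_LOG;
}

/*
 * Length, in sectors, of the gap reported between the end of the previous
 * record (next_daddr) and the end of the final query (query_end).
 * With round_through_blocks set, the end key takes the lossy trip through
 * block units; otherwise it stays at sector granularity, which is the idea
 * behind the end_daddr field.
 */
static uint64_t final_gap_sectors(uint64_t next_daddr, uint64_t query_end,
				  int round_through_blocks)
{
	uint64_t rec_daddr = query_end;

	if (round_through_blocks)
		rec_daddr = blocks_to_sectors(sectors_to_blocks(query_end));

	/* No gap (and so no record) unless rec_daddr is past next_daddr. */
	return rec_daddr > next_daddr ? rec_daddr - next_daddr : 0;
}
```

With the query from the message, rounding through blocks turns `[104, 107)` into an empty range (both ends land in block 13), while keeping the end at sector granularity yields the three sectors `[104..106]` shown in the fixed output.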
Darrick J. Wong	6b35cc8d92	xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this field is null" like the rest of xfs. Cc: wozizhi@huawei.com Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-27 18:32:08 +05:30
Linus Torvalds	3e9bff3bbe	vfs-6.11-rc6.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZsxg5QAKCRCRxhvAZXjc olSiAQDvFvim4YtMmUDagC3yWTBsf+o3lYdAIuzNE0NtSn4vpAEAl/HVhQCaEDjv mcE3jokEsbvyXLnzs78PrY0Heua2mQg= =AHAd -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files use file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthread as apparently userspace relies on this. 
Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"	2024-08-27 16:57:35 +12:00
Kent Overstreet	d26935690c	bcachefs: Fix bch2_extents_match() false positive This was caught as a very rare nonce inconsistency, on systems with encryption and replication (and tiering, or some form of rebalance operation running): [Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path [Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd [Wed Jul 17 13:30:03 2024] k: u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd [Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0 [Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only bch2_extents_match() was reporting true for extents that did not actually point to the same data. bch2_extents_match() iterates over pairs of pointers, looking for pointers that point to the same location on disk (with matching generation numbers). However, one or both extents may have been trimmed (or merged) and they might not have the same disk offset: it corrects for this by subtracting the key offset and the checksum entry offset. However, this failed when an extent was immediately partially overwritten, and the new overwrite was allocated the next adjacent disk space. 
Normally, with compression off, this would never cause a bug, since the new extent would have to be immediately after the old extent for the pointer offsets to match, and the rebalance index update path is not looking for an extent outside the range of the extent it moved. However, with compression enabled, extents take up less space on disk than they do in the btree index space, so spuriously matching after a partial overwrite is possible. To fix this, add a secondary check that strictly verifies that the regions pointed to on disk overlap. https://github.com/koverstreet/bcachefs/issues/717 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-26 20:33:12 -04:00
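The numbers in the logged `old` and `k` keys of this bcachefs fix are enough to reproduce the false positive. The sketch below is a simplified model, not the bcachefs code: `struct sketch_ptr` and the helper names are hypothetical. It reads `3:355968:104` as device:bucket:offset-in-bucket (so the shared bucket cancels out), treats the key position as the end of the extent (start = 6392 - 16 for `old`), and assumes a sign convention for the checksum-entry offset that does not matter here because that offset is 0 in both keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened view of one extent pointer plus the parts of its
 * key and checksum entry that matter here. Not the bcachefs structures. */
struct sketch_ptr {
	unsigned dev;
	unsigned gen;
	int64_t disk_offset;     /* sectors; bucket base omitted (same bucket) */
	int64_t crc_offset;      /* checksum-entry offset */
	int64_t compressed_size; /* sectors the data occupies on disk */
	int64_t key_start;       /* start of the extent in the btree index */
};

/* First check: normalise away trimming/merging, then compare. */
static bool offsets_line_up(const struct sketch_ptr *a, const struct sketch_ptr *b)
{
	return a->dev == b->dev && a->gen == b->gen &&
	       a->disk_offset + a->crc_offset - a->key_start ==
	       b->disk_offset + b->crc_offset - b->key_start;
}

/* Secondary check: the on-disk regions must actually overlap. */
static bool disk_regions_overlap(const struct sketch_ptr *a, const struct sketch_ptr *b)
{
	return a->disk_offset < b->disk_offset + b->compressed_size &&
	       b->disk_offset < a->disk_offset + a->compressed_size;
}

static bool sketch_ptrs_match(const struct sketch_ptr *a, const struct sketch_ptr *b,
			      bool with_overlap_check)
{
	return offsets_line_up(a, b) &&
	       (!with_overlap_check || disk_regions_overlap(a, b));
}
```

Here both pointers normalise to -6272 (104 - 6376 and 112 - 6384), so the offset comparison alone reports a match, yet `old` occupies sectors 104..111 of the bucket and `k` occupies 112..119; the overlap check rejects the pair.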
Kent Overstreet	66927b8928	bcachefs: Fix failure to return error in data_update_index_update() This fixes an assertion pop in io_write.c - if we don't return an error we're supposed to have completed all the btree updates. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-26 20:33:12 -04:00
Qu Wenruo	10d9d8c351	btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() [BUG] There is an internal report that KASAN is reporting use-after-free, with the following backtrace: BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs] Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45 CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] Call Trace: dump_stack_lvl+0x61/0x80 print_address_description.constprop.0+0x5e/0x2f0 print_report+0x118/0x216 kasan_report+0x11d/0x1f0 btrfs_check_read_bio+0xa68/0xb70 [btrfs] process_one_work+0xce0/0x12a0 worker_thread+0x717/0x1250 kthread+0x2e3/0x3c0 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x11/0x20 Allocated by task 20917: kasan_save_stack+0x37/0x60 kasan_save_track+0x10/0x30 __kasan_slab_alloc+0x7d/0x80 kmem_cache_alloc_noprof+0x16e/0x3e0 mempool_alloc_noprof+0x12e/0x310 bio_alloc_bioset+0x3f0/0x7a0 btrfs_bio_alloc+0x2e/0x50 [btrfs] submit_extent_page+0x4d1/0xdb0 [btrfs] btrfs_do_readpage+0x8b4/0x12a0 [btrfs] btrfs_readahead+0x29a/0x430 [btrfs] read_pages+0x1a7/0xc60 page_cache_ra_unbounded+0x2ad/0x560 filemap_get_pages+0x629/0xa20 filemap_read+0x335/0xbf0 vfs_read+0x790/0xcb0 ksys_read+0xfd/0x1d0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Freed by task 20917: kasan_save_stack+0x37/0x60 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x50 __kasan_slab_free+0x4b/0x60 kmem_cache_free+0x214/0x5d0 bio_free+0xed/0x180 end_bbio_data_read+0x1cc/0x580 [btrfs] btrfs_submit_chunk+0x98d/0x1880 [btrfs] btrfs_submit_bio+0x33/0x70 [btrfs] submit_one_bio+0xd4/0x130 [btrfs] submit_extent_page+0x3ea/0xdb0 [btrfs] btrfs_do_readpage+0x8b4/0x12a0 [btrfs] btrfs_readahead+0x29a/0x430 [btrfs] read_pages+0x1a7/0xc60 page_cache_ra_unbounded+0x2ad/0x560 
filemap_get_pages+0x629/0xa20 filemap_read+0x335/0xbf0 vfs_read+0x790/0xcb0 ksys_read+0xfd/0x1d0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [CAUSE] Although I cannot reproduce the error, the report itself is good enough to pin down the cause. The call trace is the regular endio workqueue context, but the free-by-task trace shows that during btrfs_submit_chunk() we had already hit a critical error and were calling btrfs_bio_end_io() to error out, and the original endio function called bio_put() to free the whole bio. This means a double free, which causes the use-after-free, e.g.: 1. Enter btrfs_submit_bio() with a read bio. The read bio length is 128K, crossing two 64K stripes. 2. The first run of btrfs_submit_chunk(): 2.1 Call btrfs_map_block(), which returns 64K. 2.2 Call btrfs_split_bio(). Now there are two bios, one referring to the first 64K, the other referring to the second 64K. 2.3 The first half is submitted. 3. The second run of btrfs_submit_chunk(): 3.1 Call btrfs_map_block(), which somehow fails. Now we call btrfs_bio_end_io() to handle the error. 3.2 btrfs_bio_end_io() calls the original endio function, which is end_bbio_data_read(), and it calls bio_put() for the original bio. Now the original bio is freed. 4. The submitted first 64K bio finishes. Now we call into btrfs_check_read_bio() and try to advance the bio iter. But since the original bio (and thus its iter) is already freed, we trigger the above use-after-free. And even if the memory is not poisoned/corrupted, we will later call the original endio function again, causing a double free. [FIX] Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(), which has the extra check on split bios and does the proper refcounting for cloned bios. Furthermore, there is already one extra btrfs_cleanup_bio() call, but it duplicates the btrfs_orig_bbio_end_io() call, so remove that label completely. 
Reported-by: David Sterba <dsterba@suse.com> Fixes: `852eee62d3` ("btrfs: allow btrfs_submit_bio to split bios") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-27 01:34:08 +02:00
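The difference between completing the original bio directly and going through a refcounted completion, as in the btrfs fix above, can be modelled in a few lines of C. This is a hypothetical, single-threaded sketch: `struct sketch_orig_bio` and its helpers are illustrative names, and `btrfs_orig_bbio_end_io()` is not reproduced here, only the counting idea the fix relies on.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of an original bio that has been split;
 * a real implementation would use an atomic count. Not btrfs code. */
struct sketch_orig_bio {
	int pending;     /* split pieces still outstanding */
	int status;      /* first error seen, 0 if none */
	int endio_calls; /* times the original endio ran */
	bool freed;      /* stands in for bio_put() freeing the bio */
};

/* Stand-in for the original endio (end_bbio_data_read() in the trace). */
static void sketch_orig_endio(struct sketch_orig_bio *o)
{
	o->endio_calls++;
	o->freed = true;
}

/* Buggy pattern: an error in one piece completes the whole original bio
 * at once, while other pieces may still be in flight. */
static void sketch_error_direct(struct sketch_orig_bio *o, int err)
{
	o->status = err;
	sketch_orig_endio(o);
}

/* Refcounted pattern: every piece, failed or not, drops one count, and
 * only the last one to finish runs the original endio. */
static void sketch_piece_done(struct sketch_orig_bio *o, int err)
{
	if (err && !o->status)
		o->status = err;
	if (--o->pending == 0)
		sketch_orig_endio(o);
}
```

In the split case from the message (two 64K halves), the failed second half no longer frees the bio; the first half's completion does, exactly once, and the error is still reported.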
Jeff Layton	7e8ae8486e	fs/nfsd: fix update of inode attrs in CB_GETATTR Currently, we copy the mtime and ctime to the in-core inode and then mark the inode dirty. This is fine for certain types of filesystems, but not all. Some require a real setattr to properly change these values (e.g. ceph or reexported NFS). Fix this code to call notify_change() instead, which is the proper way to effect a setattr. There is one problem though: In this case, the client is holding a write delegation and has sent us attributes to update our cache. We don't want to break the delegation for this since that would defeat the purpose. Add a new ATTR_DELEG flag that makes notify_change bypass the try_break_deleg call. Fixes: `c5967721e1` ("NFSD: handle GETATTR conflict with write delegation") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-26 19:04:00 -04:00
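A hedged sketch of the ATTR_DELEG idea, assuming a much simplified setattr path: the `SKETCH_ATTR_*` bits, `struct sketch_inode` and `sketch_notify_change()` are hypothetical and do not match the real `notify_change()` or `try_break_deleg()` signatures. It shows only the control flow the message describes: the delegation break is skipped when the caller marks the update as coming on behalf of the delegation holder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real ATTR_* values differ. */
#define SKETCH_ATTR_MTIME (1u << 0)
#define SKETCH_ATTR_CTIME (1u << 1)
#define SKETCH_ATTR_DELEG (1u << 2) /* update comes from the delegation holder */

struct sketch_inode {
	long mtime, ctime;
	bool has_write_deleg;
	int deleg_breaks; /* how many times a delegation was broken */
};

/* Stand-in for try_break_deleg(): always succeeds in this model. */
static int sketch_break_deleg(struct sketch_inode *inode)
{
	if (inode->has_write_deleg) {
		inode->has_write_deleg = false;
		inode->deleg_breaks++;
	}
	return 0;
}

/* Stand-in for notify_change(): breaks the delegation unless told not to,
 * then applies the requested timestamps. */
static int sketch_notify_change(struct sketch_inode *inode, unsigned int ia_valid,
				long mtime, long ctime)
{
	if (!(ia_valid & SKETCH_ATTR_DELEG)) {
		int err = sketch_break_deleg(inode);

		if (err)
			return err;
	}
	if (ia_valid & SKETCH_ATTR_MTIME)
		inode->mtime = mtime;
	if (ia_valid & SKETCH_ATTR_CTIME)
		inode->ctime = ctime;
	return 0;
}
```

An ordinary setattr still breaks the delegation; only the CB_GETATTR-style update, flagged as delegated, leaves it in place.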
Max Filippov	c6a09e342f	binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined create_elf_fdpic_tables() does not correctly account the space for the AUX vector when an architecture has ELF_HWCAP2 defined. Prior to the commit `10e29251be` ("binfmt_elf_fdpic: fix /proc/<pid>/auxv") it resulted in the last entry of the AUX vector being set to zero, but with that change it results in a kernel BUG. Fix that by adding one to the number of AUXV entries (nitems) when ELF_HWCAP2 is defined. Fixes: `10e29251be` ("binfmt_elf_fdpic: fix /proc/<pid>/auxv") Cc: stable@vger.kernel.org Reported-by: Greg Ungerer <gerg@kernel.org> Closes: https://lore.kernel.org/lkml/5b51975f-6d0b-413c-8b38-39a6a45e8821@westnet.com.au/ Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Tested-by: Greg Ungerer <gerg@kernel.org> Link: https://lore.kernel.org/r/20240826032745.3423812-1-jcmvbkbc@gmail.com Signed-off-by: Kees Cook <kees@kernel.org>	2024-08-26 13:00:38 -07:00
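The binfmt_elf_fdpic bug is a reservation count that disagrees with the number of entries later written. Below is a minimal C model of that mismatch, assuming a three-entry base vector: the `SK_AT_*` values mirror the Linux auxv `a_type` numbers, while `sketch_nitems()` and `sketch_write_auxv()` are hypothetical and far simpler than `create_elf_fdpic_tables()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* a_type values as used by the Linux auxiliary vector. */
enum { SK_AT_NULL = 0, SK_AT_PAGESZ = 6, SK_AT_HWCAP = 16, SK_AT_HWCAP2 = 26 };

struct sketch_auxv { unsigned long type, val; };

/* Hypothetical reservation count for the entries this sketch writes. */
static size_t sketch_nitems(bool arch_has_hwcap2, bool count_hwcap2)
{
	size_t nitems = 3; /* AT_HWCAP, AT_PAGESZ, AT_NULL */

	if (arch_has_hwcap2 && count_hwcap2)
		nitems++;  /* the fix: reserve room for AT_HWCAP2 too */
	return nitems;
}

/* Write the vector into buf, which has room for 'reserved' entries.
 * Returns the number of entries written, or 0 if they do not fit. */
static size_t sketch_write_auxv(struct sketch_auxv *buf, size_t reserved,
				bool arch_has_hwcap2, unsigned long hwcap,
				unsigned long hwcap2, unsigned long pagesz)
{
	struct sketch_auxv tmp[4];
	size_t n = 0;

	tmp[n++] = (struct sketch_auxv){ SK_AT_HWCAP, hwcap };
	if (arch_has_hwcap2)
		tmp[n++] = (struct sketch_auxv){ SK_AT_HWCAP2, hwcap2 };
	tmp[n++] = (struct sketch_auxv){ SK_AT_PAGESZ, pagesz };
	tmp[n++] = (struct sketch_auxv){ SK_AT_NULL, 0 };

	if (n > reserved)
		return 0; /* would overrun the reservation */
	for (size_t i = 0; i < n; i++)
		buf[i] = tmp[i];
	return n;
}
```

Reserving room for three entries but writing four when AT_HWCAP2 is present is the overrun; counting the extra entry makes the reservation fit.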
Jeff Layton	1116e0e372	nfsd: fix potential UAF in nfsd4_cb_getattr_release Once we drop the delegation reference, the fields embedded in it are no longer safe to access. Do that last. Fixes: `c5967721e1` ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-26 11:53:05 -04:00
Jeff Layton	da05ba23d4	nfsd: hold reference to delegation when updating it for cb_getattr Once we've dropped the flc_lock, there is nothing that ensures that the delegation that was found will still be around later. Take a reference to it while holding the lock and then drop it when we've finished with the delegation. Fixes: `c5967721e1` ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-26 11:52:40 -04:00
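The two nfsd delegation fixes above follow one refcounting rule: take the reference while the lock that makes the object reachable is held, and drop it only after the last access to its fields. A minimal C11 sketch, with a hypothetical `struct sketch_deleg` standing in for the delegation (nfsd's real types and put helpers differ, and the lock itself is not modelled):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical delegation object; not nfsd's nfs4_delegation. */
struct sketch_deleg {
	atomic_int refcount;
	long cb_mtime; /* a field embedded in the delegation */
};

static int sketch_freed; /* counts frees so callers can observe them */

/* In the real code this is done while the lookup lock is still held. */
static struct sketch_deleg *sketch_deleg_get(struct sketch_deleg *d)
{
	atomic_fetch_add(&d->refcount, 1);
	return d;
}

static void sketch_deleg_put(struct sketch_deleg *d)
{
	if (atomic_fetch_sub(&d->refcount, 1) == 1) {
		free(d);
		sketch_freed++;
	}
}

/* Release path: read everything needed first, drop the reference last. */
static long sketch_cb_getattr_release(struct sketch_deleg *d)
{
	long mtime = d->cb_mtime; /* still safe: we hold a reference */

	sketch_deleg_put(d);      /* d may be freed here; do not touch it again */
	return mtime;
}
```

Even if the original owner drops its reference while the callback is outstanding, the object survives until the callback's own put, and no field is read after that put.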
David Sterba	33f58a0480	btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap() There's a warning (probably on some older compiler version): fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]: => 822:19 Initialize the variable to 0 although it's not necessary as it's either properly set or not used after an error. The called function is in the same file so this is a false alert but we want to fix all -Wmaybe-uninitialized reports. Link: https://lore.kernel.org/all/20240819070639.2558629-1-geert@linux-m68k.org/ Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-26 16:58:13 +02:00
Zizhi Wo	68415b349f	xfs: Fix the owner setting issue for rmap query in xfs fsmap I noticed an rmap query bug in xfs_io fsmap: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... Bug: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt [root@fedora ~]# Normally, we should get one record, but we get nothing. The root cause of this problem is the incorrect setting of rm_owner in the rmap query. For the initial query, where the owner is not set, __xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX. This is done to avoid omitting anything when comparing rmap items. However, if the current AG is detected to be the last one, the function sets info's high_irec from the provided key. If high->rm_owner is not specified, it should remain ULLONG_MAX; otherwise, intervals can be omitted. For example, consider "start" and "end" within the same block. If high->rm_owner == 0, it will be smaller than the record found in the rmapbt, so the query returns no records. The main call stack is as follows: xfs_ioc_getfsmap xfs_getfsmap xfs_getfsmap_datadev_rmapbt __xfs_getfsmap_datadev info->high.rm_owner = ULLONG_MAX if (pag->pag_agno == end_ag) xfs_fsmap_owner_to_rmap // set info->high.rm_owner = 0 because fmr_owner == -1ULL dest->rm_owner = 0 // get nothing xfs_getfsmap_datadev_rmapbt_query The problem can be resolved simply by modifying the internal logic of xfs_fsmap_owner_to_rmap(). 
After applying this patch, the above problem has been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:27 +05:30
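A small model of the rm_owner problem, assuming simplified rmap keys compared as (startblock, owner) pairs. `struct sketch_rmap_key`, `sketch_owner_to_rmap()` and `sketch_in_range()` are hypothetical, and the owner value 7 merely stands in for whatever nonzero owner the real record carries. The point is that a high key whose owner collapses to 0 sorts below every same-block record with a nonzero owner, so the range query finds nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified rmap key: real XFS keys carry more fields. */
struct sketch_rmap_key {
	uint64_t startblock;
	uint64_t owner;
};

/* fmr_owner 0 and -1ULL mean "lowest" and "highest owner possible".
 * fixed == false models the old behaviour of collapsing both to 0. */
static uint64_t sketch_owner_to_rmap(uint64_t fmr_owner, bool fixed)
{
	if (fmr_owner == 0 || fmr_owner == UINT64_MAX)
		return fixed ? fmr_owner : 0;
	return fmr_owner;
}

/* Lexicographic (startblock, owner) comparison. */
static int sketch_key_cmp(const struct sketch_rmap_key *a, const struct sketch_rmap_key *b)
{
	if (a->startblock != b->startblock)
		return a->startblock < b->startblock ? -1 : 1;
	if (a->owner != b->owner)
		return a->owner < b->owner ? -1 : 1;
	return 0;
}

/* Is rec inside the inclusive key range [lo, hi]? */
static bool sketch_in_range(const struct sketch_rmap_key *rec,
			    const struct sketch_rmap_key *lo,
			    const struct sketch_rmap_key *hi)
{
	return sketch_key_cmp(lo, rec) <= 0 && sketch_key_cmp(rec, hi) <= 0;
}
```

For a query whose start and end fall in block 0, the old mapping yields a high key of (0, 0) and misses the record at (0, 7), while keeping -1ULL as the high owner includes it.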
Darrick J. Wong	410e8a18f8	xfs: don't bother reporting blocks trimmed via FITRIM Don't bother reporting the number of bytes that we "trimmed" because the underlying storage isn't required to do anything(!) and failed discard IOs aren't reported to the caller anyway. It's not like userspace can use the reported value for anything useful like adjusting the offset parameter of the next call, and it's not like anyone ever wrote a manpage about FITRIM's out parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:13 +05:30
Dave Chinner	95179935be	xfs: xfs_finobt_count_blocks() walks the wrong btree As a result of the factoring in commit `14dd46cf31` ("xfs: split xfs_inobt_init_cursor"), mount started taking a long time on a user's filesystem. For Anders, this made mount times regress from under a second to over 15 minutes for a filesystem with only 30 million inodes in it. Anders bisected it down to the above commit, but even then the bug was not obvious. In this commit, over 20 calls to xfs_inobt_init_cursor() were modified, and some were modified to call a new function named xfs_finobt_init_cursor(). If that takes you a moment to reread those function names to see what the rename was, then you have realised why this bug wasn't spotted during review. And it wasn't spotted on inspection even after the bisect pointed at this commit - a single missing "f" isn't the easiest thing for a human eye to notice... The result is that xfs_finobt_count_blocks() now incorrectly calls xfs_inobt_init_cursor(), so it is now walking the inobt instead of the finobt. Hence when there are lots of allocated inodes in a filesystem, mount takes a -long- time to run because it now walks the massive allocated inode btrees instead of the small, nearly empty free inode btrees. It also means all the finobt space reservations are wrong, so mount could potentially give ENOSPC on kernel upgrade. In hindsight, commit `14dd46cf31` should have been two commits - the first to convert the finobt callers to the new API, the second to modify the xfs_inobt_init_cursor() API for the inobt callers. That would have made the bug very obvious during review. Fixes: `14dd46cf31` ("xfs: split xfs_inobt_init_cursor") Reported-by: Anders Blomdell <anders.blomdell@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:52:00 +05:30
Darrick J. Wong	5335affcff	xfs: fix folio dirtying for XFILE_ALLOC callers willy pointed out that folio_mark_dirty is the correct function to use to mark an xfile folio dirty because it calls out to the mapping's aops to mark it dirty. For tmpfs this likely doesn't matter much since it currently uses nop_dirty_folio, but let's use the abstractions properly. Reported-by: willy@infradead.org Fixes: `6907e3c00a` ("xfs: add file_{get,put}_folio") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:51:27 +05:30
Darrick J. Wong	e21fea4ac3	xfs: fix di_onlink checking for V1/V2 inodes "KjellR" complained on IRC that an old V4 filesystem suddenly stopped mounting after upgrading from 6.9.11 to 6.10.3, with the following splat when trying to read the rt bitmap inode: 00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00 IN.............. 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30 ........C...!..0 00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00 C...!..0........ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 ................ 00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ As Dave Chinner points out, this is a V1 inode with both di_onlink and di_nlink set to 1 and di_flushiter == 0. In other words, this inode was formatted this way by mkfs and hasn't been touched since then. Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc would set di_nlink, but if the filesystem didn't have NLINK, it would then set di_version = 1. libxfs_iflush_int later sees the V1 inode and copies the value of di_nlink to di_onlink without zeroing di_onlink. Eventually this filesystem must have been upgraded to support NLINK because 6.10 doesn't support !NLINK filesystems, which is how we tripped over this old behavior. The filesystem doesn't have a realtime section, so that's why the rtbitmap inode has never been touched. Fix this by removing the di_onlink/di_nlink checking for all V1/V2 inodes because this is a muddy mess. The V3 inode handling code has always supported NLINK and written di_onlink==0 so keep that check. The removal of the V1 inode handling code when we dropped support for !NLINK obscured this old behavior. 
Reported-by: kjell.m.randa@gmail.com Fixes: `40cb8613d6` ("xfs: check unused nlink fields in the ondisk inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-26 09:50:41 +05:30
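The di_onlink rule change can be summarised as a before/after predicate. This is a sketch under assumptions: `struct sketch_dinode` holds only the three fields discussed, and `sketch_nlink_ok_before()` approximates the earlier check (V1 inodes expected to leave di_nlink zero, later versions di_onlink) rather than reproducing the kernel verifier. The test values come from the hex dump above: version 1, di_onlink 1, di_nlink 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the on-disk inode core. */
struct sketch_dinode {
	uint8_t version; /* di_version: 1, 2 or 3 */
	uint16_t onlink; /* di_onlink: old 16-bit link count */
	uint32_t nlink;  /* di_nlink: 32-bit link count */
};

/* Approximation of the earlier rule: the link-count field a version does
 * not use must be zero. */
static bool sketch_nlink_ok_before(const struct sketch_dinode *d)
{
	if (d->version == 1)
		return d->nlink == 0;
	return d->onlink == 0;
}

/* Rule after the fix: only V3 inodes, which have always written
 * di_onlink == 0, are checked; V1/V2 inodes are left alone. */
static bool sketch_nlink_ok_after(const struct sketch_dinode *d)
{
	if (d->version >= 3)
		return d->onlink == 0;
	return true;
}
```

The mkfs-formatted V1 rtbitmap inode from the report fails the old predicate and passes the new one, while a V3 inode with a stray di_onlink is still rejected.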
Josef Bacik	2d34472610	btrfs: run delayed iputs when flushing delalloc We have transient failures with btrfs/301, specifically in the part where we do: for i in $(seq 0 10); do write 50m to file; rm -f file; done. Sometimes this will result in a transient quota error, and it's because sometimes we start writeback on the file, which results in a delayed iput, and thus the rm doesn't actually clean the file up. When we're flushing the quota space, we need to run the delayed iputs to make sure all the unlinks that we think have completed have actually completed. This removes the small window where we could fail to find enough space in our quota. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-25 19:15:34 +02:00
David Howells	416871f4fb	cifs: Fix FALLOC_FL_PUNCH_HOLE support The cifs filesystem doesn't quite emulate FALLOC_FL_PUNCH_HOLE correctly (note that due to lack of protocol support, it can't actually implement it directly). Whilst it will (partially) invalidate dirty folios in the pagecache, it doesn't write them back first, and so the EOF marker on the server may be lower than inode->i_size. This presents a problem, however, as if the punched hole invalidates the tail of the locally cached dirty data, writeback won't know it needs to move the EOF over to account for the hole punch (which isn't supposed to move the EOF). We could just write zeroes over the punched out region of the pagecache and write that back - but this is supposed to be a deallocatory operation. Fix this by manually moving the EOF over on the server after the operation if the hole punched would corrupt it. Note that the FSCTL_SET_ZERO_DATA RPC and the setting of the EOF should probably be compounded to stop a third party interfering (or, at least, massively reduce the chance). This was reproducible occasionally by using fsx with the following script: truncate 0x0 0x375e2 0x0 punch_hole 0x2f6d3 0x6ab5 0x375e2 truncate 0x0 0x3a71f 0x375e2 mapread 0xee05 0xcf12 0x3a71f write 0x2078e 0x5604 0x3a71f write 0x3ebdf 0x1421 0x3a71f * punch_hole 0x379d0 0x8630 0x40000 * mapread 0x2aaa2 0x85b 0x40000 fallocate 0x1b401 0x9ada 0x40000 read 0x15f2 0x7d32 0x40000 read 0x32f37 0x7a3b 0x40000 * The second "write" should extend the EOF to 0x40000, and the "punch_hole" should operate inside of that - but that depends on whether the VM gets in and writes back the data first. If it doesn't, the file ends up 0x3a71f in size, not 0x40000. 
Fixes: `31742c5a33` ("enable fallocate punch hole ("fallocate -p") for SMB3") Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-25 09:06:25 -05:00
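A sketch of the cifs EOF decision, assuming the server EOF is moved to the lower of the hole's end and the local file size, and only when both lie beyond the server's current EOF. The commit states only that the EOF is moved when the punch would otherwise corrupt it, so that rule and the helper `sketch_eof_after_punch()` are assumptions of this sketch, not the cifs code.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t sketch_min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * remote_eof: file size as the server sees it.
 * i_size:     local size, including dirty data not yet written back.
 * Returns the EOF the server should have after punching
 * [offset, offset + len), which is remote_eof when no move is needed.
 */
static uint64_t sketch_eof_after_punch(uint64_t offset, uint64_t len,
				       uint64_t remote_eof, uint64_t i_size)
{
	uint64_t end = offset + len;

	/* The hole reaches past the server's EOF and local data extends it:
	 * move the server EOF so the punch cannot leave the file short. */
	if (end > remote_eof && i_size > remote_eof)
		return sketch_min_u64(end, i_size);
	return remote_eof;
}
```

Plugging in the starred `punch_hole 0x379d0 0x8630 0x40000` step with a server EOF still at 0x3a71f gives a hole end of 0x379d0 + 0x8630 = 0x40000, so the EOF is moved to 0x40000, the size the file should have had. A hole that ends below the server EOF leaves it untouched.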
Stefan Metzmacher	017d170174	smb/client: fix rdma usage in smb2_async_writev() rqst.rq_iter needs to be truncated otherwise we'll also send the bytes into the stream socket... This is the logic behind rqst.rq_npages = 0, which was removed in "cifs: Change the I/O paths to use an iterator rather than a page list" (`d08089f649`). Cc: stable@vger.kernel.org Fixes: `d08089f649` ("cifs: Change the I/O paths to use an iterator rather than a page list") Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-25 09:06:25 -05:00
Stefan Metzmacher	b608e2c318	smb/client: remove unused rq_iter_size from struct smb_rqst Reviewed-by: David Howells <dhowells@redhat.com> Fixes: `d08089f649` ("cifs: Change the I/O paths to use an iterator rather than a page list") Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-25 09:06:25 -05:00
Stefan Metzmacher	c724b2ab6a	smb/client: avoid dereferencing rdata=NULL in smb2_new_read_req() This happens when called from SMB2_read() while using rdma and reaching the rdma_readwrite_threshold. Cc: stable@vger.kernel.org Fixes: `a6559cc1d3` ("cifs: split out smb3_use_rdma_offload() helper") Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-25 09:06:25 -05:00
Linus Torvalds	72bea05cb1	bcachefs fixes for 6.11-rc5, v2 - rhashtable conversion for vfs inodes - rcu_pending, btree key cache conversion + nocow deadlock fix + fix for new rebalance_work accounting -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmbKB9wACgkQE6szbY3K bnYMtw/+LSGV/eqwLwdeuABggU5gehWjxqkWF/uGE7fPP8pP0dJQnvCLRVKtAro2 0mJh6j+kM602fU5jH/W8WNn1h8J7dAdkyqI7P/D3ZgTdaCtxso+A0Nj95CYpNdWY ESX9CLUYSxtFatT/kfWvlRvqlSJBYo7WgNsV6tcnPdpC+Oki6Kwlq22iI+ma9Ty7 uDbgd05/R9KCSxaaV+9iojCsEq6h/tuFH8Z3f3SevA8H29odh5mt0UWNn05pf3mt rAnDUJ5TQVYubMIcbS6MhjVoLZ3AxOefkk4pctdbdmGSPJcssDeXvATn/wHYl6Fp +et1ECRU3Sc3dqcmT0RaTm/yxYytdtKA4HVxS4ELKbsIM2xU0Pjq3JQwKzHRwXDd a3r0WXa+LqHkBP37g0HhuxhxAECnbpUM9bvDivgGssVDLyxfMKUkhDsuzegjrHAF v5H08myk5maKvLv+dD6e23t0l1i9eB/bSsw1iNGOgZP4k9gsUlESvppFGw/10F+Q 1Y/qeSiNTG9kJyo9PQTOZ6rFVxrfaZ9NFP4EAXcWId81OsQHYY8XnE5XaJATxnwF MzCgNdmzuf67X6Q8fCeNCJtiZ5sCmbyENGd6hbyYFDg+R02p0NOM4ABVN6BBfXJ+ eHPyu2bvusIZt8MD6c7fOxyGsGdgLxIv/SkqLayZdxEaY3VvS2g= =ejxu -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - assorted syzbot fixes - some upgrade fixes for old (pre 1.0) filesystems - fix for moving data off a device that was switched to durability=0 after data had been written to it. 
- nocow deadlock fix - fix for new rebalance_work accounting * tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits) bcachefs: Fix rebalance_work accounting bcachefs: Fix failure to flush moves before sleeping in copygc bcachefs: don't use rht_bucket() in btree_key_cache_scan() bcachefs: add missing inode_walker_exit() bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() bcachefs: Fix double assignment in check_dirent_to_subvol() bcachefs: Fix refcounting in discard path bcachefs: Fix compat issue with old alloc_v4 keys bcachefs: Fix warning in bch2_fs_journal_stop() fs/super.c: improve get_tree() error message bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() bcachefs: Fix replay_now_at() assert bcachefs: Fix locking in bch2_ioc_setlabel() bcachefs: fix failure to relock in btree_node_fill() bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() bcachefs: unlock_long() before resort in journal replay bcachefs: fix missing bch2_err_str() bcachefs: fix time_stats_to_text() bcachefs: Fix bch2_bucket_gens_init() bcachefs: Fix bch2_trigger_alloc assert ...	2024-08-25 17:20:48 +12:00
Linus Torvalds	780bdc1ba7	five ksmbd server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbJteoACgkQiiy9cAdy T1F+Pwv/RHXSnQD+jkFEfQCEgsZZOfWD0V74VZqm90N48gfB3giZw9mtV4I1jQzI 0+UerZjN7lHIDC4f6qp48TSEodHpprAxLfsg5JJN/OxDE+0MSbctTjLeHlduVzw6 iHEdaE3jWN0p4YZRdbyrUCaOoTEk9cKwiG7r2DjArNyQ8kClveeqrGfdZUDTHNkv IIs6CJ8PFo7dicpAIGPmMz1TGq5Lh2EFjZTYEweSSlyXUNKaWgz3BXBIXD4LwK6w mFjGPxGNBDorcvzHcOUZnrpfACB3WNOSPN/WK5sQL6LXGCx3sWtUvGxLFkxFwjSq D7gvo7qnBuycNyR03RfmWyXYx+2KzdYoAUGTNV114zMJskBC0QhIIF6JK+xZdPZX XHxbr4CRR7fsaZOur5MTWXEzVJxvC1irULKoBp7lvYpEoAV6yXpK3XegAHIASKUE /Cw9qikIvxrMg4BjWPP1JhbKRw92uL2ty4oO913hbnBsScS8jCystuNl6ataiXWq PN5rN4sy =bGOb -----END PGP SIGNATURE----- Merge tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: - query directory flex array fix - fix potential null ptr reference in open - fix error message in some open cases - two minor cleanups * tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd: smb/server: update misguided comment of smb2_allocate_rsp_buf() smb/server: remove useless assignment of 'file_present' in smb2_open() smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open() smb/server: fix return value of smb2_open() ksmbd: the buffer of smb2 query dir response has at least 1 byte	2024-08-25 12:15:04 +12:00
Kent Overstreet	49aa783039	bcachefs: Fix rebalance_work accounting rebalance_work was keying off of the presence of rebalance_opts in the extent, but that was incorrect: we keep those around after rebalance for indirect extents, since the inode's options are not directly available. Fixes: `20ac515a9c` ("bcachefs: bch_acct_rebalance_work") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-24 10:16:21 -04:00
Kent Overstreet	d3204616a6	bcachefs: Fix failure to flush moves before sleeping in copygc This fixes an apparent deadlock - rebalance would get stuck trying to take nocow locks because they weren't being released by copygc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-24 10:16:21 -04:00
David Howells	e00e99ba6c	netfs: Fix interaction of streaming writes with zero-point tracker When a folio that is marked for streaming write (dirty, but not uptodate, with partial content specified in the private data) is written back, the folio is effectively switched to the blank state upon completion of the write. This means that if we want to read it in future, we need to reread the whole folio. However, if the folio is above the zero_point position, when it is read back, it will just be cleared and the read skipped, leading to apparent local corruption. Fix this by increasing the zero_point to the end of the dirty data in the folio when clearing the folio state after writeback. This is analogous to the folio having ->release_folio() called upon it. This was causing the config.log generated by configuring a cpython tree on a cifs share to get corrupted because the scripts involved were appending text to the file in small pieces. Fixes: `288ace2f57` ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/563286.1724500613@warthog.procyon.org.uk cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:09:17 +02:00
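The zero_point fix above can be modelled in a few lines. This is a simplified sketch, not the netfs code: `struct inode_model`, `struct streaming_folio`, `read_would_be_skipped()` and `clear_streaming_state()` are hypothetical names, and the "skip the read if the folio starts at or above zero_point" rule is an assumed simplification of the real read path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-inode state. */
struct inode_model {
	uint64_t zero_point;	/* reads at or above this are assumed to be zeros */
};

/* Hypothetical, simplified streaming-write folio (dirty, not uptodate). */
struct streaming_folio {
	uint64_t pos;		/* file position of the folio */
	uint32_t dirty_offset;	/* start of dirty data within the folio */
	uint32_t dirty_len;	/* length of dirty data */
};

/* A read of a folio at or past zero_point is satisfied by clearing the
 * folio instead of fetching it from the server. */
static bool read_would_be_skipped(const struct inode_model *ino,
				  uint64_t folio_pos)
{
	return folio_pos >= ino->zero_point;
}

/* On writeback completion the dirty range now exists on the server, so
 * raise zero_point to cover it before dropping the streaming record;
 * otherwise a later read would be skipped and the data lost locally. */
static void clear_streaming_state(struct inode_model *ino,
				  struct streaming_folio *f)
{
	uint64_t end = f->pos + f->dirty_offset + f->dirty_len;

	if (end > ino->zero_point)
		ino->zero_point = end;
	f->dirty_offset = 0;
	f->dirty_len = 0;
}
```

With zero_point at 4096 and a folio at 8192 whose dirty bytes end at 8342, the folio would be cleared on re-read before the fix; after `clear_streaming_state()` the zero_point is 8342 and the read goes to the server.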
David Howells	950b03d0f6	netfs: Fix missing iterator reset on retry of short read Fix netfs_rreq_perform_resubmissions() to reset before retrying a short read, otherwise the wrong part of the output buffer will be used. Fixes: `92b6cc5d1e` ("netfs: Add iov_iters to (sub)requests to describe various buffers") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240823200819.532106-6-dhowells@redhat.com cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:09:17 +02:00
David Howells	cce6bfa6ca	netfs: Fix trimming of streaming-write folios in netfs_inval_folio() When netfslib writes to a folio that it doesn't have data for, but that data exists on the server, it will make a 'streaming write' whereby it stores data in a folio that is marked dirty, but not uptodate. When it does this, it attaches a record to folio->private to track the dirty region. When truncate() or fallocate() wants to invalidate part of such a folio, it will call into ->invalidate_folio(), specifying the part of the folio that is to be invalidated. netfs_invalidate_folio(), on behalf of the filesystem, must then determine how to trim the streaming write record. In a couple of cases, however, it does this incorrectly (the reduce-length and move-start cases are switched over and don't, in any case, calculate the value correctly). Fix this by making the logic tree more obvious and fixing the cases. Fixes: `9ebff83e64` ("netfs: Prep to use folio->private for write grouping and streaming write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240823200819.532106-5-dhowells@redhat.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:09:16 +02:00
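The "more obvious logic tree" for trimming can be sketched as below. The record layout and `trim_dirty_record()` are hypothetical stand-ins for the folio->private tracking structure; the dirty data is [dirty_offset, dirty_offset + dirty_len) and the invalidation is [offset, offset + length). The hole-in-the-middle case is left unchanged here, which is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical model of the streaming-write record on a folio. */
struct dirty_record {
	unsigned int dirty_offset;
	unsigned int dirty_len;
};

/* Trim the record for an invalidation of [offset, offset + length).
 * Returns true if the whole record should be discarded. */
static bool trim_dirty_record(struct dirty_record *r,
			      unsigned int offset, unsigned int length)
{
	unsigned int fstart = r->dirty_offset;
	unsigned int fend = fstart + r->dirty_len;
	unsigned int iend = offset + length;

	/* No overlap with the dirty data: nothing to do. */
	if (offset >= fend || iend <= fstart)
		return false;

	if (offset <= fstart) {
		/* Invalidation covers the head of the data... */
		if (iend >= fend)
			return true;		/* ...and the tail: erase it all */
		r->dirty_offset = iend;		/* move the start forward */
		r->dirty_len = fend - iend;
		return false;
	}

	if (iend >= fend) {
		/* Invalidation covers only the tail: reduce the length. */
		r->dirty_len = offset - fstart;
		return false;
	}

	/* Hole in the middle: this sketch leaves the record unchanged. */
	return false;
}
```

For dirty data at [100, 300), invalidating [0, 150) moves the start to 150 with length 150, while invalidating [200, 400) keeps the start at 100 and reduces the length to 100; these are the two cases the commit says were switched over.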
David Howells	7dfc8f0c61	netfs: Fix netfs_release_folio() to say no if folio dirty Fix netfs_release_folio() to say no (ie. return false) if the folio is dirty (analogous with iomap's behaviour). Without this, it will say yes to the release of a dirty page by split_huge_page_to_list_to_order(), which will result in the loss of untruncated data in the folio. Without this, the generic/075 and generic/112 xfstests (both fsx-based tests) fail with minimum folio size patches applied[1]. Fixes: `c1ec4d7c2e` ("netfs: Provide invalidate_folio and release_folio calls") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1] Link: https://lore.kernel.org/r/20240823200819.532106-4-dhowells@redhat.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:09:16 +02:00
David Howells	a74ee0e878	afs: Fix post-setattr file edit to do truncation correctly At the end of a kAFS RPC operation, there is an "edit" phase (originally intended for post-directory modification ops to edit the local image) that the setattr VFS op uses to fix up the pagecache if the RPC that requested truncation of a file was successful. afs_setattr_edit_file() calls truncate_setsize(), which sets i_size, expands the pagecache if needed and truncates the pagecache. The first two of those, however, are redundant as they've already been done by afs_setattr_success() under the io_lock and the first is also done under the callback lock (cb_lock). Fix afs_setattr_edit_file() to call truncate_pagecache() instead (which is called by truncate_setsize()), thereby skipping the redundant parts. Fixes: `100ccd18bb` ("netfs: Optimise away reads above the point at which there can be no data") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240823200819.532106-3-dhowells@redhat.com cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:09:16 +02:00
Christian Brauner	d10771d51b	Merge patch series "ovl: simplify ovl_parse_param_lowerdir()" Simplify and fix overlayfs layer parsing so the maximum of 500 layers can be used. * patches from https://lore.kernel.org/r/20240705011510.794025-1-chengzhihao1@huawei.com: ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() Link: https://lore.kernel.org/r/20240705011510.794025-1-chengzhihao1@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-24 16:00:46 +02:00
Linus Torvalds	60f0560f53	NFS Client Bugfixes for Linux 6.11-rc Bugfixes: * Fix rpcrdma refcounting in xa_alloc * Fix rpcrdma usage of XA_FLAGS_ALLOC * Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS * Fix attribute bitmap decoder to handle a 3rd word * Add reschedule points when returning delegations to avoid soft lockups * Fix clearing layout segments in layoutreturn * Avoid unnecessary rescanning of the per-server delegation list -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmbIywQACgkQ18tUv7Cl QOtvHQ//VJO6iTh3kmONCru2ohxgAl7+qX+3HHbMR62S8AWCp0ujId0nir7CuxDa 49c+R03s36lGXkTX5x0d3Idhbv12a5Jdy21oZuJU+RCm6Z7MdNXdDl9HN/gXXqdl 0Z3Wk8r3Pi4/tgejau0i2zN2wXxVNKSQgovETnuI/BQLHupDvDy8Sd8lrIqqoiXY sffCiKSTbCFWg6JLEF1UWZZ1VtLUsDZRBQJD+67l1NbjSX/tiBsY0CquWcHjXAlY 2VGDXdFCZwsQyYuqNdMVh1Cr95hcT0F1YZLOT+vn+6b6rA+UbtmPlURt8iR4gBFo Fadpp5pRziYb9wyg/DgFABihB6PzcboIg5Lm0rx870WEuzxSs8NQeQ9sw5hJC797 At8C4I+cNLOaPU5nUEcG53+svEl9F2jDI2jFc8aa5zAW2hHAtpZhLVre5to0CDb/ hu/H+h2yvjJyfSB7kCdVqlU93PJM96P7F1KEdVYmkuXQQMhkZknntVnu41w7KOst SKy0iU29idlU7SFHvKYyc4URC63kTKLWmTZZn3uDJouwDRYudCRPFQpiwnoDJOY3 wRi3jRPhmZQXn7ArChQSPrHqjRVTu9Y0gUcrtAj4YODv+bkngxmZbG3J3IqZKw21 AkMiDZESyriKVOTurX0Fzaj63zHSrIc+TwyTTXwFCGpCbVCf5/k= =OnQe -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: - Fix rpcrdma refcounting in xa_alloc - Fix rpcrdma usage of XA_FLAGS_ALLOC - Fix requesting FATTR4_WORD2_OPEN_ARGUMENTS - Fix attribute bitmap decoder to handle a 3rd word - Add reschedule points when returning delegations to avoid soft lockups - Fix clearing layout segments in layoutreturn - Avoid unnecessary rescanning of the per-server delegation list * tag 'nfs-for-6.11-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Avoid unnecessary rescanning of the per-server delegation list NFSv4: Fix clearing of layout segments in layoutreturn NFSv4: Add missing rescheduling points in 
nfs_client_return_marked_delegations nfs: fix bitmap decoder to handle a 3rd word nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS rpcrdma: Trace connection registration and unregistration rpcrdma: Use XA_FLAGS_ALLOC instead of XA_FLAGS_ALLOC1 rpcrdma: Device kref is over-incremented on error from xa_alloc	2024-08-24 09:03:25 +08:00
Linus Torvalds	66ace9a8f9	four cifs.ko client fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbIqhgACgkQiiy9cAdy T1EAPgwAnW+vu15huT1zQn2BtFcn85zdBGXL/avjbbMLDwNHj5Lpae+PbbRa4gZ0 VN6OQdq5Rt3Z2pJDfFZtFECKq4AN1Lxn1ur4wujBIzez3CxyFCXjDeS5/3lRP6c+ 0CiHVtRe7IgncGUnnhvwPhiG6/cjTNiXlImb6SgmFLP/0U7ZnWl5p3LmR7exfVY9 Fubqq3HF0UpxMUD3thM055ftqT/xP6RdrITX2K2Led+BlJAJm1x+0E//4nApQ2IX C3VeBRZTvQtBC+pay754BqSnfAifgVObF8cfswDMS4U7ImV5gS+CxSx4vlg4bF7o 2f32mZAXz9U3yMIBMjtBT/q/LbN28SRSjo1x35CJ9LCUK6IzARHiLZG/PVltK3Cj copuH3n5ZV0nGVdsv10Uheo3euFlrKKylPn8xAEhMsQzG7Q6ek/pT+avb+xl6MWf i8eOnMobCFiOEJtSk/uV23579wf8maVQM92M2rf2UO6K5eHIceOq0HGfSoeVV9dZ 1rgZb1D6 =8U5O -----END PGP SIGNATURE----- Merge tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - fix refcount leak (can cause rmmod fail) - fix byte range locking problem with cached reads - fix for mount failure if reparse point unrecognized - minor typo * tag 'v6.11-rc4-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock smb: client: ignore unhandled reparse tags smb3: fix problem unloading module due to leaked refcount on shutdown smb3: fix broken cached reads when posix locks	2024-08-24 08:50:21 +08:00
Zhihao Cheng	441e36ef5b	ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err Add the missing '\n' to the pr_err() call in ovl_parse_param_lowerdir(), which ensures that the error message is displayed at once. Fixes: `b36a5780cb` ("ovl: modify layer parameter parsing") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240705011510.794025-4-chengzhihao1@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-23 19:58:59 +02:00
Zhihao Cheng	ca76ac36bb	ovl: fix wrong lowerdir number check for parameter Opt_lowerdir The maximum number of lowerdirs is OVL_MAX_STACK (500), but commit 37f32f526438 ("ovl: fix memory leak in ovl_parse_param()") broke this limit for parameter Opt_lowerdir. Commit 819829f0319a ("ovl: refactor layer parsing helpers") and commit 24e16e385f22 ("ovl: add support for appending lowerdirs one by one") added the check ovl_mount_dir_check() to ovl_parse_param_lowerdir(), which requires 'ctx->nr' to be smaller than OVL_MAX_STACK. After commit 37f32f526438 ("ovl: fix memory leak in ovl_parse_param()") was applied, 'ctx->nr' is updated before ovl_mount_dir_check() runs, which reduces the maximum number of lowerdirs to 499 for parameter Opt_lowerdir. Fix it by replacing the lower-layer parsing code with the existing helper function ovl_parse_layer(). Fixes: `37f32f5264` ("ovl: fix memory leak in ovl_parse_param()") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240705011510.794025-3-chengzhihao1@huawei.com Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-23 19:56:38 +02:00
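The off-by-one described above comes down to whether the counter is bumped before or after the limit check. The sketch below is illustrative only: `struct layer_ctx`, `mount_dir_check()`, the two `add_layer_*()` helpers and `count_accepted()` are hypothetical, and only OVL_MAX_STACK = 500 and the "nr must be smaller than OVL_MAX_STACK" rule come from the commit message.

```c
#include <errno.h>

#define OVL_MAX_STACK 500	/* limit named in the commit message */

/* Hypothetical stand-in for the parser context. */
struct layer_ctx {
	unsigned int nr;	/* layers accepted so far */
};

/* Limit check as described: nr must be smaller than OVL_MAX_STACK. */
static int mount_dir_check(const struct layer_ctx *ctx)
{
	return ctx->nr >= OVL_MAX_STACK ? -EINVAL : 0;
}

/* Regressed ordering: nr is bumped before the check runs. */
static int add_layer_update_first(struct layer_ctx *ctx)
{
	ctx->nr++;
	if (mount_dir_check(ctx)) {
		ctx->nr--;
		return -EINVAL;
	}
	return 0;
}

/* Fixed ordering: check first, then account for the new layer. */
static int add_layer_check_first(struct layer_ctx *ctx)
{
	int err = mount_dir_check(ctx);

	if (err)
		return err;
	ctx->nr++;
	return 0;
}

/* Feed up to `requested` layers and report how many were accepted. */
static unsigned int count_accepted(int (*add)(struct layer_ctx *),
				   unsigned int requested)
{
	struct layer_ctx ctx = { 0 };
	unsigned int i;

	for (i = 0; i < requested; i++)
		if (add(&ctx))
			break;
	return ctx.nr;
}
```

Feeding 600 layers through the update-first ordering accepts only 499, while the check-first ordering accepts the intended 500.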
Christian Brauner	7eff3453cb	ovl: pass string to ovl_parse_layer() So it can be used for parsing the Opt_lowerdir. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240705011510.794025-2-chengzhihao1@huawei.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-23 19:56:38 +02:00
Olga Kornievskaia	a204501e17	nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open Prior to commit `3f29cc82a8` ("nfsd: split sc_status out of sc_type") states_show() relied on the sc_type field being a valid type before calling into a subfunction to show the content of a particular stateid. Since that commit, the validity of the stateid is tracked in sc_status, and sc_type is no longer set to 0 while unhashing the stateid. This resulted in a kernel oops for NFSv4.0 opens that stay around: nfs4_show_open() would dereference sc_file, which was NULL. Instead, for closed open stateids, skip displaying information that relies on having a valid sc_file. To reproduce: mount the server with 4.0, read and close a file, and then on the server cat /proc/fs/nfsd/clients/2/states [ 513.590804] Call trace: [ 513.590925] _raw_spin_lock+0xcc/0x160 [ 513.591119] nfs4_show_open+0x78/0x2c0 [nfsd] [ 513.591412] states_show+0x44c/0x488 [nfsd] [ 513.591681] seq_read_iter+0x5d8/0x760 [ 513.591896] seq_read+0x188/0x208 [ 513.592075] vfs_read+0x148/0x470 [ 513.592241] ksys_read+0xcc/0x178 Fixes: `3f29cc82a8` ("nfsd: split sc_status out of sc_type") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-23 12:25:46 -04:00
Ed Tsai	996b37da1e	backing-file: convert to using fops->splice_write Filesystems may define their own splice write. Therefore, use the file fops instead of invoking iter_file_splice_write() directly. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Link: https://lore.kernel.org/r/20240708072208.25244-1-ed.tsai@mediatek.com Fixes: `5ca7346861` ("fuse: implement splice read/write passthrough") Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-23 13:08:31 +02:00
Trond Myklebust	f92214e4c3	NFS: Avoid unnecessary rescanning of the per-server delegation list If the call to nfs_delegation_grab_inode() fails, we will not have dropped any locks that require us to rescan the list. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Trond Myklebust	d72b796311	NFSv4: Fix clearing of layout segments in layoutreturn Make sure that we clear the layout segments in cases where we see a fatal error, and also in the case where the layout is invalid. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Trond Myklebust	a017ad1313	NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations We're seeing reports of soft lockups when iterating through the loops, so let's add rescheduling points. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Jeff Layton	95832998fb	nfs: fix bitmap decoder to handle a 3rd word It only decodes the first two words at this point. Have it decode the third word as well. Without this, the client doesn't send delegated timestamps in the CB_GETATTR response. With this change we also need to expand the on-stack bitmap in decode_recallany_args to 3 elements, in case the server sends a larger bitmap than expected. Fixes: `43df7110f4` ("NFSv4: Add CB_GETATTR support for delegated attributes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
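A bitmap decoder that keeps a third word and tolerates a longer-than-expected bitmap might look like the sketch below. It is not the kernel's callback XDR code: `decode_bitmap()` is hypothetical, and it assumes the input is an array of already byte-swapped words (count first, then that many bitmap words) rather than a raw XDR stream.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_WORDS 3	/* room for word0..word2, as the fix requires */

/* Hypothetical decoder over host-order words: buf[0] is the word count,
 * followed by that many bitmap words. Returns the number of input words
 * consumed, or 0 if the input is truncated. */
static size_t decode_bitmap(const uint32_t *buf, size_t avail,
			    uint32_t bitmap[BITMAP_WORDS])
{
	uint32_t len, i;

	if (avail < 1)
		return 0;
	len = buf[0];
	if (len > avail - 1)
		return 0;

	memset(bitmap, 0, BITMAP_WORDS * sizeof(bitmap[0]));
	/* Keep as many words as there is room for; skip any extras so a
	 * server sending a larger bitmap cannot overflow the array. */
	for (i = 0; i < len && i < BITMAP_WORDS; i++)
		bitmap[i] = buf[1 + i];
	return 1 + (size_t)len;
}
```

Sizing the destination to three words mirrors the commit's note that the on-stack bitmap in decode_recallany_args also has to grow to 3 elements.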
Jeff Layton	cb78f9b7d0	nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS The client doesn't properly request FATTR4_OPEN_ARGUMENTS in the initial SERVER_CAPS getattr. Add FATTR4_WORD2_OPEN_ARGUMENTS to the initial request. Fixes: `707f13b3d0` (NFSv4: Add support for the FATTR4_OPEN_ARGUMENTS attribute) Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:09 -04:00
ChenXiaoSong	5e51224d2a	smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock The comments have typos; fix them so they do not confuse readers. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 15:44:19 -05:00
Jeff Layton	f58bab6fd4	nfsd: ensure that nfsd4_fattr_args.context is zeroed out If nfsd4_encode_fattr4 ends up doing a "goto out" before we get to checking for the security label, then args.context will be set to uninitialized junk on the stack, which we'll then try to free. Initialize it early. Fixes: `f59388a579` ("NFSD: Add nfsd4_encode_fattr4_sec_label()") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-22 14:49:10 -04:00
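The bug class fixed here is a common C pitfall: a field that is freed at a shared exit label must be initialized before the first `goto out`. The sketch below shows the pattern with hypothetical names (`struct fattr_args_model`, `encode_attrs()`) and plain `malloc()`/`free()` in place of the kernel's security-label helpers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: `context` is allocated late but always freed
 * at the common exit label. */
struct fattr_args_model {
	char *context;
	unsigned int len;
};

static int encode_attrs(int fail_early, int want_label)
{
	struct fattr_args_model args;
	int status = 0;

	/* Initialize before any `goto out`, so the free below never sees
	 * uninitialized stack contents. */
	args.context = NULL;
	args.len = 0;

	if (fail_early) {
		status = -1;
		goto out;
	}
	if (want_label) {
		args.context = malloc(16);
		if (!args.context) {
			status = -1;
			goto out;
		}
		args.len = 16;
		memset(args.context, 0, args.len);
	}
out:
	free(args.context);	/* free(NULL) is a no-op */
	return status;
}
```

Moving the two assignments below the `fail_early` branch would reproduce the original problem: the early exit would pass stack garbage to `free()`.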
Paulo Alcantara	ec68680411	smb: client: ignore unhandled reparse tags Just ignore reparse points that the client can't parse rather than bailing out and not opening the file or directory. Reported-by: Marc <1marc1@gmail.com> Closes: https://lore.kernel.org/r/CAMHwNVv-B+Q6wa0FEXrAuzdchzcJRsPKDDRrNaYZJd6X-+iJzw@mail.gmail.com Fixes: `539aad7f14` ("smb: client: introduce ->parse_reparse_point()") Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 12:37:16 -05:00
Steve French	15179cf280	smb3: fix problem unloading module due to leaked refcount on shutdown The shutdown ioctl can leak a refcount on the tlink, which can prevent rmmod from unloading the cifs.ko module. Found while debugging xfstest generic/043. Fixes: `69ca1f5755` ("smb3: add dynamic tracepoints for shutdown ioctl") Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 12:36:57 -05:00
ChenXiaoSong	2b7e0573a4	smb/server: update misguided comment of smb2_allocate_rsp_buf() smb2_allocate_rsp_buf() can return error codes other than -ENOMEM. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	0dd771b7d6	smb/server: remove useless assignment of 'file_present' in smb2_open() The variable is already true here. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	4e8771a366	smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open() A null-ptr-deref will occur when (req_op_level == SMB2_OPLOCK_LEVEL_LEASE) and parse_lease_state() returns NULL. Fix this by checking whether 'lease_ctx_info' is NULL. Additionally, remove the redundant parentheses in parse_durable_handle_context(). Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	2186a11653	smb/server: fix return value of smb2_open() In most error cases, smb2_open() does not return an error code, so __process_request() does not print an error message. Fix this by returning the correct value at the end of smb2_open(). Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
Namjae Jeon	ce61b605a0	ksmbd: the buffer of smb2 query dir response has at least 1 byte When the STATUS_NO_MORE_FILES status is set in the smb2 query dir response, ->StructureSize is set to 9, which means the buffer has 1 byte. This issue occurs because ->Buffer[1] in smb2_query_directory_rsp was changed to a flex-array. Fixes: `eb3e28c1e8` ("smb3: Replace smb2pdu 1-element arrays with flex-arrays") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
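The size arithmetic behind this fix can be checked with a small model of the response body. This is a simplified sketch: it omits the 64-byte SMB2 header, uses host-order integer types instead of the kernel's little-endian types, and `struct query_dir_rsp_body` and `min_rsp_body_len()` are hypothetical names. The point it illustrates is that a flex array adds nothing to `sizeof`, whereas the old one-element `Buffer[1]` counted one byte, so a length computed from `sizeof` alone comes up one byte short of StructureSize 9.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified QUERY_DIRECTORY response body (header omitted). */
struct query_dir_rsp_body {
	uint16_t StructureSize;		/* set to 9 per the commit message */
	uint16_t OutputBufferOffset;
	uint32_t OutputBufferLength;
	uint8_t  Buffer[];		/* flex array: contributes 0 to sizeof */
};

/* Smallest body length consistent with StructureSize == 9: the 8 fixed
 * bytes plus at least one byte of Buffer, even when no entries are
 * returned (e.g. STATUS_NO_MORE_FILES). */
static size_t min_rsp_body_len(void)
{
	return sizeof(struct query_dir_rsp_body) + 1;
}
```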
Kent Overstreet	a592cdf516	bcachefs: don't use rht_bucket() in btree_key_cache_scan() rht_bucket() does strange complicated things when a rehash is in progress. Instead, just skip scanning when a rehash is in progress: scanning is going to be more expensive (many more empty slots to cover), and some sort of infinite loop is being observed Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 10:04:41 -04:00
Kent Overstreet	3e878fe5a0	bcachefs: add missing inode_walker_exit() fix a small leak Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 10:04:41 -04:00
Kent Overstreet	87313ac1f1	bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() bch2_btree_key_cache_drop() evicts the key cache entry - it's used when we're doing an update that bypasses the key cache, because for cache coherency reasons a key can't be in the key cache unless it also exists in the btree - i.e. creates have to bypass the cache. After evicting, the path no longer points to a key cache key, and relock() will always fail if should_be_locked is true. Prep for improving path->should_be_locked assertions Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 03:12:57 -04:00
Yuesong Li	dedb2fe375	bcachefs: Fix double assignment in check_dirent_to_subvol() ret was assigned twice in check_dirent_to_subvol(). Reported by cocci. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:40:52 -04:00
Kent Overstreet	0b50b7313e	bcachefs: Fix refcounting in discard path bch_dev->io_ref does not protect against the filesystem going away; bch_fs->writes does. Thus the filesystem write ref needs to be the last ref we release. Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	8ed823b192	bcachefs: Fix compat issue with old alloc_v4 keys we allow new fields to be added to existing key types, and new versions should treat them as being zeroed; this was not handled in alloc_v4_validate. Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
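The compatibility rule stated here ("new versions should treat [new fields] as being zeroed") is usually implemented by copying only the bytes the stored key actually has and zero-filling the rest. The sketch below shows that pattern; `struct alloc_val` and `read_alloc_val()` are hypothetical and do not reflect bcachefs's real alloc_v4 layout or its validation code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical versioned value: older writers stored only the first two
 * fields; newer code appended `new_field`. */
struct alloc_val {
	uint64_t journal_seq;
	uint32_t flags;
	uint32_t new_field;	/* added later; must read as 0 for old keys */
};

/* Copy at most sizeof(*out) bytes and zero whatever the stored key lacks. */
static void read_alloc_val(struct alloc_val *out, const void *stored,
			   size_t stored_bytes)
{
	size_t n = stored_bytes < sizeof(*out) ? stored_bytes : sizeof(*out);

	memset(out, 0, sizeof(*out));
	memcpy(out, stored, n);
}
```

A validator written against the new layout should then check the zero-extended copy rather than assume every field was present on disk.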
Kent Overstreet	7f2de6947f	bcachefs: Fix warning in bch2_fs_journal_stop() j->last_empty_seq needs to match j->seq when the journal is empty Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	06f67437ab	fs/super.c: improve get_tree() error message seeing an odd bug where we fail to correctly return an error from .get_tree(): https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66 we need to be able to distinguish between accidentally returning a positive error (as implied by the log) and no error. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	bdbdd4759f	bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	cab18be695	bcachefs: Fix replay_now_at() assert Journal replay, in the slowpath where we insert keys in journal order, was inserting keys in the wrong order; keys from early repair come last. Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	6575b8c987	bcachefs: Fix locking in bch2_ioc_setlabel() Fixes: `7a254053a5` ("bcachefs: support FS_IOC_SETFSLABEL") Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	5dbfc4ef72	bcachefs: fix failure to relock in btree_node_fill() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	3c5d0b72a8	bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() We weren't always so strict about trans->locked state - but now we are, and new assertions are shaking some bugs out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	1dceae4cc1	bcachefs: unlock_long() before resort in journal replay Fix another SRCU splat - this one pretty harmless. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	cecc328240	bcachefs: fix missing bch2_err_str() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	b8db1bd802	bcachefs: fix time_stats_to_text() Fixes: `7423330e30` ("bcachefs: prt_printf() now respects \r\n\t") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	c2a503f3e9	bcachefs: Fix bch2_bucket_gens_init() Comparing the wrong bpos - this was missed because normally bucket_gens_init() runs on brand new filesystems, but this bug caused it to overwrite bucket_gens keys with 0s when upgrading ancient filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	e150a7e89c	bcachefs: Fix bch2_trigger_alloc assert On testing on an old mangled filesystem, we missed a case. Fixes: `bd864bc2d9` ("bcachefs: Fix bch2_trigger_alloc when upgrading from old versions") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	49203a6b9d	bcachefs: Fix failure to relock in btree_node_get() discovered by new trans->locked asserts Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	548e7f5167	bcachefs: setting bcachefs_effective.* xattrs is a noop bcachefs_effective.* xattrs show the options inherited from parent directories (as well as explicitly set); this namespace is not for setting bcachefs options. Change the .set() handler to a noop so that if e.g. rsync is copying xattrs it'll do the right thing, and only copy xattrs in the bcachefs.* namespace. We don't want to return an error, because that will cause rsync to bail out or get spammy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	8cc0e50614	bcachefs: Fix "trying to move an extent, but nr_replicas=0" data_update_init() does a bunch of complicated stuff to decide how many replicas to add, since we only want to increase an extent's durability on an explicit rereplicate, but extent pointers may be on devices with different durability settings. There was a corner case when evacuating a device that had been set to durability=0 after data had been written to it, and extents on that device had already been rereplicated - then evacuate only needs to drop pointers on that device, not move them. So the assert for !m->op.nr_replicas was spurious; this was a perfectly legitimate case that needed to be handled. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	3f53d05041	bcachefs: bch2_data_update_init() cleanup Factor out some helpers - this function has gotten much too big. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Linus Torvalds	5c6154ffd4	Changes since last update: - Allow large folios on compressed inodes; - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially fails; - Two minor cleanups. -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmbF1s8RHHhpYW5nQGtl cm5lbC5vcmcACgkQUXZn5Zlu5qqzghAAjSi/OITIcpjYe3GOxDNEbIxTFf+5d7ge 66KRMBm5FG6xBaXOK4xhit2EdFsDVMAna6FjYyRJYpDDQX5MAUsifDeZeGZCKOV5 KpUk2CfhIp1zi9CeVA5Tl6rfdz0b0SKz77uRK9BN8PWmCdbeenOKgK6HU06S5enC uPud9QegQELXad0bIe+uo+4zGQnqxwGMfkNEW+cfbp200m6kpig/SgttoXCtz1mm CEZpv87MlrQLbLtox+KcesFhYofxWdx/My5ckJVbaHGZdM29HKehRLIsBFX3di5N 4OkGnFbh16lhx6s4a/b1pCeDNH1NcmP7uqCoGr+dCVKQ79glSZQvfiV4myQk7vRN 4Orj+zq4aBSmYOvizIFMH++IGDioPBxkhKv3ReeCe4t/L8LHPkGEaeT+iM9b+b4n uK0faj6qREygf2rHIRr6ciq7GdxlMwdlF7AeNUIyCHNO84J39+GkAID3Nqs+vzuF wSJAuqp959WgJf2Co5ugb4vjbrpq8TKfZG0yk7KnIOWJkhAKDcIwta+ooloIYtD8 5gHhMRodIgKjYKPOi9vx9cQPlISTMsqVWpRuyzH9vCvqNvkzDriAez1kFq01tZTO gcK5QGekSgft/5h7PLFrbVV1kVfVdjNapsIY+XAEmBGGCJCScCrlIxjbsRwsseFW JIpQXilIU3c= =gFZW -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "As I mentioned in the merge window pull request, there is a regression which could cause system hang due to page migration. The corresponding fix landed upstream through MM tree last week (commit `2e6506e1c4`: "mm/migrate: fix deadlock in migrate_pages_batch() on large folios"), therefore large folios can be safely allowed for compressed inodes and stress tests have been running on my fleet for over 20 days without any regression. Users have explicitly requested this for months, so let's allow large folios for EROFS full cases now for wider testing. Additionally, there is a fix which addresses invalid memory accesses on a failure path triggered by fault injection and two minor cleanups to simplify the codebase. 
Summary: - Allow large folios on compressed inodes - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially fails - Two minor cleanups" * tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails erofs: allow large folios for compressed files erofs: get rid of check_layout_compatibility() erofs: simplify readdir operation	2024-08-22 06:06:09 +08:00
Christian Brauner	524b2c6dc8	romfs: fix romfs_read_folio() Add the correct offset to folio_zero_tail(). Fixes: `d86f2de026` ("romfs: Convert romfs_read_folio() to use a folio") Reported-by: Greg Ungerer <gregungerer@westnet.com.au> Link: https://lore.kernel.org/r/Zr0GTnPHfeA0P8nb@casper.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-21 22:32:58 +02:00
David Howells	92764e8822	netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty" This partially reverts commit `2ff1e97587`. In addition to reverting the removal of PG_private_2 wrangling from the buffered read code[1][2], the removal of the waits for PG_private_2 from netfs_release_folio() and netfs_invalidate_folio() needs reverting too. It also adds a wait into ceph_evict_inode() to wait for netfs read and copy-to-cache ops to complete. Fixes: `2ff1e97587` ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1] Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2] Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com cc: Max Kellermann <max.kellermann@ionos.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-21 22:32:58 +02:00
Gao Xiang	0005e01e1e	erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails If z_erofs_gbuf_growsize() partially fails on a global buffer due to memory allocation failure or fault injection (as reported by syzbot [1]), new pages need to be freed by comparing to the existing pages to avoid memory leaks. However, the old gbuf->pages[] array may not be large enough, which can lead to null-ptr-deref or out-of-bound access. Fix this by checking against gbuf->nrpages in advance. [1] https://lore.kernel.org/r/000000000000f7b96e062018c6e3@google.com Reported-by: syzbot+242ee56aaa9585553766@syzkaller.appspotmail.com Fixes: `d6db47e571` ("erofs: do not use pagepool in z_erofs_gbuf_growsize()") Cc: <stable@vger.kernel.org> # 6.10+ Reviewed-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240820085619.1375963-1-hsiangkao@linux.alibaba.com	2024-08-21 08:12:05 +08:00
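The cleanup rule in this fix can be illustrated with a minimal userspace sketch. It assumes a simplified buffer; `struct gbuf_sketch`, `grow_pages()` and the `fail_at` fault-injection knob are hypothetical and are not the erofs code. On failure, only pages allocated during the call are freed, and the old `pages[]` array is read only at indices below the old `nrpages`.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a global buffer: page pointers plus a count. */
struct gbuf_sketch {
	void **pages;
	unsigned int nrpages;
};

/*
 * Grow gb to new_nr pages, reusing the existing ones. fail_at simulates an
 * allocation failure at that index (-1 disables it). Returns 0 on success,
 * or -1 on failure with gb left unchanged.
 */
int grow_pages(struct gbuf_sketch *gb, unsigned int new_nr, int fail_at)
{
	void **pages;
	unsigned int j;

	if (new_nr <= gb->nrpages)
		return 0;			/* only growing is modelled */
	pages = calloc(new_nr, sizeof(*pages));
	if (!pages)
		return -1;
	for (j = 0; j < new_nr; j++) {
		if (j < gb->nrpages) {
			pages[j] = gb->pages[j];	/* reuse old page */
			continue;
		}
		pages[j] = ((int)j == fail_at) ? NULL : malloc(4096);
		if (!pages[j])
			goto out_free;
	}
	free(gb->pages);
	gb->pages = pages;
	gb->nrpages = new_nr;
	return 0;

out_free:
	for (j = 0; j < new_nr; j++) {
		/* Check j against the OLD count before reading gb->pages[j]. */
		if (j < gb->nrpages && pages[j] == gb->pages[j])
			continue;		/* still owned by the old array */
		free(pages[j]);			/* free(NULL) is a no-op */
	}
	free(pages);
	return -1;
}
```

Without the `j < gb->nrpages` test, the cleanup loop would read past the end of the smaller old array, which is the class of bug the commit describes.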
Kent Overstreet	2102bdac67	bcachefs: Extra debug for data move path We don't have sufficient information to debug: https://github.com/koverstreet/bcachefs/issues/726 - print out durability of extent ptrs, when non default - print the number of replicas we need in data_update_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-19 21:42:08 -04:00
Gao Xiang	e080a26725	erofs: allow large folios for compressed files As commit `2e6506e1c4` ("mm/migrate: fix deadlock in migrate_pages_batch() on large folios") has landed upstream, large folios can be safely enabled for compressed inodes since all prerequisites have already landed in 6.11-rc1. Stress tests have been running on my fleet for over 20 days without any regression. Additionally, users [1] have requested it for months. Let's allow large folios for EROFS full cases upstream now for wider testing. [1] https://lore.kernel.org/r/CAGsJ_4wtE8OcpinuqVwG4jtdx6Qh5f+TON6wz+4HMCq=A2qFcA@mail.gmail.com Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> [ Gao Xiang: minor commit typo fixes. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240819025207.3808649-1-hsiangkao@linux.alibaba.com	2024-08-19 16:10:04 +08:00
Hongzhen Luo	2c534624ae	erofs: get rid of check_layout_compatibility() Simple enough to just open-code it. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240806112208.150323-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-08-19 11:06:20 +08:00
Hongzhen Luo	5b5c96c63d	erofs: simplify readdir operation - Use i_size instead of i_size_read() due to immutable fses; - Get rid of an unneeded goto since erofs_fill_dentries() also works; - Remove unnecessary lines. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-08-19 11:06:20 +08:00
Kent Overstreet	47cdc7b144	bcachefs: Fix incorrect gfp flags fixes: 00488 WARNING: CPU: 9 PID: 194 at mm/page_alloc.c:4410 __alloc_pages_noprof+0x1818/0x1888 00488 Modules linked in: 00488 CPU: 9 UID: 0 PID: 194 Comm: kworker/u66:1 Not tainted 6.11.0-rc1-ktest-g18fa10d6495f #2931 00488 Hardware name: linux,dummy-virt (DT) 00488 Workqueue: writeback wb_workfn (flush-bcachefs-2) 00488 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00488 pc : __alloc_pages_noprof+0x1818/0x1888 00488 lr : __alloc_pages_noprof+0x5f4/0x1888 00488 sp : ffffff80ccd8ed00 00488 x29: ffffff80ccd8ed00 x28: 0000000000000000 x27: dfffffc000000000 00488 x26: 0000000000000010 x25: 0000000000000002 x24: 0000000000000000 00488 x23: 0000000000000000 x22: 1ffffff0199b1dbe x21: ffffff80cc680900 00488 x20: 0000000000000000 x19: ffffff80ccd8eed0 x18: 0000000000000000 00488 x17: ffffff80cc58a010 x16: dfffffc000000000 x15: 1ffffff00474e518 00488 x14: 1ffffff00474e518 x13: 1ffffff00474e518 x12: ffffffb8104701b9 00488 x11: 1ffffff8104701b8 x10: ffffffb8104701b8 x9 : ffffffc08043cde8 00488 x8 : 00000047efb8fe48 x7 : ffffff80ccd8ee20 x6 : 0000000000048000 00488 x5 : 1ffffff810470138 x4 : 0000000000000050 x3 : 1ffffff0199b1d94 00488 x2 : ffffffb0199b1d94 x1 : 0000000000000001 x0 : ffffffc082387448 00488 Call trace: 00488 __alloc_pages_noprof+0x1818/0x1888 00488 new_slab+0x284/0x2f0 00488 ___slab_alloc+0x208/0x8e0 00488 __kmalloc_noprof+0x328/0x340 00488 __bch2_writepage+0x106c/0x1830 00488 write_cache_pages+0xa0/0xe8 due to __GFP_NOFAIL without allowing reclaim Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:42:05 -04:00
Kent Overstreet	d9f49c3106	bcachefs: fix field-spanning write warning attempts to retrofit memory safety onto C are increasingly annoying ------------[ cut here ]------------ memcpy: detected field-spanning write (size 4) of single field "&k.replicas" at fs/bcachefs/replicas.c:454 (size 3) WARNING: CPU: 5 PID: 6525 at fs/bcachefs/replicas.c:454 bch2_replicas_gc2+0x2cb/0x400 [bcachefs] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) Modules linked in: dm_mod tun nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay msr sctp bcachefs lz4hc_compress lz4_compress libcrc32c xor raid6_pq lz4_decompress pps_ldisc pps_core wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic libchacha sit tunnel4 ip_tunnel af_packet bridge stp llc ip6table_nat ip6table_filter ip6_tables xt_MASQUERADE xt_conntrack iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tcp_bbr sch_fq_codel efivarfs nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet r8152 input_leds joydev mii amdgpu mousedev hid_generic usbhid hid ath10k_pci amd_atl edac_mce_amd ath10k_core kvm_amd ath kvm mac80211 bfq crc32_pclmul crc32c_intel polyval_clmulni polyval_generic sha512_ssse3 sha256_ssse3 sha1_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg i2c_algo_bit drm_exec snd_hda_codec r8169 drm_suballoc_helper aesni_intel gf128mul crypto_simd amdxcp realtek mfd_core tpm_crb drm_buddy snd_hwdep mdio_devres libarc4 cryptd tpm_tis wmi_bmof cfg80211 evdev libphy snd_hda_core tpm_tis_core gpu_sched rapl xhci_pci xhci_hcd snd_pcm drm_display_helper snd_timer tpm sp5100_tco rfkill efi_pstore mpt3sas drm_ttm_helper ahci usbcore libaescfb ccp snd ttm 8250 libahci watchdog soundcore raid_class sha1_generic acpi_cpufreq k10temp 8250_base usb_common scsi_transport_sas i2c_piix4 hwmon 
video serial_mctrl_gpio serial_base ecdh_generic wmi rtc_cmos backlight ecc gpio_amdpt rng_core gpio_generic button CPU: 5 UID: 0 PID: 6525 Comm: bcachefs Tainted: G W 6.11.0-rc1-ojab-00058-g224bc118aec9 #6 6d5debde398d2a84851f42ab300dae32c2992027 Tainted: [W]=WARN RIP: 0010:bch2_replicas_gc2+0x2cb/0x400 [bcachefs] Code: c7 c2 60 91 d1 c1 48 89 c6 48 c7 c7 98 91 d1 c1 4c 89 14 24 44 89 5c 24 08 48 89 44 24 20 c6 05 fa 68 04 00 01 e8 05 a3 40 e4 <0f> 0b 4c 8b 14 24 44 8b 5c 24 08 48 8b 44 24 20 e9 55 fe ff ff 8b RSP: 0018:ffffb434c9263d60 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff9a8efa79cc00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb434c9263de0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005 R13: ffff9a8efa73c300 R14: ffff9a8d9e880000 R15: ffff9a8d9e8806f8 FS: 0000000000000000(0000) GS:ffff9a9410c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565423373090 CR3: 0000000164e30000 CR4: 00000000003506f0 Call Trace: <TASK> ? __warn+0x97/0x150 ? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) ? report_bug+0x196/0x1c0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x17/0x80 ? __wake_up_klogd.part.0+0x4c/0x80 ? asm_exc_invalid_op+0x16/0x20 ? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) ? bch2_dev_usage_read+0xa0/0xa0 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_dev_usage_read+0xa0/0xa0: discard_in_flight_remove at /home/ojab/src/bcachefs/fs/bcachefs/alloc_background.c:1712 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:42:05 -04:00
Kent Overstreet	d6d539c9a7	bcachefs: Reallocate table when we're increasing size Fixes: `c2f6e16a67` ("bcachefs: Increase size of cuckoo hash table on too many rehashes") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:41:50 -04:00
Thorsten Blum	7c525dddbe	ksmbd: Replace one-element arrays with flexible-array members Replace the deprecated one-element arrays with flexible-array members in the structs filesystem_attribute_info and filesystem_device_info. There are no binary differences after this conversion. Link: https://github.com/KSPP/linux/issues/79 Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
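Below is a minimal C sketch of this kind of conversion. It uses hypothetical, simplified structs: `name_info_one_elem`, `name_info_flex` and the helpers are illustrative and are not the ksmbd definitions. In both forms the trailing array starts at the same offset, which is consistent with the layout being unchanged. What does change is `sizeof`, which no longer counts a phantom element. Allocation sizes should therefore be computed from `offsetof()` of the array plus the element count.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified structs; not the exact ksmbd definitions. */
struct name_info_one_elem {
	uint32_t attributes;
	uint32_t name_len;
	uint16_t name[1];	/* deprecated one-element "variable" array */
};

struct name_info_flex {
	uint32_t attributes;
	uint32_t name_len;
	uint16_t name[];	/* C99 flexible-array member */
};

/* Bytes needed for a flexible-array struct carrying n name units. */
size_t name_info_flex_size(size_t n)
{
	return offsetof(struct name_info_flex, name) + n * sizeof(uint16_t);
}

/* Allocate and fill a name_info_flex holding n units copied from src. */
struct name_info_flex *name_info_alloc(const uint16_t *src, uint32_t n)
{
	struct name_info_flex *p = malloc(name_info_flex_size(n));

	if (!p)
		return NULL;
	p->attributes = 0;
	p->name_len = n;
	memcpy(p->name, src, n * sizeof(uint16_t));
	return p;
}
```

On common ABIs the one-element struct is 12 bytes and the flexible one is 8, while `name` sits at offset 8 in both.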
Namjae Jeon	76e98a158b	ksmbd: fix race condition between destroy_previous_session() and smb2 operations() If there is a ->PreviousSessionId field in the session setup request, the session of the previous connection should be destroyed. During this, if smb2 operation requests in the previous session are still being processed, a race could occur with ksmbd_destroy_file_table(). This patch sets conn->status to KSMBD_SESS_NEED_RECONNECT to block incoming operations and waits until on-going operations are complete (i.e. idle) before destroying the previous session. Fixes: `c8efcc7861` ("ksmbd: add support for durable handles v1/v2") Cc: stable@vger.kernel.org # v6.6+ Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25040 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
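The block-then-drain ordering described in this commit can be sketched with C11 atomics. This is a hypothetical userspace model and not ksmbd's code: `struct conn_sketch`, `op_begin()`, `op_end()` and `destroy_previous_session_sketch()` are illustrative names, and the busy-wait stands in for a sleeping wait. Once the status is set, new operations are refused, and teardown runs only after the in-flight count reaches zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { CONN_GOOD, CONN_NEED_RECONNECT };

/* Hypothetical connection model: a status word and an in-flight count. */
struct conn_sketch {
	atomic_int status;
	atomic_int in_flight;
};

/* Enter an operation; refused once the connection needs reconnecting. */
bool op_begin(struct conn_sketch *c)
{
	atomic_fetch_add(&c->in_flight, 1);		/* announce first ... */
	if (atomic_load(&c->status) == CONN_NEED_RECONNECT) {
		atomic_fetch_sub(&c->in_flight, 1);	/* ... then back out */
		return false;
	}
	return true;
}

void op_end(struct conn_sketch *c)
{
	atomic_fetch_sub(&c->in_flight, 1);
}

/* Block new operations, wait for running ones, then tear down. */
void destroy_previous_session_sketch(struct conn_sketch *c,
				     void (*destroy)(void))
{
	atomic_store(&c->status, CONN_NEED_RECONNECT);
	while (atomic_load(&c->in_flight) > 0)
		sched_yield();	/* a real implementation would sleep instead */
	destroy();
}
```

Each operation increments the counter before checking the status, and all accesses are sequentially consistent. As a result, either teardown sees the operation in flight, or the operation sees the reconnect status and backs out.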
Namjae Jeon	dfd046d0ce	ksmbd: Use unsafe_memcpy() for ntlm_negotiate rsp buffer is allocated larger than spnego_blob from smb2_allocate_rsp_buf(). Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
Steve French	e4be320eec	smb3: fix broken cached reads when posix locks Mandatory locking is enforced for cached reads, which violates default posix semantics, and also it is enforced inconsistently. This affected recent versions of libreoffice, and can be demonstrated by opening a file twice from the same client, locking it from handle one and trying to read from it from handle two (which fails, returning EACCES). There is already a mount option "forcemandatorylock" (which defaults to off), so with this change only when the user intentionally specifies "forcemandatorylock" on mount will we break posix semantics on read to a locked range (i.e. we will only fail in this case if the user mounts with "forcemandatorylock"). An earlier patch fixed the write path. Fixes: `85160e03a7` ("CIFS: Implement caching mechanism for mandatory brlocks") Cc: stable@vger.kernel.org Cc: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: David Howells <dhowells@redhat.com> Reported-by: abartlet@samba.org Reported-by: Kevin Ottens <kevin.ottens@enioka.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:01:06 -05:00
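For comparison, the default POSIX behaviour this fix restores can be observed on a local filesystem. POSIX record locks taken with `fcntl(F_SETLK)` are advisory and are owned by the process. As a result, a `read()` through a second descriptor of the same process succeeds even while the file is write-locked. The sketch assumes a writable /tmp, and `read_locked_via_second_fd()` is a hypothetical helper. According to the commit message, the equivalent read on an affected cifs mount failed with EACCES before the fix.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write msg to a new temporary file, lock the whole file for writing via
 * one descriptor, then read it back via a second descriptor opened by the
 * same process. Returns the byte count from read(), or -1 on setup error.
 */
ssize_t read_locked_via_second_fd(const char *msg, char *out, size_t outlen)
{
	char path[] = "/tmp/posix-lock-XXXXXX";
	struct flock fl;
	size_t len = strlen(msg);
	ssize_t n = -1;
	int fd1, fd2;

	fd1 = mkstemp(path);
	if (fd1 < 0)
		return -1;
	if (write(fd1, msg, len) != (ssize_t)len)
		goto out;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;	/* exclusive (write) lock ... */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* ... covering the whole file */
	if (fcntl(fd1, F_SETLK, &fl) < 0)
		goto out;

	fd2 = open(path, O_RDONLY);
	if (fd2 >= 0) {
		n = read(fd2, out, outlen);	/* not blocked: lock is advisory */
		close(fd2);
	}
out:
	close(fd1);
	unlink(path);
	return n;
}
```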
Linus Torvalds	57b14823ea	for-6.11-rc3-tag Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: "A few more fixes. We got reports that the shrinker added in 6.10 still causes latency spikes and the fixes don't handle all corner cases. Due to summer holidays we're taking a shortcut to disable it for release builds and will fix it in the near future. - only enable extent map shrinker for DEBUG builds, temporary quick fix to avoid latency spikes for regular builds - update target inode's ctime on unlink, mandated by POSIX - properly take lock to read/update block group's zoned variables - add counted_by() annotations" * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: only enable extent map shrinker for DEBUG builds btrfs: zoned: properly take lock to read/update block group's zoned variables btrfs: tree-checker: add dev extent item checks btrfs: update target inode's ctime on unlink btrfs: send: annotate struct name_cache_entry with __counted_by()	2024-08-18 08:50:36 -07:00
Jann Horn	3c0da3d163	fuse: Initialize beyond-EOF page contents before setting uptodate fuse_notify_store(), unlike fuse_do_readpage(), does not enable page zeroing (because it can be used to change partial page contents). So fuse_notify_store() must be more careful to fully initialize page contents (including parts of the page that are beyond end-of-file) before marking the page uptodate. The current code can leave beyond-EOF page contents uninitialized, which makes these uninitialized page contents visible to userspace via mmap(). This is an information leak, but only affects systems which do not enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the corresponding kernel command line parameter). Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574 Cc: stable@kernel.org Fixes: `a1d75f2582` ("fuse: add store request") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-08-18 08:45:39 -07:00
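Here is a hypothetical userspace model of the rule. `struct page_sketch` and `notify_store_sketch()` are illustrative and are not fuse code. The 4096-byte page and the condition "starts at offset 0 and fills the page or reaches EOF" are assumptions modelled on the description. Bytes past the stored data are zeroed before the page is flagged uptodate, so stale memory contents never become visible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096u

/* Hypothetical page: raw bytes plus an uptodate flag. */
struct page_sketch {
	unsigned char data[PAGE_SIZE_SKETCH];
	bool uptodate;
};

/*
 * Copy n bytes of src into the page at offset off (off + n must not exceed
 * the page). If the store starts at 0 and either fills the page or reaches
 * end-of-file, the page may be marked uptodate, but only after the tail
 * [n, PAGE_SIZE_SKETCH) has been zeroed.
 */
void notify_store_sketch(struct page_sketch *pg, size_t off,
			 const void *src, size_t n, bool reaches_eof)
{
	memcpy(pg->data + off, src, n);
	if (!pg->uptodate && off == 0 &&
	    (n == PAGE_SIZE_SKETCH || reaches_eof)) {
		if (n < PAGE_SIZE_SKETCH)	/* zero the beyond-EOF tail */
			memset(pg->data + n, 0, PAGE_SIZE_SKETCH - n);
		pg->uptodate = true;
	}
}
```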
Linus Torvalds	e0fac5fc8b	three client fixes, including two for stable Merge tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - fix for clang warning - additional null check - fix for cached write with posix locks - flexible structure fix * tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: smb2pdu.h: Use static_assert() to check struct sizes smb3: fix lock breakage for cached writes smb/client: avoid possible NULL dereference in cifs_free_subrequest()	2024-08-17 16:31:12 -07:00
Linus Torvalds	d09840f8b3	Bug fixes for 6.11-rc4: * Check for presence of only 'attr' feature before scrubbing an inode's attribute fork. * Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for long (i.e. 50ms) sleep durations to prevent high load averages. * Do not allow users to change the realtime flag of a file unless the datadev and rtdev both support fsdax access modes. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> Merge tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Chandan Babu: - Check for presence of only 'attr' feature before scrubbing an inode's attribute fork. - Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for long (i.e. 50ms) sleep durations to prevent high load averages. - Do not allow users to change the realtime flag of a file unless the datadev and rtdev both support fsdax access modes. * tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set xfs: revert AIL TASK_KILLABLE threshold xfs: attr forks require attr, not attr2	2024-08-17 09:51:28 -07:00
Linus Torvalds	b718175853	bcachefs fixes for 6.11-rc4 - New on disk format version, bcachefs_metadata_version_disk_accounting_inum This adds one more disk accounting counter, which counts disk usage and number of extents per inode number. This lets us track fragmentation, for implementing defragmentation later, and it also counts disk usage per inode in all snapshots, which will be a useful thing to expose to users. - One performance issue we've observed is threads spinning when they should be waiting for dirty keys in the key cache to be flushed by journal reclaim, so we now have hysteresis for the waiting thread, as well as improving the tracepoint and a new time_stat, for tracking time blocked waiting on key cache flushing. And, various assorted smaller fixes. Merge tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: - New on disk format version, bcachefs_metadata_version_disk_accounting_inum This adds one more disk accounting counter, which counts disk usage and number of extents per inode number. 
This lets us track fragmentation, for implementing defragmentation later, and it also counts disk usage per inode in all snapshots, which will be a useful thing to expose to users. - One performance issue we've observed is threads spinning when they should be waiting for dirty keys in the key cache to be flushed by journal reclaim, so we now have hysteresis for the waiting thread, as well as improving the tracepoint and a new time_stat, for tracking time blocked waiting on key cache flushing. ... and various assorted smaller fixes. * tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs: bcachefs: Fix locking in __bch2_trans_mark_dev_sb() bcachefs: fix incorrect i_state usage bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru bcachefs: Fix forgetting to pass trans to fsck_err() bcachefs: Increase size of cuckoo hash table on too many rehashes bcachefs: bcachefs_metadata_version_disk_accounting_inum bcachefs: Kill __bch2_accounting_mem_mod() bcachefs: Make bkey_fsck_err() a wrapper around fsck_err() bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in bcachefs: Add a time_stat for blocked on key cache flush bcachefs: Improve trans_blocked_journal_reclaim tracepoint bcachefs: Add hysteresis to waiting on btree key cache flush lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc() bcachefs: Convert for_each_btree_node() to lockrestart_do() bcachefs: Add missing downgrade table entry bcachefs: disk accounting: ignore unknown types bcachefs: bch2_accounting_invalid() fixup bcachefs: Fix bch2_trigger_alloc when upgrading from old versions bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()	2024-08-17 09:46:10 -07:00
Kent Overstreet	0e49d3ff12	bcachefs: Fix locking in __bch2_trans_mark_dev_sb() We run this in full RW mode now, so we have to guard against the superblock buffer being reallocated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 20:45:15 -04:00
Qu Wenruo	534f7eff92	btrfs: only enable extent map shrinker for DEBUG builds Although there are several patches improving the extent map shrinker, there are still reports of too frequent shrinker behavior, taking too much CPU for the kswapd process. So let's only enable the extent map shrinker for DEBUG builds for now, until we get a more comprehensive understanding and a better solution. Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/ Fixes: `956a17d9d0` ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-16 21:22:39 +02:00
Kent Overstreet	99c87fe0f5	bcachefs: fix incorrect i_state usage Reported-by: syzbot+95e40eae71609e40d851@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	9482f3b053	bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru Reported-by: syzbot+510b0b28f8e6de64d307@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	075cabf324	bcachefs: Fix forgetting to pass trans to fsck_err() Reported-by: syzbot+e3938cd6d761b78750e6@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	c2f6e16a67	bcachefs: Increase size of cuckoo hash table on too many rehashes Also, improve the calculation of the new table size, so that it can shrink when needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Gustavo A. R. Silva	5b4f3af39b	smb: smb2pdu.h: Use static_assert() to check struct sizes Commit `9f9bef9bc5` ("smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end warnings") introduced tagged `struct create_context_hdr`. We want to ensure that when new members need to be added to the flexible structure, they are always included within this tagged struct. So, we use `static_assert()` to ensure that the memory layout for both the flexible structure and the tagged struct is the same after any changes. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 16:06:55 -05:00
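The following reduced C11 sketch shows the check. The struct names and fields are hypothetical and do not match the real `smb2pdu.h` layout. The kernel groups the header members with a tagged struct group (`struct_group_tagged()`) so they remain addressable both directly and through the tagged struct; this sketch simplifies that to a plain embedded struct. If a member were ever added after the header group, the flexible array's offset would stop matching the header size, and the build would fail.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size header of a variable-length context. */
struct ctx_hdr_sketch {
	uint32_t next;
	uint16_t name_offset;
	uint16_t name_length;
	uint16_t reserved;
	uint16_t data_offset;
	uint32_t data_length;
};

/* Header followed directly by the variable-length payload. */
struct ctx_sketch {
	struct ctx_hdr_sketch hdr;
	uint8_t buffer[];
};

/* Fails to compile if any member ends up between hdr and buffer. */
static_assert(offsetof(struct ctx_sketch, buffer) ==
	      sizeof(struct ctx_hdr_sketch),
	      "struct member likely outside the header group");
```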
Steve French	836bb3268d	smb3: fix lock breakage for cached writes Mandatory locking is enforced for cached writes, which violates default posix semantics, and also it is enforced inconsistently. This apparently breaks recent versions of libreoffice, but can also be demonstrated by opening a file twice from the same client, locking it from handle one and writing to it from handle two (which fails, returning EACCES). Since there was already a mount option "forcemandatorylock" (which defaults to off), with this change only when the user intentionally specifies "forcemandatorylock" on mount will we break posix semantics on write to a locked range (i.e. we will only fail the write in this case if the user mounts with "forcemandatorylock"). Fixes: `85160e03a7` ("CIFS: Implement caching mechanism for mandatory brlocks") Cc: stable@vger.kernel.org Cc: Pavel Shilovsky <piastryyy@gmail.com> Reported-by: abartlet@samba.org Reported-by: Kevin Ottens <kevin.ottens@enioka.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 16:04:47 -05:00
Su Hui	74c2ab6d65	smb/client: avoid possible NULL dereference in cifs_free_subrequest() Clang static checker (scan-build) warning: cifsglob.h:line 890, column 3 Access to field 'ops' results in a dereference of a null pointer. Commit `519be98971` ("cifs: Add a tracepoint to track credits involved in R/W requests") adds a check for 'rdata->server', which leads clang to emit this warning about a NULL dereference. If 'rdata->credits.value != 0 && rdata->server == NULL', add_credits_and_wake_if() will call rdata->server->ops->add_credits(), which dereferences a NULL pointer. Add a check for 'rdata->server' to avoid the NULL dereference. Cc: stable@vger.kernel.org Fixes: `69c3c023af` ("cifs: Implement netfslib hooks") Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 15:32:30 -05:00
Naohiro Aota	e30729d4bd	btrfs: zoned: properly take lock to read/update block group's zoned variables __btrfs_add_free_space_zoned() references and modifies bg's alloc_offset, ro, and zone_unusable, but without taking the lock. It is mostly safe because they monotonically increase (at least for now) and this function is mostly called by a transaction commit, which is serialized by itself. Still, taking the lock is a safer and correct option and I'm going to add a change to reset zone_unusable while a block group is still alive. So, add locking around the operations. Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:56 +02:00
Qu Wenruo	008e2512dc	btrfs: tree-checker: add dev extent item checks [REPORT] There is a corruption report that btrfs refused to mount a fs that has overlapping dev extents: BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272 BTRFS error (device sdc): failed to verify dev extents against chunks: -117 BTRFS error (device sdc): open_ctree failed [CAUSE] The direct cause is obvious: there is a bad dev extent item with an incorrect length. btrfs check reports two overlapping extents, and the second one gives a clue to the cause: ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272 ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144 ERROR: errors found in extent allocation tree or chunk allocation The second one looks like a bitflip happened during new chunk allocation: hex(2257707008000) = 0x20da9d30000 hex(2257707270144) = 0x20da9d70000 diff = 0x00000040000 So it looks like a bitflip happened during new dev extent allocation, resulting in the second overlap. Currently we only do the dev-extent verification at mount time, but if the corruption is caused by a memory bitflip, we really want to catch it before writing the corruption to the storage. Furthermore, the dev extent items have the following key definition: (<device id> DEV_EXTENT <physical offset>) Thus we cannot just rely on the generic key order check to make sure there is no overlap. 
[ENHANCEMENT] Introduce dedicated dev extent checks, including: - Fixed member checks * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3) * chunk_objectid should always be BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256) - Alignment checks * chunk_offset should be aligned to sectorsize * length should be aligned to sectorsize * key.offset should be aligned to sectorsize - Overlap checks If the previous key is also a dev-extent item, with the same device id, make sure we do not overlap with the previous dev extent. Reported: Stefan N <stefannnau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:52 +02:00
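The checks listed above can be sketched in C against a hypothetical, flattened view of a dev extent. `struct dev_extent_sketch`, `dev_extent_ok()` and `single_bit_flip()` are illustrative and are not the tree-checker code. The sketch also reproduces the arithmetic from the report: 0x20da9d30000 and 0x20da9d70000 differ only in bit 18 (0x40000), which is consistent with a single bit flip.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_TREE_OBJECTID_SKETCH       3ULL
#define FIRST_CHUNK_TREE_OBJECTID_SKETCH 256ULL

/* Hypothetical, flattened view of a dev extent item plus its key. */
struct dev_extent_sketch {
	uint64_t devid;		/* key objectid */
	uint64_t physical;	/* key offset */
	uint64_t chunk_tree;
	uint64_t chunk_objectid;
	uint64_t chunk_offset;
	uint64_t length;
};

/* sectorsize is assumed to be a power of two. */
static bool aligned(uint64_t v, uint32_t sectorsize)
{
	return (v & (sectorsize - 1)) == 0;
}

/* prev is NULL when the previous item in the leaf is not a dev extent. */
bool dev_extent_ok(const struct dev_extent_sketch *prev,
		   const struct dev_extent_sketch *cur, uint32_t sectorsize)
{
	if (cur->chunk_tree != CHUNK_TREE_OBJECTID_SKETCH ||
	    cur->chunk_objectid != FIRST_CHUNK_TREE_OBJECTID_SKETCH)
		return false;				/* fixed members */
	if (!aligned(cur->chunk_offset, sectorsize) ||
	    !aligned(cur->length, sectorsize) ||
	    !aligned(cur->physical, sectorsize))
		return false;				/* alignment */
	if (prev && prev->devid == cur->devid &&
	    prev->physical + prev->length > cur->physical)
		return false;				/* overlap */
	return true;
}

/* True if a and b differ in exactly one bit. */
bool single_bit_flip(uint64_t a, uint64_t b)
{
	uint64_t x = a ^ b;

	return x != 0 && (x & (x - 1)) == 0;
}
```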
Jeff Layton	3bc2ac2f8f	btrfs: update target inode's ctime on unlink Unlink changes the link count on the target inode. POSIX mandates that the ctime must also change when this occurs. According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html: "Upon successful completion, unlink() shall mark for update the last data modification and last file status change timestamps of the parent directory. Also, if the file's link count is not 0, the last file status change timestamp of the file shall be marked for update." Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ add link to the opengroup docs ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:44 +02:00
Thorsten Blum	c0247d289e	btrfs: send: annotate struct name_cache_entry with __counted_by() Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:32 +02:00
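Here is a userspace sketch of the annotation. The kernel's `__counted_by()` macro wraps the compiler's `counted_by` attribute and expands to nothing on compilers that lack it. The hypothetical `counted_by_sketch()` macro below mimics that behaviour, and `struct name_entry_sketch` is a reduced, illustrative stand-in for `struct name_cache_entry`. The count field is set before the array is written, so bounds-checking instrumentation sees the correct size.

```c
#include <stdlib.h>
#include <string.h>

/* Use the attribute where supported, otherwise expand to nothing. */
#if defined(__has_attribute)
#  if __has_attribute(counted_by)
#    define counted_by_sketch(m) __attribute__((counted_by(m)))
#  endif
#endif
#ifndef counted_by_sketch
#  define counted_by_sketch(m)
#endif

/* Hypothetical, reduced name cache entry. */
struct name_entry_sketch {
	unsigned long ino;
	int name_len;
	char name[] counted_by_sketch(name_len);	/* name_len elements */
};

/* Allocate an entry holding exactly strlen(s) bytes (no terminator). */
struct name_entry_sketch *name_entry_new(unsigned long ino, const char *s)
{
	size_t len = strlen(s);
	struct name_entry_sketch *e = malloc(sizeof(*e) + len);

	if (!e)
		return NULL;
	e->ino = ino;
	e->name_len = (int)len;		/* set the count before writing name[] */
	memcpy(e->name, s, len);
	return e;
}
```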
Linus Torvalds	1fb918967b	for-6.11-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAma9KksACgkQxWXV+ddt WDvMYw/8Ds6A3IcMd1AByPDryhHZnOpqU/YQS/HhneisTg08MYHD0TMZLX02GXw0 vzqUyBHQ9yOYEt18SVtX67Dzapy0SWZWTK8Er/tJCn+14SLWVMdRiJCO2Y3Rdm8T J3S/b610iEVl5z6S6SpFD+zc56liCyVHfpK6obwSFBCzAyN6vm2p0ls5vq+hGsjb s/dOPJfOMnFTOFXVIumJ5KJRCubGuwG+PhZO9engwxiFIr1O4xxedzhKocXNSiiE +jt96gnKwW/K/Wh59YFGLbKk7h/jsIzM2CqE+JrEZlwIN+oNODFvP+Z37DbSC6jN 0x7G8gqY8vXgxOUk1rub+IqP+/wIjMmCTzkxpO2uy80hB0h3YzqzZnUNHk6DlIu/ zOhgcMnp5SMuvtJIXBpP2HRzzG/UbxfrjaPKDmUvwKvCUydrw0xL+XWdDMhi3bSn NPDW/Ixl1XGMS131YYla1v/KXTKJwZ2hK045svx7A8Aok0WaXAcYfwValQ9GMO0j gk89DRN0tMikBebaVD3aakE1FqyC/3dEZ80D3LSs7cgDGQA27wjdESHcSTtY373/ +fCBXDH/N9ubanZwsu+gUEzNp3DukEoaw2r73IicxcbsiskDpEqvddcKQmrZQ1xW 03UVziw1LpzkNSsDMZrlBwIoo5SnqbEbsEkTCMtjivJYOzhfyYk= =din2 -----END PGP SIGNATURE----- Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - extend tree-checker verification of directory item type - fix regression in page/folio and extent state tracking in xarray, the dirty status can get out of sync and can cause problems e.g. a hang - in send, detect last extent and allow to clone it instead of sending it as write, reduces amount of data transferred in the stream - fix checking extent references when cleaning deleted subvolumes - fix one more case in the extent map shrinker, let it run only in the kswapd context so it does not cause latency spikes during other operations * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix invalid mapping of extent xarray state btrfs: send: allow cloning non-aligned extent if it ends at i_size btrfs: only run the extent map shrinker from kswapd tasks btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type btrfs: check delayed refs when we're checking if a ref exists	2024-08-14 17:56:15 -07:00
Linus Torvalds	4ac0f08f44	vfs-6.11-rc4.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg= =HZBL -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Fix the name of file lease slab cache. When file leases were split out of file locks the name of the file lock slab cache was used for the file leases slab cache as well. - Fix a typo in the take_fd() helper comment. - Fix infinite directory iteration for stable offsets in tmpfs. - When the icache is pruned all reclaimable inodes are marked with I_FREEING and other processes that try to look up such inodes will block. But some filesystems like ext4 can trigger lookups in their inode evict callback causing deadlocks. Ext4 does such lookups if the ea_inode feature is used whereby a separate inode may be used to store xattrs. Introduce I_LRU_ISOLATING which pins the inode while its pages are reclaimed. This avoids inode deletion during inode_lru_isolate() avoiding the deadlock and evict is made to wait until I_LRU_ISOLATING is done. netfs: - Fault in smaller chunks for non-large folio mappings for filesystems that haven't been converted to large folios yet. - Fix the CONFIG_NETFS_DEBUG config option. The config option was renamed a short while ago and that introduced two minor issues. First, it depended on CONFIG_NETFS whereas it wants to depend on CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter does. Second, the documentation for the config option wasn't fixed up. - Revert the removal of the PG_private_2 writeback flag as ceph is using it and fix how that flag is handled in netfs. - Fix DIO reads on 9p.
A program watching a file on a 9p mount wouldn't see any changes in the size of the file being exported by the server if the file was changed directly in the source filesystem. Fix this by attempting to read the full size specified when a DIO read is requested. - Fix a NULL pointer dereference bug due to a data race where a cachefiles cookie was retired even though it was still in use. Check the cookie's n_accesses counter before discarding it. nsfs: - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as the kernel is writing to userspace. pidfs: - Prevent the creation of pidfds for kthreads until we have a use-case for it and we know the semantics we want. Userspace is also confused about why it can get pidfds for kthreads. squashfs: - Fix an uninitialized value bug reported by KMSAN caused by a corrupted symbolic link size read from disk. Check that the symbolic link size is not larger than expected" * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Squashfs: sanity check symbolic link size 9p: Fix DIO read through netfs vfs: Don't evict inode under the inode lru traversing context netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" file: fix typo in take_fd() comment pidfd: prevent creation of pidfds for kthreads netfs: clean up after renaming FSCACHE_DEBUG config libfs: fix infinite directory reads for offset dir nsfs: fix ioctl declaration fs/netfs/fscache_cookie: add missing "n_accesses" check filelock: fix name of file_lease slab cache netfs: Fault in smaller chunks for non-large folio mappings	2024-08-14 09:06:28 -07:00
Darrick J. Wong	8d16762047	xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set If a file has the S_DAX flag (aka fsdax access mode) set, we cannot allow users to change the realtime flag unless the datadev and rtdev both support fsdax access modes. Even if there are no extents allocated to the file, the setattr thread could be racing with another thread that has already started down the write code paths. Fixes: `ba23cba9b3` ("fs: allow per-device dax status checking for filesystems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:20:24 +05:30
Darrick J. Wong	04d6dbb553	xfs: revert AIL TASK_KILLABLE threshold In commit `9adf40249e`, we changed the behavior of the AIL thread to set its own task state to KILLABLE whenever the timeout value is nonzero. Unfortunately, this missed the fact that xfsaild_push will return 50ms (aka a longish sleep) when we reach the push target or the AIL becomes empty, so xfsaild goes to sleep for a long period of time in uninterruptible D state. This results in artificially high load averages because KILLABLE processes are UNINTERRUPTIBLE, which contribute to the load average even though the AIL is asleep waiting for someone to interrupt it. It's not blocked on IOs or anything, but people scrape ps for processes that look like they're stuck in D state, so restore the previous threshold. Fixes: `9adf40249e` ("xfs: AIL doesn't need manual pushing") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:34 +05:30
Darrick J. Wong	73c34b0b85	xfs: attr forks require attr, not attr2 It turns out that I misunderstood the difference between the attr and attr2 feature bits. "attr" means that at some point an attr fork was created somewhere in the filesystem. "attr2" means that inodes have variable-sized forks, but says nothing about whether or not there actually /are/ attr forks in the system. If we have an attr fork, we only need to check that attr is set. Fixes: `99d9d8d05d` ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:14 +05:30
Kent Overstreet	58474f76a7	bcachefs: bcachefs_metadata_version_disk_accounting_inum This adds another disk accounting counter to track usage per inode number (any snapshot ID). This will be used for a couple of things: - It'll give us a way to tell the user how much space a given file is consuming in all snapshots; i.e. how much extra space it's consuming due to snapshot versioning. - It counts number of extents and total size of extents (both in btree keyspace sectors and actual disk usage), meaning it gives us average extent size: that is, it'll let us cheaply find fragmented files that should be defragmented. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
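The "average extent size" the commit mentions is just total extent size divided by extent count. A minimal sketch follows; `struct inum_usage_sketch`, `looks_fragmented()` and its threshold parameter are hypothetical, and the real counter layout in bcachefs differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode counters modelled on the commit description. */
struct inum_usage_sketch {
	uint64_t nr_extents;
	uint64_t sectors;   /* total size of the extents, in 512-byte sectors */
};

static uint64_t avg_extent_sectors(const struct inum_usage_sketch *u)
{
	/* Avoid dividing by zero for files with no extents. */
	return u->nr_extents ? u->sectors / u->nr_extents : 0;
}

/* Hypothetical policy: many small extents suggest a defrag candidate. */
static bool looks_fragmented(const struct inum_usage_sketch *u,
			     uint64_t min_avg_sectors)
{
	return u->nr_extents > 1 && avg_extent_sectors(u) < min_avg_sectors;
}
```

For example, 100 extents covering 800 sectors average 8 sectors (4 KiB) each, which such a policy would flag against a 256-sector threshold.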
Kent Overstreet	5132b99bb6	bcachefs: Kill __bch2_accounting_mem_mod() The next patch will be adding a disk accounting counter type which is not kept in the in-memory eytzinger tree. As prep, fold __bch2_accounting_mem_mod() into bch2_accounting_mem_mod_locked() so that we can check for that counter type and bail out without calling bpos_to_disk_accounting_pos() twice. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	d97de0d017	bcachefs: Make bkey_fsck_err() a wrapper around fsck_err() bkey_fsck_err() was added as an interface that looks like fsck_err(), but previously all it did was ensure that the appropriate error counter was incremented in the superblock. This is a cleanup and bugfix patch that converts it to a wrapper around fsck_err(). This is needed to fix an issue with the upgrade path to disk_accounting_v3, where the "silent fix" error list now includes bkey_fsck errors; fsck_err() handles this in a unified way, and since we need to change printing of bkey fsck errors from the caller to the inner bkey_fsck_err() calls, this ends up being a pretty big change. Also, rename .invalid() methods to .validate(), for clarity, while we're changing the function signature anyway (to drop the printbuf argument). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	c99471024f	bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	06a8693b89	bcachefs: Add a time_stat for blocked on key cache flush Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	790666c8ac	bcachefs: Improve trans_blocked_journal_reclaim tracepoint Include information about the state of the btree key cache. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
Kent Overstreet	7254555c44	bcachefs: Add hysteresis to waiting on btree key cache flush This helps ensure key cache reclaim isn't contending with threads waiting for the key cache to be helped, and fixes a severe performance bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
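Hysteresis here means using two thresholds: writers start waiting when the number of dirty keys reaches a high watermark, and keep waiting until it falls to a lower one, so they do not flap between waiting and running around a single limit. The sketch below shows that general technique; the struct, the watermark values and `kc_should_wait()` are hypothetical and not the bcachefs implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical watermarks; the real bcachefs thresholds differ. */
struct kc_throttle_sketch {
	size_t high;      /* start blocking writers at or above this */
	size_t low;       /* stop blocking only once at or below this */
	bool blocking;    /* current state */
};

/* Returns true if a writer should wait for key cache flushing. */
static bool kc_should_wait(struct kc_throttle_sketch *t, size_t nr_dirty)
{
	if (t->blocking) {
		/* Already throttling: release only at the low watermark. */
		if (nr_dirty <= t->low)
			t->blocking = false;
	} else if (nr_dirty >= t->high) {
		t->blocking = true;
	}
	return t->blocking;
}
```

Between the two watermarks the answer depends on the previous state, which gives reclaim room to make progress before waiters are released.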
Kent Overstreet	968feb854a	bcachefs: Convert for_each_btree_node() to lockrestart_do() for_each_btree_node() now works similarly to for_each_btree_key(), where the loop body is passed as an argument and executed via lockrestart_do(). This now calls trans_begin() on every loop iteration, which fixes an SRCU warning in backpointers fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
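The general shape of a "restart" wrapper is: begin a fresh transaction, run the body, and repeat if the body reports a transaction restart. The sketch below uses a GNU C statement expression (GCC/Clang). `lockrestart_do_sketch`, `trans_begin_sketch` and the restart error code are hypothetical stand-ins; the real bcachefs macros check for restarts with their own error-matching helpers.

```c
#include <assert.h>

#define ERR_TRANS_RESTART_SKETCH 1000  /* hypothetical restart error code */

struct trans_sketch {
	int begins;   /* how many times trans_begin_sketch() ran */
};

static void trans_begin_sketch(struct trans_sketch *trans)
{
	/* In bcachefs this resets the transaction (drops locks, etc.). */
	trans->begins++;
}

/*
 * Evaluate @expr in a freshly begun transaction, repeating while it
 * returns the restart code. Yields the final return value.
 */
#define lockrestart_do_sketch(trans, expr)			\
({								\
	int _ret;						\
	do {							\
		trans_begin_sketch(trans);			\
		_ret = (expr);					\
	} while (_ret == -ERR_TRANS_RESTART_SKETCH);		\
	_ret;							\
})
```

Because `trans_begin_sketch()` runs before every attempt, each iteration starts from a clean state, which mirrors the "calls trans_begin() on every loop iteration" behaviour described above.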
Kent Overstreet	48d6cc1b48	bcachefs: Add missing downgrade table entry Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	486d920735	bcachefs: disk accounting: ignore unknown types forward compat fix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	d9e615762b	bcachefs: bch2_accounting_invalid() fixup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	bd864bc2d9	bcachefs: Fix bch2_trigger_alloc when upgrading from old versions bch2_trigger_alloc was assuming that the new key would always be newly created and thus always an alloc_v4 key, but that is not the case when it is called from btree_gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	a24e6e7146	bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached() bch2_btree_path_traverse_cached() was previously checking if it could just relock the path, which is a common idiom in path traversal. However, it was using btree_node_relock(), not btree_path_relock(); btree_path_relock() only succeeds if the path was in state BTREE_ITER_NEED_RELOCK. If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is needed; this led to a null ptr deref in bch2_btree_path_traverse_cached(). And the short circuit check here isn't needed, since it was already done in the main bch2_btree_path_traverse_one(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Linus Torvalds	6b0f8db921	execve fixes for v6.11-rc4 - binfmt_flat: Fix corruption when not offsetting data start - exec: Fix ToCToU between perm check and set-uid/gid usage -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZrvBmwAKCRA2KwveOeQk u9Z2AQDgtypF4Kficiwn9BwZL5OxCxr9XnFuCYUue8Ufzm2WdQEAwhl1tcbs+Auf VAqpr6gZTRlpBjbl55LHeyMdxIbwhg4= =L28M -----END PGP SIGNATURE----- Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - binfmt_flat: Fix corruption when not offsetting data start - exec: Fix ToCToU between perm check and set-uid/gid usage * tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Fix ToCToU between perm check and set-uid/gid usage binfmt_flat: Fix corruption when not offsetting data start	2024-08-13 16:10:32 -07:00
Kees Cook	f50733b45d	exec: Fix ToCToU between perm check and set-uid/gid usage When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a file pointer is passed back. Much later in the execve() code path, the file metadata (specifically mode, uid, and gid) is used to determine if/how to set the uid and gid. However, those values may have changed since the permissions check, meaning the execution may gain unintended privileges. For example, if a file could change permissions from executable and not set-id: ---------x 1 root root 16048 Aug 7 13:16 target to set-id and non-executable: ---S------ 1 root root 16048 Aug 7 13:16 target it is possible to gain root privileges when execution should have been disallowed. While this race condition is rare in real-world scenarios, it has been observed (and proven exploitable) when package managers are updating the setuid bits of installed programs. Such files start out world-executable but are then adjusted to be group-exec with a set-uid bit. For example, "chmod o-x,u+s target" makes "target" executable only by uid "root" and gid "cdrom", while also becoming setuid-root: -rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target becomes: -rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target But racing the chmod means users without group "cdrom" membership can get the permission to execute "target" just before the chmod, and when the chmod finishes, the exec reaches bprm_fill_uid(), and performs the setuid to root, violating the expressed authorization of "only cdrom group members can setuid to root". Re-check that we still have execute permissions in case the metadata has changed. It would be better to keep a copy from the perm-check time, but until we can do that refactoring, the least-bad option is to do a full inode_permission() call (under inode lock). It is understood that this is safe against deadlocks, but hardly optimal.
Reported-by: Marco Vanotti <mvanotti@google.com> Tested-by: Marco Vanotti <mvanotti@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org>	2024-08-13 13:24:29 -07:00
Linus Torvalds	6b4aa469f0	2 smb3 server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma6liAACgkQiiy9cAdy T1Eh4wwAuTQDHjehfvCDspMn6lG8IXAtb3oio2cntkII3warxxQ/dRiIyG1JcG5Z 38e+dokvRkaUF6ntrmudUbHOerw+NRl2ozYF5pQv0+ECyJLXHDqVGnuxNvNPAsD7 RtHfFf50PdgzGKmXjmUg0GbXMgA6eLSHe9r+wwDkqmIwZHMxaJ2nGuwVjHoO/+uJ oynxpYHIUROa2DeQiQKZAz/KHwpdSAGR4+KJRutvVCjInlb9bmSGp//BG34W4vva nyQIpnqskmlFg4elV/ktOgCp1rbHc4lgQwsWoCDYrNOyKX83HEIRRWHUEIi7fi+Y PBcFgTblrnuhYbUL4Z+rSmHB3YuUkvMLeKkSWSJm2M2qAZzoZWTUNLpzOcAOAcIF uhkt1+GUuLsZu3ZoDbolMZl477DtBsbBOKsM0DZ5IMji3MRu8GpvhmOfGOAdVRpT msTWfUoWvrc2CM09v3HBtnsAfjDXb/4ebztZxGTGVFk0uYJA1Zg655bHbYbw3tWr jXKVa805 =Q9Qj -----END PGP SIGNATURE----- Merge tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "Two smb3 server fixes for access denied problem on share path checks" * tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: override fsids for smb2_query_info() ksmbd: override fsids for share path check	2024-08-13 09:03:23 -07:00
Naohiro Aota	6252690f7e	btrfs: fix invalid mapping of extent xarray state In __extent_writepage_io(), we call btrfs_set_range_writeback() -> folio_start_writeback(), which clears the PAGECACHE_TAG_DIRTY mark from the mapping xarray if the folio is not dirty. This worked fine before commit `97713b1a2c` ("btrfs: do not clear page dirty inside extent_write_locked_range()"). After the commit, however, the folio is still dirty at this point, so the mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io() calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That results in the page being unlocked with a "strange" state. The page is not PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY. This strange state appears to cause a hang, with the call trace below, when running fstests generic/091 on a null_blk device. The task is waiting for a folio lock. While I haven't established the exact relation between this hang and the strange state, fixing the state also fixes the hang, and the state is worth fixing anyway. This commit reorders btrfs_folio_clear_dirty() and btrfs_set_range_writeback() in __extent_writepage_io(), so that the PAGECACHE_TAG_DIRTY tag is properly removed from the xarray. [464.274] task:fsx state:D stack:0 pid:3034 tgid:3034 ppid:2853 flags:0x00004002 [464.286] Call Trace: [464.291] <TASK> [464.295] __schedule+0x10ed/0x6260 [464.301] ? __pfx___blk_flush_plug+0x10/0x10 [464.308] ? __submit_bio+0x37c/0x450 [464.314] ? __pfx___schedule+0x10/0x10 [464.321] ? lock_release+0x567/0x790 [464.327] ? __pfx_lock_acquire+0x10/0x10 [464.334] ? __pfx_lock_release+0x10/0x10 [464.340] ? __pfx_lock_acquire+0x10/0x10 [464.347] ? __pfx_lock_release+0x10/0x10 [464.353] ? do_raw_spin_lock+0x12e/0x270 [464.360] schedule+0xdf/0x3b0 [464.365] io_schedule+0x8f/0xf0 [464.371] folio_wait_bit_common+0x2ca/0x6d0 [464.378] ? folio_wait_bit_common+0x1cc/0x6d0 [464.385] ? __pfx_folio_wait_bit_common+0x10/0x10 [464.392] ? __pfx_filemap_get_folios_tag+0x10/0x10 [464.400] ?
__pfx_wake_page_function+0x10/0x10 [464.407] ? __pfx___might_resched+0x10/0x10 [464.414] ? do_raw_spin_unlock+0x58/0x1f0 [464.420] extent_write_cache_pages+0xe49/0x1620 [btrfs] [464.428] ? lock_acquire+0x435/0x500 [464.435] ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs] [464.443] ? btrfs_do_write_iter+0x493/0x640 [btrfs] [464.451] ? orc_find.part.0+0x1d4/0x380 [464.457] ? __pfx_lock_release+0x10/0x10 [464.464] ? __pfx_lock_release+0x10/0x10 [464.471] ? btrfs_do_write_iter+0x493/0x640 [btrfs] [464.478] btrfs_writepages+0x1cc/0x460 [btrfs] [464.485] ? __pfx_btrfs_writepages+0x10/0x10 [btrfs] [464.493] ? is_bpf_text_address+0x6e/0x100 [464.500] ? kernel_text_address+0x145/0x160 [464.507] ? unwind_get_return_address+0x5e/0xa0 [464.514] ? arch_stack_walk+0xac/0x100 [464.521] do_writepages+0x176/0x780 [464.527] ? lock_release+0x567/0x790 [464.533] ? __pfx_do_writepages+0x10/0x10 [464.540] ? __pfx_lock_acquire+0x10/0x10 [464.546] ? __pfx_stack_trace_save+0x10/0x10 [464.553] ? do_raw_spin_lock+0x12e/0x270 [464.560] ? do_raw_spin_unlock+0x58/0x1f0 [464.566] ? _raw_spin_unlock+0x23/0x40 [464.573] ? wbc_attach_and_unlock_inode+0x3da/0x7d0 [464.580] filemap_fdatawrite_wbc+0x113/0x180 [464.587] ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs] [464.596] __filemap_fdatawrite_range+0xaf/0xf0 [464.603] ? __pfx___filemap_fdatawrite_range+0x10/0x10 [464.611] ? trace_irq_enable.constprop.0+0xce/0x110 [464.618] ? kasan_quarantine_put+0xd7/0x1e0 [464.625] btrfs_start_ordered_extent+0x46f/0x570 [btrfs] [464.633] ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs] [464.642] ? __clear_extent_bit+0x2c0/0x9d0 [btrfs] [464.650] btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs] [464.659] ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs] [464.669] btrfs_read_folio+0x12a/0x1d0 [btrfs] [464.676] ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs] [464.684] ? __pfx_filemap_add_folio+0x10/0x10 [464.691] ? __pfx___might_resched+0x10/0x10 [464.698] ? 
__filemap_get_folio+0x1c5/0x450 [464.705] prepare_uptodate_page+0x12e/0x4d0 [btrfs] [464.713] prepare_pages.constprop.0+0x13c/0x5c0 [btrfs] [464.721] ? fault_in_iov_iter_readable+0xd2/0x240 [464.729] btrfs_buffered_write+0x5bd/0x12f0 [btrfs] [464.737] ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs] [464.745] ? __pfx_lock_release+0x10/0x10 [464.752] ? generic_write_checks+0x275/0x400 [464.759] ? down_write+0x118/0x1f0 [464.765] ? up_write+0x19b/0x500 [464.770] btrfs_direct_write+0x731/0xba0 [btrfs] [464.778] ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs] [464.785] ? __pfx___might_resched+0x10/0x10 [464.792] ? lock_acquire+0x435/0x500 [464.798] ? lock_acquire+0x435/0x500 [464.804] btrfs_do_write_iter+0x494/0x640 [btrfs] [464.811] ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs] [464.819] ? __pfx___might_resched+0x10/0x10 [464.825] ? rw_verify_area+0x6d/0x590 [464.831] vfs_write+0x5d7/0xf50 [464.837] ? __might_fault+0x9d/0x120 [464.843] ? __pfx_vfs_write+0x10/0x10 [464.849] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.856] ? lock_release+0x567/0x790 [464.862] ksys_write+0xfb/0x1d0 [464.867] ? __pfx_ksys_write+0x10/0x10 [464.873] ? _raw_spin_unlock+0x23/0x40 [464.879] ? btrfs_getattr+0x4af/0x670 [btrfs] [464.886] ? vfs_getattr_nosec+0x79/0x340 [464.892] do_syscall_64+0x95/0x180 [464.898] ? __do_sys_newfstat+0xde/0xf0 [464.904] ? __pfx___do_sys_newfstat+0x10/0x10 [464.911] ? trace_irq_enable.constprop.0+0xce/0x110 [464.918] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.925] ? do_syscall_64+0xa1/0x180 [464.931] ? trace_irq_enable.constprop.0+0xce/0x110 [464.939] ? trace_irq_enable.constprop.0+0xce/0x110 [464.946] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.953] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.960] ? do_syscall_64+0xa1/0x180 [464.966] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.973] ? trace_irq_enable.constprop.0+0xce/0x110 [464.980] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.987] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs] [464.995] ? 
trace_irq_enable.constprop.0+0xce/0x110 [465.002] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs] [465.010] ? do_syscall_64+0xa1/0x180 [465.016] ? lock_release+0x567/0x790 [465.022] ? __pfx_lock_acquire+0x10/0x10 [465.028] ? __pfx_lock_release+0x10/0x10 [465.034] ? trace_irq_enable.constprop.0+0xce/0x110 [465.042] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.049] ? do_syscall_64+0xa1/0x180 [465.055] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.062] ? do_syscall_64+0xa1/0x180 [465.068] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.075] ? do_syscall_64+0xa1/0x180 [465.081] ? clear_bhb_loop+0x25/0x80 [465.087] ? clear_bhb_loop+0x25/0x80 [465.093] ? clear_bhb_loop+0x25/0x80 [465.099] entry_SYSCALL_64_after_hwframe+0x76/0x7e [465.106] RIP: 0033:0x7f093b8ee784 [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784 [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003 [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000 [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000 [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001 [465.170] </TASK> [465.174] INFO: lockdep is turned off. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `97713b1a2c` ("btrfs: do not clear page dirty inside extent_write_locked_range()") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 15:36:57 +02:00
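Why the ordering matters can be shown with a toy state model built only from the behaviour described in the commit message: starting writeback drops the mapping's DIRTY tag only if the folio is already clean, while clearing the folio's dirty flag leaves the tag alone. Everything below (`struct folio_model` and the helpers) is a hypothetical model, not page cache code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a folio's dirty flag plus the mapping's DIRTY tag for it. */
struct folio_model {
	bool dirty;        /* folio dirty flag */
	bool tag_dirty;    /* PAGECACHE_TAG_DIRTY in the mapping xarray */
	bool writeback;
};

/* Models folio_start_writeback(): clears the tag only for a clean folio. */
static void start_writeback_model(struct folio_model *f)
{
	f->writeback = true;
	if (!f->dirty)
		f->tag_dirty = false;
}

/* Models clearing the folio dirty flag without touching the tag. */
static void clear_dirty_model(struct folio_model *f)
{
	f->dirty = false;
}

/* Order after the regression: writeback first, then clear dirty. */
static struct folio_model buggy_order(void)
{
	struct folio_model f = { .dirty = true, .tag_dirty = true };

	start_writeback_model(&f);
	clear_dirty_model(&f);
	return f;   /* clean folio, but tag still set: the "strange" state */
}

/* Order after the fix: clear dirty first, then start writeback. */
static struct folio_model fixed_order(void)
{
	struct folio_model f = { .dirty = true, .tag_dirty = true };

	clear_dirty_model(&f);
	start_writeback_model(&f);
	return f;   /* clean folio and tag cleared */
}
```

In this model, only the fixed ordering leaves the folio flag and the xarray tag in agreement.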
Phillip Lougher	810ee43d9c	Squashfs: sanity check symbolic link size Syzkaller reports a "KMSAN: uninit-value in pick_link" bug. This is caused by an uninitialised page, which is ultimately caused by a corrupted symbolic link size read from disk. The corrupted symlink size produces an uninitialised page through the following sequence of events: 1. squashfs_read_inode() is called to read the symbolic link from disk. This assigns the corrupted value 3875536935 to inode->i_size. 2. Later squashfs_symlink_read_folio() is called, which assigns this corrupted value to the length variable, which, being a signed int, overflows, producing a negative number. 3. The following loop that fills in the page contents checks that the number of copied bytes is less than length, which, being negative, means the loop is skipped, producing an uninitialised page. This patch adds a sanity check which checks that the symbolic link size is not larger than expected. -- Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk Reported-by: Lizhi Xu <lizhi.xu@windriver.com> Reported-by: syzbot+24ac24ff58dc5b0d26b9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/ V2: fix spelling mistake. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:56:46 +02:00
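The three-step failure can be reproduced arithmetically. Converting 3875536935 to a 32-bit signed int is implementation-defined in C; on the usual two's-complement targets it wraps to 3875536935 − 2^32 = −419430361, and a `copied < length` loop then never runs. The helpers below are hypothetical simplifications, and the one-page limit in `check_symlink_size()` is an assumption about what "not larger than expected" means, not a quote of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096   /* assumed page size */

/* Step 2: the 64-bit size narrowed into a signed int (wraps on GCC/Clang). */
static int narrowed_length(int64_t i_size)
{
	return (int)i_size;
}

/* Step 3: simplified fill loop; returns how many bytes it would copy. */
static int bytes_filled(int length)
{
	int copied = 0;

	for (int bytes = 0; bytes < length; bytes += PAGE_SIZE_SKETCH)
		copied += (length - bytes < PAGE_SIZE_SKETCH) ?
			  length - bytes : PAGE_SIZE_SKETCH;
	return copied;
}

/* Sanity check in the spirit of the fix, assuming a one-page limit. */
static int check_symlink_size(int64_t i_size)
{
	return (i_size < 0 || i_size > PAGE_SIZE_SKETCH) ? -EINVAL : 0;
}
```

Rejecting the size when the inode is read means the narrowing in step 2 can never see an out-of-range value.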
Dominique Martinet	e3786b29c5	9p: Fix DIO read through netfs If a program is watching a file on a 9p mount, it won't see any change in size if the file being exported by the server is changed directly in the source filesystem, presumably because 9p doesn't have change notifications, and because netfs skips the reads if the file is empty. Fix this by attempting to read the full size specified when a DIO read is requested (such as when 9p is operating in unbuffered mode) and dealing with a short read if the EOF was less than the expected read. To make this work, filesystems using netfslib must not set NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF. I don't want to mandatorily clear this flag in netfslib for DIO because, say, ceph might make a read from an object that is not completely filled, but does not reside at the end of file - and so we need to clear the excess. This can be tested by watching an empty file over 9p within a VM (such as in the ktest framework): while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo then writing something into the empty file. The watcher should immediately display the file content and break out of the loop. Without this fix, it remains in the loop indefinitely. 
Fixes: `80105ed2fd` ("9p: Use netfslib read/write_iter") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:53:09 +02:00
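The fix's strategy, "ask for the full size and treat a short result as EOF arriving early", is the same pattern userspace code uses with POSIX `read()`. The sketch below is only that userspace analogy: `read_up_to()` is a hypothetical helper, not netfs or 9p code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Request the whole @len bytes; a short result means EOF came early and
 * is not an error. Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_up_to(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;   /* interrupted: retry */
			return -1;
		}
		if (n == 0)         /* EOF before @len bytes: short read */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

Because the caller never assumes the file is empty, a size change on the server side shows up as soon as the data is actually there to read.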
Zhihao Cheng	2a0629834c	vfs: Don't evict inode under the inode lru traversing context The inode reclaiming process (see function prune_icache_sb) first collects all reclaimable inodes and marks them with the I_FREEING flag; from then on, other processes that try to get these inodes will be stuck (see function find_inode_fast), and the reclaiming process then destroys the inodes via dispose_list(). Some filesystems (e.g. ext4 with the ea_inode feature, ubifs with xattrs) may perform an inode lookup in their inode evicting callback; if that lookup runs under the inode lru traversing context, deadlocks may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup can happen if the ea_inode feature is enabled, and the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. Then, the following two processes run like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In the deleted-inode writing function ubifs_jnl_write_inode(), the file-deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which can race with the inode reclaiming process (the reclaiming process may try to lock BASEHD's wbuf->io_mutex in the inode evicting function); then an ABBA deadlock would happen as follows: 1.
File A has inode ia and an xattr (with inode ixa), and regular file B has inode ib and an xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. Then, the following three processes run like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) | iput(ib) // ib->nlink is 0, do evict | ubifs_evict_inode | ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using a new inode state flag, I_LRU_ISOLATING, to pin the inode in memory while inode_lru_isolate() reclaims its pages, instead of taking an ordinary inode reference. This way, inode deletion cannot be triggered from inode_lru_isolate(), which avoids the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup.
Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: `e50e5129f3` ("ext4: xattr-in-inode support") Fixes: `7959cf3a75` ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:52:16 +02:00
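The pattern of pinning with a state flag instead of a reference, with eviction waiting for the flag to clear, can be modelled in userspace with a mutex and condition variable. This is a hypothetical pthreads model of the idea described above: `struct inode_sketch` and its helpers are not kernel APIs, and the mutex and condvar only stand in for `i_lock` and the kernel's bit-wait wakeup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inode model: a pin flag instead of a reference count. */
struct inode_sketch {
	pthread_mutex_t lock;   /* stands in for inode->i_lock */
	pthread_cond_t cond;    /* stands in for the bit-wait wakeup */
	bool lru_isolating;     /* stands in for I_LRU_ISOLATING */
	bool evicted;
};

/* LRU side: pin the inode while its pages are reclaimed. */
static void lru_isolate_begin(struct inode_sketch *inode)
{
	pthread_mutex_lock(&inode->lock);
	inode->lru_isolating = true;
	pthread_mutex_unlock(&inode->lock);
}

/* LRU side: unpin and wake anyone waiting in evict. */
static void lru_isolate_end(struct inode_sketch *inode)
{
	pthread_mutex_lock(&inode->lock);
	inode->lru_isolating = false;
	pthread_cond_broadcast(&inode->cond);
	pthread_mutex_unlock(&inode->lock);
}

/* evict(): wait for the pin to be dropped before tearing down. */
static void evict_sketch(struct inode_sketch *inode)
{
	pthread_mutex_lock(&inode->lock);
	while (inode->lru_isolating)
		pthread_cond_wait(&inode->cond, &inode->lock);
	inode->evicted = true;
	pthread_mutex_unlock(&inode->lock);
}

/* Thread entry used to run eviction concurrently. */
static void *evict_thread(void *arg)
{
	evict_sketch(arg);
	return NULL;
}
```

Unpinning never drops a reference in this model, so the LRU side cannot trigger eviction (and the lookups it may perform) from inside the traversal; the final teardown happens in whichever context calls evict.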
Filipe Manana	46a6e10a1a	btrfs: send: allow cloning non-aligned extent if it ends at i_size If we a find that an extent is shared but its end offset is not sector size aligned, then we don't clone it and issue write operations instead. This is because the reflink (remap_file_range) operation does not allow to clone unaligned ranges, except if the end offset of the range matches the i_size of the source and destination files (and the start offset is sector size aligned). While this is not incorrect because send can only guarantee that a file has the same data in the source and destination snapshots, it's not optimal and generates confusion and surprising behaviour for users. For example, running this test: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Use a file size not aligned to any possible sector size. file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes dd if=/dev/random of=$MNT/foo bs=$file_size count=1 cp --reflink=always $MNT/foo $MNT/bar btrfs subvolume snapshot -r $MNT/ $MNT/snap rm -f /tmp/send-test btrfs send -f /tmp/send-test $MNT/snap umount $MNT mkfs.btrfs -f $DEV mount $DEV $MNT btrfs receive -vv -f /tmp/send-test $MNT xfs_io -r -c "fiemap -v" $MNT/snap/bar umount $MNT Gives the following result: (...) 
mkfile o258-7-0 rename o258-7-0 -> bar write bar - offset=0 length=49152 write bar - offset=49152 length=49152 write bar - offset=98304 length=49152 write bar - offset=147456 length=49152 write bar - offset=196608 length=49152 write bar - offset=245760 length=49152 write bar - offset=294912 length=49152 write bar - offset=344064 length=49152 write bar - offset=393216 length=49152 write bar - offset=442368 length=49152 write bar - offset=491520 length=49152 write bar - offset=540672 length=49152 write bar - offset=589824 length=49152 write bar - offset=638976 length=49152 write bar - offset=688128 length=49152 write bar - offset=737280 length=49152 write bar - offset=786432 length=49152 write bar - offset=835584 length=49152 write bar - offset=884736 length=49152 write bar - offset=933888 length=49152 write bar - offset=983040 length=49152 write bar - offset=1032192 length=16389 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x1 There's no clone operation to clone extents from the file foo into file bar and fiemap confirms there's no shared flag (0x2000). So update send_write_or_clone() so that it proceeds with cloning if the source and destination ranges end at the i_size of the respective files. After this changes the result of the test is: (...) mkfile o258-7-0 rename o258-7-0 -> bar clone bar - source=foo source offset=0 offset=0 length=1048581 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x2001 A test case for fstests will also follow up soon. 
Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:45:42 +02:00
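The cloning rule described in this commit can be sketched as a predicate. This is a minimal sketch assuming a power-of-two sector size; `toy_can_clone()` is a hypothetical helper that models the stated rule, not the actual send_write_or_clone() code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of when an unaligned shared extent may be cloned. */
static bool toy_can_clone(uint64_t src_off, uint64_t dst_off, uint64_t len,
			  uint64_t src_isize, uint64_t dst_isize,
			  uint32_t sectorsize)
{
	uint64_t mask = (uint64_t)sectorsize - 1; /* sectorsize: power of two */

	/* Start offsets must always be sector aligned. */
	if ((src_off & mask) || (dst_off & mask))
		return false;
	/* Aligned start plus aligned length means an aligned end. */
	if ((len & mask) == 0)
		return true;
	/* Unaligned end: allowed only if both ranges end exactly at i_size. */
	return src_off + len == src_isize && dst_off + len == dst_isize;
}
```

For the 1MB + 5 bytes file in the test script, both ranges start at 0 and end at i_size (1048581), so the predicate allows the clone that the fixed send stream shows.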
Filipe Manana	ae1e766f62	btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit `b3ebb9b7e9` ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit `956a17d9d0` ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). 
Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: `956a17d9d0` ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:43:28 +02:00
Qu Wenruo	31723c9542	btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type [REPORT] There is a bug report that kernel is rejecting a mismatching inode mode and its dir item: [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with dir: inode mode=040700 btrfs type=2 dir type=0 [CAUSE] It looks like the inode mode is correct, while the dir item type 0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all. This may be caused by a memory bit flip. [ENHANCEMENT] Although tree-checker is not able to do any cross-leaf verification, for this particular case we can at least reject any dir type with BTRFS_FT_UNKNOWN. So here we enhance the dir type check from [0, BTRFS_FT_MAX), to (0, BTRFS_FT_MAX). Although the existing corruption can not be fixed just by such enhanced checking, it should prevent the same 0x2->0x0 bitflip for dir type to reach disk in the future. Reported-by: Kota <nospam@kota.moe> Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:42:26 +02:00
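The range change from [0, BTRFS_FT_MAX) to (0, BTRFS_FT_MAX) amounts to one extra comparison. A minimal sketch, assuming the type values mirror the btrfs uapi definitions (BTRFS_FT_UNKNOWN = 0, BTRFS_FT_DIR = 2, BTRFS_FT_MAX = 9); the `TOY_*` names and helpers are hypothetical, and flag bits such as the encryption bit are ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the btrfs dir item type values; only those used here. */
enum {
	TOY_FT_UNKNOWN = 0,
	TOY_FT_DIR = 2,
	TOY_FT_MAX = 9,
};

/* Old range check: [0, MAX). A 0x2 -> 0x0 bit flip still passes. */
static bool toy_dir_type_ok_old(uint8_t type)
{
	return type < TOY_FT_MAX;
}

/* Enhanced range check: (0, MAX), so BTRFS_FT_UNKNOWN is rejected too. */
static bool toy_dir_type_ok_new(uint8_t type)
{
	return type > TOY_FT_UNKNOWN && type < TOY_FT_MAX;
}
```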
Josef Bacik	42fac187b5	btrfs: check delayed refs when we're checking if a ref exists In the patch `78c52d9eb6` ("btrfs: check for refs on snapshot delete resume") I added some code to handle file systems that had been corrupted by a bug that incorrectly skipped updating the drop progress key while dropping a snapshot. This code would check to see if we had already deleted our reference for a child block, and skip the deletion if we had already. Unfortunately there is a bug, as the check would only check the on-disk references. I made an incorrect assumption that blocks in an already deleted snapshot that was having the deletion resume on mount wouldn't be modified. If we have 2 pending deleted snapshots that share blocks, we can easily modify the rules for a block. Take the following example subvolume a exists, and subvolume b is a snapshot of subvolume a. They share references to block 1. Block 1 will have 2 full references, one for subvolume a and one for subvolume b, and it belongs to subvolume a (btrfs_header_owner(block 1) == subvolume a). When deleting subvolume a, we will drop our full reference for block 1, and because we are the owner we will drop our full reference for all of block 1's children, convert block 1 to FULL BACKREF, and add a shared reference to all of block 1's children. Then we will start the snapshot deletion of subvolume b. We look up the extent info for block 1, which checks delayed refs and tells us that FULL BACKREF is set, so sets parent to the bytenr of block 1. However because this is a resumed snapshot deletion, we call into check_ref_exists(). Because check_ref_exists() only looks at the disk, it doesn't find the shared backref for the child of block 1, and thus returns 0 and we skip deleting the reference for the child of block 1 and continue. This orphans the child of block 1. The fix is to lookup the delayed refs, similar to what we do in btrfs_lookup_extent_info(). However we only care about whether the reference exists or not. 
If we fail to find our reference on disk, go look up the bytenr in the delayed refs, and if it exists look for an existing ref in the delayed ref head. If that exists then we know we can delete the reference safely and carry on. If it doesn't exist we know we have to skip over this block. This bug has existed since I introduced this fix, however requires having multiple deleted snapshots pending when we unmount. We noticed this in production because our shutdown path stops the container on the system, which deletes a bunch of subvolumes, and then reboots the box. This gives us plenty of opportunities to hit this issue. Looking at the history we've seen this occasionally in production, but we had a big spike recently thanks to faster machines getting jobs with multiple subvolumes in the job. Chris Mason wrote a reproducer which does the following mount /dev/nvme4n1 /btrfs btrfs subvol create /btrfs/s1 simoop -E -f 4k -n 200000 -z /btrfs/s1 while(true) ; do btrfs subvol snap /btrfs/s1 /btrfs/s2 simoop -f 4k -n 200000 -r 10 -z /btrfs/s2 btrfs subvol snap /btrfs/s2 /btrfs/s3 btrfs balance start -dusage=80 /btrfs btrfs subvol del /btrfs/s2 /btrfs/s3 umount /btrfs btrfsck /dev/nvme4n1 || exit 1 mount /dev/nvme4n1 /btrfs done On the second loop this would fail consistently, with my patch it has been running for hours and hasn't failed. I also used dm-log-writes to capture the state of the failure so I could debug the problem. Using the existing failure case to test my patch validated that it fixes the problem. Fixes: `78c52d9eb6` ("btrfs: check for refs on snapshot delete resume") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:42:26 +02:00
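The lookup order described in this commit (on-disk first, then pending delayed refs) can be sketched as follows. This is a heavily simplified model: `struct toy_ref`, `struct toy_ref_set` and the `toy_*` helpers are hypothetical, a reference is reduced to a (child bytenr, parent) pair, and the add/drop bookkeeping of real delayed ref heads is ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A shared (FULL BACKREF) reference: child block plus its parent block. */
struct toy_ref {
	uint64_t bytenr;
	uint64_t parent;
};

struct toy_ref_set {
	const struct toy_ref *refs;
	size_t n;
};

static bool toy_set_has(const struct toy_ref_set *s, uint64_t bytenr,
			uint64_t parent)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->refs[i].bytenr == bytenr && s->refs[i].parent == parent)
			return true;
	return false;
}

/* Before the fix: only the on-disk extent tree is consulted. */
static bool toy_ref_exists_disk_only(const struct toy_ref_set *disk,
				     uint64_t bytenr, uint64_t parent)
{
	return toy_set_has(disk, bytenr, parent);
}

/* After the fix: if the disk lookup misses, also check the pending delayed
 * refs, which may hold the shared ref added while dropping subvolume a. */
static bool toy_ref_exists(const struct toy_ref_set *disk,
			   const struct toy_ref_set *delayed,
			   uint64_t bytenr, uint64_t parent)
{
	if (toy_set_has(disk, bytenr, parent))
		return true;
	return toy_set_has(delayed, bytenr, parent);
}
```

In the example from the message, the shared ref for the child of block 1 exists only as a delayed ref, so the disk-only check says "no ref" and the child gets orphaned.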
David Howells	7b589a9b45	netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used correctly. The problem is that we try to set them up in the request initialisation, but we the cache may be in the process of setting up still, and so the state may not be correct. Further, we secondarily sample the cache state and make contradictory decisions later. The issue arises because we set up the cache resources, which allows the cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which triggers cache writing even if we didn't set the flags when allocating. Fix this in the following way: (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in ->init_request() rather than trying to juggle that in netfs_alloc_request(). (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is to be done, then PG_private_2 is to be used rather than only setting it if we decide to cache and then having netfs_rreq_unlock_folios() set the non-PG_private_2 writeback-to-cache if it wasn't set. (3) Split netfs_rreq_unlock_folios() into two functions, one of which contains the deprecated code for using PG_private_2 to avoid accidentally doing the writeback path - and always use it if USE_PGPRIV2 is set. (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always wait for PG_private_2. This function is deprecated and only used by ceph anyway, and so label it so. (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use fscache_operation_valid() on the cache_resources instead. This has the advantage of picking up the result of netfs_begin_cache_read() and fscache_begin_write_operation() - which are called after the object is initialised and will wait for the cache to come to a usable state. Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be applied on top of that. 
Without this as well, things like: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { and: WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386 may happen, along with some UAFs due to PG_private_2 not getting used to wait on writeback completion. Fixes: `2ff1e97587` ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Hristo Venev <hristo@venev.name> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:27 +02:00
David Howells	8e5ced7804	netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" This reverts commit `ae678317b9`. Revert the patch that removes the deprecated use of PG_private_2 in netfslib for the moment as Ceph is actually still using this to track data copied to the cache. Fixes: `ae678317b9` ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:27 +02:00
Lukas Bulwahn	889ced4c93	netfs: clean up after renaming FSCACHE_DEBUG config Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First, NETFS_DEBUG mistakenly depends on the non-existing config NETFS, whereas the actual intended config is called NETFS_SUPPORT. Second, the config renaming misses to adjust the documentation of the functionality of this config. Clean up those two points. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
yangerkun	64a7ce76fb	libfs: fix infinite directory reads for offset dir After we switch tmpfs dir operations from simple_dir_operations to simple_offset_dir_operations, every rename happened will fill new dentry to dest dir's maple tree(&SHMEM_I(inode)->dir_offsets->mt) with a free key starting with octx->newx_offset, and then set newx_offset equals to free key + 1. This will lead to infinite readdir combine with rename happened at the same time, which fail generic/736 in xfstests(detail show as below). 1. create 5000 files(1 2 3...) under one dir 2. call readdir(man 3 readdir) once, and get one entry 3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry) 4. loop 2~3, until readdir return nothing or we loop too many times(tmpfs break test with the second condition) We choose the same logic what commit `9b378f6ad4` ("btrfs: fix infinite directory reads") to fix it, record the last_index when we open dir, and do not emit the entry which index >= last_index. The file->private_data now used in offset dir can use directly to do this, and we also update the last_index when we llseek the dir file. Fixes: `a2e459555c` ("shmem: stable directory offsets") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20240731043835.1828697-1-yangerkun@huawei.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [brauner: only update last_index after seek when offset is zero like Jan suggested] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
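A toy simulation of the generic/736 loop, to show why capping at the index recorded at open time terminates. It assumes each rename round trip hands the entry a fresh, larger offset, as described above; `struct toy_dir` and the `toy_*` helpers are hypothetical and do not model the maple tree or the libfs code.

```c
#include <stdbool.h>

#define TOY_NENTRIES 5

struct toy_dir {
	unsigned long index[TOY_NENTRIES]; /* dir offset of each entry */
	unsigned long next_offset;         /* grows on every rename */
};

static void toy_dir_init(struct toy_dir *d)
{
	for (int i = 0; i < TOY_NENTRIES; i++)
		d->index[i] = (unsigned long)i + 2; /* 0 and 1: "." and ".." */
	d->next_offset = TOY_NENTRIES + 2;
}

/* Entry with the smallest index >= pos, or -1. When bounded, entries whose
 * index >= last_index are not emitted. */
static int toy_next_entry(const struct toy_dir *d, unsigned long pos,
			  bool bounded, unsigned long last_index)
{
	int best = -1;

	for (int i = 0; i < TOY_NENTRIES; i++) {
		unsigned long idx = d->index[i];

		if (idx < pos || (bounded && idx >= last_index))
			continue;
		if (best < 0 || idx < d->index[best])
			best = i;
	}
	return best;
}

/* rename(entry, "TEMPFILE") then rename("TEMPFILE", entry). */
static void toy_rename_roundtrip(struct toy_dir *d, int entry)
{
	d->next_offset++;                    /* offset taken by TEMPFILE */
	d->index[entry] = d->next_offset++;  /* entry comes back, new offset */
}

/* Number of entries readdir returns, giving up after max_iters. */
static unsigned toy_readdir_loop(bool bounded, unsigned max_iters)
{
	struct toy_dir d;
	unsigned long pos = 0, last_index;
	unsigned n = 0;

	toy_dir_init(&d);
	last_index = d.next_offset; /* recorded when the directory is opened */
	while (n < max_iters) {
		int e = toy_next_entry(&d, pos, bounded, last_index);

		if (e < 0)
			break;
		pos = d.index[e] + 1;
		n++;
		toy_rename_roundtrip(&d, e);
	}
	return n;
}
```

Without the bound, every renamed entry reappears ahead of the cursor and the loop only stops at `max_iters`; with it, readdir returns each of the five original entries once.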
Max Kellermann	f71aa06398	fs/netfs/fscache_cookie: add missing "n_accesses" check This fixes a NULL pointer dereference bug due to a data race which looks like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018 Workqueue: events_unbound netfs_rreq_write_to_cache_work RIP: 0010:cachefiles_prepare_write+0x30/0xa0 Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10 RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286 RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000 RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438 RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001 R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68 R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00 FS: 0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x15d/0x440 ? search_module_extables+0xe/0x40 ? fixup_exception+0x22/0x2f0 ? exc_page_fault+0x5f/0x100 ? asm_exc_page_fault+0x22/0x30 ? cachefiles_prepare_write+0x30/0xa0 netfs_rreq_write_to_cache_work+0x135/0x2e0 process_one_work+0x137/0x2c0 worker_thread+0x2e9/0x400 ? __pfx_worker_thread+0x10/0x10 kthread+0xcc/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? 
__pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Modules linked in: CR2: 0000000000000008 ---[ end trace 0000000000000000 ]--- This happened because fscache_cookie_state_machine() was slow and was still running while another process invoked fscache_unuse_cookie(); this led to a fscache_cookie_lru_do_one() call, setting the FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by fscache_cookie_state_machine(), withdrawing the cookie via cachefiles_withdraw_cookie(), clearing cookie->cache_priv. At the same time, yet another process invoked cachefiles_prepare_write(), which found a NULL pointer in this code line: struct cachefiles_object *object = cachefiles_cres_object(cres); The next line crashes, obviously: struct cachefiles_cache *cache = object->volume->cache; During cachefiles_prepare_write(), the "n_accesses" counter is non-zero (via fscache_begin_operation()). The cookie must not be withdrawn until it drops to zero. The counter is checked by fscache_cookie_state_machine() before switching to FSCACHE_COOKIE_STATE_RELINQUISHING and FSCACHE_COOKIE_STATE_WITHDRAWING (in "case FSCACHE_COOKIE_STATE_FAILED"), but not for FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case FSCACHE_COOKIE_STATE_ACTIVE"). This patch adds the missing check. With a non-zero access counter, the function returns and the next fscache_end_cookie_access() call will queue another fscache_cookie_state_machine() call to handle the still-pending FSCACHE_COOKIE_DO_LRU_DISCARD. Fixes: `12bb21a29c` ("fscache: Implement cookie user counting and resource pinning") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
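The deferral logic described here can be modelled in a few lines. This is a minimal single-threaded sketch: `struct toy_cookie`, its states and the `toy_*` functions are hypothetical stand-ins for the fscache cookie, and the direct call in `toy_end_access()` replaces what the kernel does by queueing the state machine.

```c
#include <stdbool.h>

enum toy_cookie_state { TOY_ACTIVE, TOY_LRU_DISCARDING };

struct toy_cookie {
	enum toy_cookie_state state;
	int n_accesses;        /* raised for the duration of an operation */
	bool do_lru_discard;   /* like FSCACHE_COOKIE_DO_LRU_DISCARD */
	bool cache_priv_valid; /* false once the cookie has been withdrawn */
};

/* One pass of the state machine for the ACTIVE case. */
static void toy_state_machine(struct toy_cookie *c)
{
	if (c->state == TOY_ACTIVE && c->do_lru_discard) {
		if (c->n_accesses > 0)
			return; /* the added check: leave the flag pending */
		c->do_lru_discard = false;
		c->state = TOY_LRU_DISCARDING;
		c->cache_priv_valid = false; /* withdraw */
	}
}

/* Ending the last access re-runs the state machine. */
static void toy_end_access(struct toy_cookie *c)
{
	if (--c->n_accesses == 0)
		toy_state_machine(c);
}
```

While an operation such as prepare_write holds an access, the discard stays pending and the cache private data remains valid; the discard happens once the counter reaches zero.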
Omar Sandoval	3f65f3c099	filelock: fix name of file_lease slab cache When struct file_lease was split out from struct file_lock, the name of the file_lock slab cache was copied to the new slab cache for file_lease. This name conflict causes confusion in /proc/slabinfo and /sys/kernel/slab. In particular, it caused failures in drgn's test case for slab cache merging. Link: `9ad29fd864/tests/linux_kernel/helpers/test_slab.py (L81)` Fixes: `c69ff40719` ("filelock: split leases out of struct file_lock") Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/2d1d053da1cafb3e7940c4f25952da4f0af34e38.1722293276.git.osandov@fb.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:25 +02:00
Matthew Wilcox (Oracle)	98055bc359	netfs: Fault in smaller chunks for non-large folio mappings As in commit `4e527d5841` ("iomap: fault in smaller chunks for non-large folio mappings"), we can see a performance loss for filesystems which have not yet been converted to large folios. Fixes: `c38f4e96e6` ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240527201735.1898381-1-willy@infradead.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:25 +02:00
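A rough illustration of the per-iteration chunking, assuming the netfs write loop follows the cited iomap change: each pass copies (and faults in) at most up to the next boundary of the mapping's maximum folio size. As far as we understand, the iomap commit derives that size via mapping_max_folio_size(), which is PAGE_SIZE for mappings without large folios. The helper below is hypothetical and assumes a power-of-two chunk such as a 4096-byte page.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes to fault in and copy for one iteration starting at pos, without
 * crossing a chunk boundary. chunk must be a power of two. */
static size_t toy_bytes_this_iter(uint64_t pos, size_t remaining, size_t chunk)
{
	size_t offset = (size_t)(pos & (chunk - 1));
	size_t bytes = chunk - offset;

	return bytes < remaining ? bytes : remaining;
}
```

With a 4096-byte chunk a 1 MiB write is faulted in one page at a time, whereas a large-folio-sized chunk would try to fault in the whole buffer up front even though only a page can be copied per folio.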
Linus Torvalds	a1460e457e	fix bitmap corruption on close_range(), take 2 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZrFgeQAKCRBZ7Krx/gZQ 6w4SAP48jL+Vil7ifIXviasoBrQGzf9lbTcOAmWAoaxjSlvlpAEAw4OyPhJUmKHW ykB/yqUMCajsrrTQPN5lmc0W5v0nqQ4= =G0ms -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fd bitmap fix from Al Viro: "Fix bitmap corruption on close_range() by cleaning up copy_fd_bitmaps()" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE	2024-08-12 08:03:28 -07:00
Linus Torvalds	5189dafa4c	nfsd-6.11 fixes: - Two minor fixes for recent changes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAma3pFoACgkQM2qzM29m f5f4dA//SiPOS7v60jQubfV3f031E+W3a6YxvB5tqByA23hOavjYowGREyvhtaS5 lgqRO25BXz7uNOjSWUXaV7SsR1GKElWtk/gGVsfYy03A91Qvafb24oapJRwRXeJ1 5n6X5AScxJeoDGEIUMoGU3O87v4T6UoTj5K6bjUDqr5kTRTL6rPFj3jYvy0YVLe3 EsA1P+BLctB1XiboMYNfZyEJbxNF5gpO7OaeS8VPQIeNYhaF1cwsa7A9wDJ22UpB BQEguTjPqgzkthOdkc3jNcoXtDUEzAXxQqACKR8CU6cCoyr159KupSlci4LqA2o4 8CwcVd01L79JjGcrZqo+4vmbuY6rfhN7U/EYcNNJPZKUIo7Q5h4dq/pCaBADK2Ke UJVPsmdYXKMwn/pPJI9kjoellqSrjOxqFDXXqBb/e8Z/xXuXmiYRl6E+dtGeBlty j4qm12d1BGKs7zRaLLThbFTH6s3kUCjVhEr+uphse6CVF6HV/NNK79jooHrrgCiS e1vgRaYIgzMREOyq8fZwvINDuk/8Pa1Z4wlR65orwVzWlZMmk9bLm+wwplcyj34x una8RrQhxlicY5nAGUa+aRCkKUtO9oSPs6hukoT+DB06+mhQ7u3ro96ojueMfMdQ pKux8zrCdJe7jZUlnIW30/QsrgWhHW9KsrdUA9eKQ1FoZIOdS9I= =YX9E -----END PGP SIGNATURE----- Merge tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Two minor fixes for recent changes * tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets sunrpc: avoid -Wformat-security warning	2024-08-10 10:44:21 -07:00
Linus Torvalds	31b2444606	bcachefs fixes for 6.11-rc2, more - fix a bug that was causing ACLs to seemingly "disappear" - new on disk format version, bcachefs_metadata_version_disk_accounting_v3 bcachefs_metadata_version_disk_accounting_v2 accidentally included padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we can fix this with another version bump. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma3mGIACgkQE6szbY3K bnagAw//XP8wYgEE6worKtLb8sNx+w+jeKR1FeAsYBP012+58xBvljxsUm2J6Mct bs7nJmhextVEhcIhm43/SEaic1dXWNxlC5hvTPQzJCGAjyVCPH2naXpz284GQzIu HlwBlnWK2HjnSpPHFPEqRm3jq/H9SGEhENDJTrccg7E82towGktQspvmx1Bzyt/E cb31An0Y4lQmZGu1/bxX0IDjcKYdI4Jr+FQwiJV35KwGXSw9gzkfHii0wNY+gTAj OklXiw5oOtIR52Oy7FKLhYWdl6GIe4/hYS+yYJ3sMiSUuYY50otwIGGnCRLICRlf AsE4fmN8wjTPCRryOOTDvc/MCuKuanQa8x1pH0cM3WMKqhl4mrDD2wne+38gfgDN zQwoLHfDdDXtadf0NB/VoVQnHTfYSk6GKpHHduMhZeb45XNm0WNryZcefQMunSMz CTBq22f22iSngF0KCvHQgp7mm8XNk+d12keQ37ldINRkC+mDWPu7Yi/mhwYzN43O +6WrzxyQsQPhLE+nfYRlvAuDCCVsXqjL2U2EVyuaBmUJaMMZbFU+u9zpb48dMjIl dRxBQfOsJjvbvTMTNlG4EDslKi8HjoqHXMo5Y3nhvSAmQRjkltfhiXSRivdJxYuR NwjJYtszbbBIqSsUZ6DqR+1eTG6HSQ1ML0B+Nm8orLLMqIFlAkQ= =f3fR -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs Pull more bcachefs fixes from Kent Overstreet: "A couple last minute fixes for the new disk accounting - fix a bug that was causing ACLs to seemingly "disappear" - new on disk format version, bcachefs_metadata_version_disk_accounting_v3 bcachefs_metadata_version_disk_accounting_v2 accidentally included padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we can fix this with another version bump" * tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs: bcachefs: bcachefs_metadata_version_disk_accounting_v3 bcachefs: improve bch2_dev_usage_to_text() bcachefs: bch2_accounting_invalid() bcachefs: Switch to .get_inode_acl()	2024-08-10 10:06:26 -07:00
Linus Torvalds	34ac1e82e5	Three smb3 client fixes including two for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma2k9YACgkQiiy9cAdy T1GkqAv7B8wyDp7pVU8qtTxct1gVtDEqvd2Kte6962uwvKWEhhf8SKzagypICudL vXxwJuPsRlFM5jWd8h36/xXC17BYM1mEsz6GVq9viUYUcqtrKrGPsOL7+OLVR3Mp SwPRPLYyLyMBHhc3Su2/6uGr87NMkcgDTzugZrx67ojAaw/xkYsiuz+Wm8ijbdik waDkJ3zosj0Nbhud47sf6rkrymiqC5kP717rhlcXF+TxNscFDQ/eFO8b0lt3g5q5 +Y338a4pLLFWXBm1jP9EOtbx7NSDKaqWVYYtnwEBs6EV3QGpKjbT1zBQ07wWjfT4 FhcFhYYukq7Jc4X39JouBXqYWR2wjB8VpMzCuwNsDNJ7FahNaGTgfXCMx6C0Bi0w XICAMoZfBXGSAs+NTyi38AHyPQ+KdYJeSgA9doL+3FhMGx9KKAVkT6XB3twhxoOS w/iMyzhx/1mwv4CukCKm3lDflN09AseB688QiFsTshTmnTlWFcHdEnORqWbNrd2N evhrKvxK =GIpz -----END PGP SIGNATURE----- Merge tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - DFS fix - fix for security flags for requiring encryption - minor cleanup * tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: cifs_inval_name_dfs_link_error: correct the check for fullpath Fix spelling errors in Server Message Block smb3: fix setting SecurityFlags when encryption is required	2024-08-09 21:33:25 -07:00
Kees Cook	3eb3cd5992	binfmt_flat: Fix corruption when not offsetting data start Commit `04d82a6d08` ("binfmt_flat: allow not offsetting data start") introduced a RISC-V specific variant of the FLAT format which does not allocate any space for the (obsolete) array of shared library pointers. However, it did not disable the code which initializes the array, resulting in the corruption of sizeof(long) bytes before the DATA segment, generally the end of the TEXT segment. Introduce MAX_SHARED_LIBS_UPDATE which depends on the state of CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to guard the initialization of the shared library pointer region so that it will only be initialized if space is reserved for it. Fixes: `04d82a6d08` ("binfmt_flat: allow not offsetting data start") Co-developed-by: Stefan O'Rear <sorear@fastmail.com> Signed-off-by: Stefan O'Rear <sorear@fastmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Link: https://lore.kernel.org/r/20240807195119.it.782-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>	2024-08-09 20:19:00 -07:00
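A minimal sketch of the guard described in this commit, assuming MAX_SHARED_LIBS is 1 and modelling CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET with a local macro. `TOY_NO_DATA_START_OFFSET` and `toy_init_lib_ptrs()` are hypothetical; the loader's real initialization code differs in detail.

```c
/* Assumed value for this sketch. */
#define MAX_SHARED_LIBS 1

/* Set to 1 to model CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET=y. */
#define TOY_NO_DATA_START_OFFSET 1

#if TOY_NO_DATA_START_OFFSET
#define DATA_START_OFFSET_WORDS 0 /* no words reserved before DATA */
#define MAX_SHARED_LIBS_UPDATE 0  /* so nothing may be written there */
#else
#define DATA_START_OFFSET_WORDS MAX_SHARED_LIBS
#define MAX_SHARED_LIBS_UPDATE MAX_SHARED_LIBS
#endif

/* Zero nwords shared library pointer words just below the DATA segment. */
static void toy_init_lib_ptrs(unsigned long *datapos, int nwords)
{
	for (int i = 0; i < nwords; i++)
		datapos[-(i + 1)] = 0;
}
```

Passing MAX_SHARED_LIBS_UPDATE leaves the last TEXT word alone when no space was reserved, while the pre-fix bound of MAX_SHARED_LIBS overwrites it, matching the sizeof(long)-byte corruption described above.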


93246 Commits