linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 06:50:37 +00:00

Author	SHA1	Message	Date
Dan Carpenter	27349b4d2e	ext4: cleanup variable name in ext4_fc_del() The variables "&EXT4_SB(inode->i_sb)->s_fc_lock" and "&sbi->s_fc_lock" are the same lock. This function uses a mix of both, which is a bit unsightly and confuses Smatch. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/96008557-8ff4-44cc-b5e3-ce242212f1a3@stanley.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:15 -05:00
R Sundar	867b73909a	ext4: use string choices helpers Use string choice helpers for better readability and to fix cocci warning Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202410062256.BoynX3c2-lkp@intel.com/ Signed-off-by: R Sundar <prosunofficial@gmail.com> Link: https://patch.msgid.link/20241007172006.83339-1-prosunofficial@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:15 -05:00
Yu Jiaoliang	5ad585bcfe	ext4: use ERR_CAST to return an error-valued pointer Instead of directly casting and returning an error-valued pointer, use ERR_CAST to make the error handling more explicit and improve code clarity. Signed-off-by: Yu Jiaoliang <yujiaoliang@vivo.com> Link: https://patch.msgid.link/20240920021440.1959243-1-yujiaoliang@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
Brian Foster	c7fc0366c6	ext4: partial zero eof block on unaligned inode size extension Using mapped writes, it's technically possible to expose stale post-eof data on a truncate up operation. Consider the following example: $ xfs_io -fc "pwrite 0 2k" -c "mmap 0 4k" -c "mwrite 2k 2k" \ -c "truncate 8k" -c "pread -v 2k 16" <file> ... 00000800: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 XXXXXXXXXXXXXXXX ... This shows that the post-eof data written via mwrite lands within EOF after a truncate up. While this is deliberate of the test case, behavior is somewhat unpredictable because writeback does post-eof zeroing, and writeback can occur at any time in the background. For example, an fsync inserted between the mwrite and truncate causes the subsequent read to instead return zeroes. This basically means that there is a race window in this situation between any subsequent extending operation and writeback that dictates whether post-eof data is exposed to the file or zeroed. To prevent this problem, perform partial block zeroing as part of the various inode size extending operations that are susceptible to it. For truncate extension, zero around the original eof similar to how truncate down does partial zeroing of the new eof. For extension via writes and fallocate related operations, zero the newly exposed range of the file to cover any partial zeroing that must occur at the original and new eof blocks. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://patch.msgid.link/20240919160741.208162-2-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
Jinliang Zheng	25f51ea8ac	ext4: disambiguate the return value of ext4_dio_write_end_io() The commit `91562895f8` ("ext4: properly sync file size update after O_SYNC direct IO") causes confusion about the meaning of the return value of ext4_dio_write_end_io(). Specifically, when the ext4_handle_inode_extension() operation succeeds, ext4_dio_write_end_io() directly returns count instead of 0. This does not cause a bug in the current kernel, but the semantics of the return value of the ext4_dio_write_end_io() function are wrong, which is likely to introduce bugs in the future code evolution. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240919082539.381626-1-alexjlzheng@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
j.xia	813f853604	ext4: pass write-hint for buffered IO Commit `449813515d` ("block, fs: Restore the per-bio/request data lifetime fields") restored write-hint support in ext4. But that is applicable only for direct IO. This patch supports passing write-hint for buffered IO from ext4 file system to block layer by filling bi_write_hint of struct bio in io_submit_add_bh(). Signed-off-by: j.xia <j.xia@samsung.com> Link: https://patch.msgid.link/20240919020341.2657646-1-j.xia@samsung.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
Long Li	2f3d93e210	ext4: fix race in buffer_head read fault injection When I enabled ext4 debug for fault injection testing, I encountered the following warning: EXT4-fs error (device sda): ext4_read_inode_bitmap:201: comm fsstress: Cannot read inode bitmap - block_group = 8, inode_bitmap = 1051 WARNING: CPU: 0 PID: 511 at fs/buffer.c:1181 mark_buffer_dirty+0x1b3/0x1d0 The root cause of the issue lies in the improper implementation of ext4's buffer_head read fault injection. The actual completion of buffer_head read and the buffer_head fault injection are not atomic, which can lead to the uptodate flag being cleared on normally used buffer_heads in race conditions. [CPU0] [CPU1] [CPU2] ext4_read_inode_bitmap ext4_read_bh() <bh read complete> ext4_read_inode_bitmap if (buffer_uptodate(bh)) return bh jbd2_journal_commit_transaction __jbd2_journal_refile_buffer __jbd2_journal_unfile_buffer __jbd2_journal_temp_unlink_buffer ext4_simulate_fail_bh() clear_buffer_uptodate mark_buffer_dirty <report warning> WARN_ON_ONCE(!buffer_uptodate(bh)) The best approach would be to perform fault injection in the IO completion callback function, rather than after IO completion. However, the IO completion callback function cannot get the fault injection code in sb. Fix it by passing the result of fault injection into the bh read function, we simulate faults within the bh read function itself. This requires adding an extra parameter to the bh read functions that need fault injection. Fixes: `46f870d690` ("ext4: simulate various I/O and checksum errors when reading metadata") Signed-off-by: Long Li <leo.lilong@huawei.com> Link: https://patch.msgid.link/20240906091746.510163-1-leo.lilong@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
Zhang Yi	a90825898b	ext4: don't pass full mapping flags to ext4_es_insert_extent() When converting a delalloc extent in ext4_es_insert_extent(), since we only want to pass the info of whether the quota has already been claimed if the allocation is a direct allocation from ext4_map_create_blocks(), there is no need to pass full mapping flags, so changes to just pass whether the EXT4_GET_BLOCKS_DELALLOC_RESERVE bit is set. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240906061401.2980330-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:14 -05:00
Andy Shevchenko	667de03a3b	ext4: mark ctx__flags() with __maybe_unused When ctx_set_flags() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: .../ext4/super.c:2120:1: error: unused function 'ctx_set_flags' [-Werror,-Wunused-function] 2120 \| EXT4_SET_CTX(flags); / set only / \| ^~~~~~~~~~~~~~~~~~~ Fix this by marking ctx__flags() with __maybe_unused (mark both for the sake of symmetry). See also commit `6863f5643d` ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20240905163229.140522-1-andriy.shevchenko@linux.intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:13 -05:00
Amir Goldstein	150c174a60	ext4: return error on syncfs after shutdown This is the logic behavior and one that we would like to verify using a generic fstest similar to xfs/546. Link: https://lore.kernel.org/fstests/20240830152648.GE6216@frogsfrogsfrogs/ Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904084657.1062243-1-amir73il@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:13 -05:00
Zhaoyang Huang	a9cdf82a47	fs: ext4: Don't use CMA for buffer_head cma_alloc() keep failed in our system which thanks to a jh->bh->b_page can not be migrated out of CMA area[1] as the jh has one cp_transaction pending on it because of j_free > j_max_transaction_buffers[2][3][4][5][6]. We temporarily solve this by launching jbd2_log_do_checkpoint forcefully somewhere. Since journal is common mechanism to all JFSs and cp_transaction has a little fewer opportunity to be launched, the cma_alloc() could be affected under the same scenario. This patch would like to have buffer_head of ext4 not use CMA pages when doing sb_getblk. [1] crash_arm64_v8.0.4++> kmem -p\|grep ffffff808f0aa150(sb->s_bdev->bd_inode->i_mapping) fffffffe01a51c00 e9470000 ffffff808f0aa150 3 2 8000000008020 lru,private fffffffe03d189c0 174627000 ffffff808f0aa150 4 2 2004000000008020 lru,private fffffffe03d88e00 176238000 ffffff808f0aa150 3f9 2 2008000000008020 lru,private fffffffe03d88e40 176239000 ffffff808f0aa150 6 2 2008000000008020 lru,private fffffffe03d88e80 17623a000 ffffff808f0aa150 5 2 2008000000008020 lru,private fffffffe03d88ec0 17623b000 ffffff808f0aa150 1 2 2008000000008020 lru,private fffffffe03d88f00 17623c000 ffffff808f0aa150 0 2 2008000000008020 lru,private fffffffe040e6540 183995000 ffffff808f0aa150 3f4 2 2004000000008020 lru,private [2] page -> buffer_head crash_arm64_v8.0.4++> struct page.private fffffffe01a51c00 -x private = 0xffffff802fca0c00 [3] buffer_head -> journal_head crash_arm64_v8.0.4++> struct buffer_head.b_private 0xffffff802fca0c00 b_private = 0xffffff8041338e10, [4] journal_head -> b_cp_transaction crash_arm64_v8.0.4++> struct journal_head.b_cp_transaction 0xffffff8041338e10 -x b_cp_transaction = 0xffffff80410f1900, [5] transaction_t -> journal crash_arm64_v8.0.4++> struct transaction_t.t_journal 0xffffff80410f1900 -x t_journal = 0xffffff80e70f3000, [6] j_free & j_max_transaction_buffers crash_arm64_v8.0.4++> struct journal_t.j_free,j_max_transaction_buffers 0xffffff80e70f3000 -x j_free = 0x3f1, j_max_transaction_buffers = 0x100, Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240904075300.1148836-1-zhaoyang.huang@unisoc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:13 -05:00
Jiapeng Chong	c7f9a6fa40	ext4: simplify if condition The if condition !A \|\| A && B can be simplified to !A \|\| B. ./fs/ext4/fast_commit.c:362:21-23: WARNING !A \|\| A && B is equivalent to !A \|\| B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9837 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830071713.40565-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:54:13 -05:00
Theodore Ts'o	4a622e4d47	ext4: fix FS_IOC_GETFSMAP handling The original implementation ext4's FS_IOC_GETFSMAP handling only worked when the range of queried blocks included at least one free (unallocated) block range. This is because how the metadata blocks were emitted was as a side effect of ext4_mballoc_query_range() calling ext4_getfsmap_datadev_helper(), and that function was only called when a free block range was identified. As a result, this caused generic/365 to fail. Fix this by creating a new function ext4_getfsmap_meta_helper() which gets called so that blocks before the first free block range in a block group can get properly reported. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2024-11-12 23:52:47 -05:00
Baokun Li	40eb3104cf	ext4: WARN if a full dir leaf block has only one dentry The maximum length of a filename is 255 and the minimum block size is 1024, so it is always guaranteed that the number of entries is greater than or equal to 2 when do_split() is called. So unless ext4_dx_add_entry() and make_indexed_dir() or some other functions are buggy, 'split == 0' will not occur. Setting 'continued' to 0 in this case masks the problem that the file system has become corrupted, even though it prevents possible out-of-bounds access. Hence WARN_ON_ONCE() is used to check if 'split' is 0, and if it is then warns and returns an error to abort split. Suggested-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20240823160518.GA424729@mit.edu Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008121152.3771906-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:49:55 -05:00
Baokun Li	fdfa648ab9	ext4: show the default enabled prefetch_block_bitmaps option After commit `21175ca434` ("ext4: make prefetch_block_bitmaps default"), we enable 'prefetch_block_bitmaps' by default, but this is not shown in the '/proc/fs/ext4/sdx/options' procfs interface. This makes it impossible to distinguish whether the feature is enabled by default or not, so 'prefetch_block_bitmaps' is shown in the 'options' procfs interface when prefetch_block_bitmaps is enabled by default. This makes it easy to notice changes to the default mount options between versions through the '/proc/fs/ext4/sdx/options' procfs interface. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241008120134.3758097-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-11-12 23:49:51 -05:00
Ritesh Harjani (IBM)	299537e9df	ext4: Do not fallback to buffered-io for DIO atomic write atomic writes is currently only supported for single fsblock and only for direct-io. We should not return -ENOTBLK for atomic writes since we want the atomic write request to either complete fully or fail otherwise. Hence, we should never fallback to buffered-io in case of DIO atomic write requests. Let's also catch if this ever happens by adding some WARN_ON_ONCE before buffered-io handling for direct-io atomic writes. More details of the discussion [1]. While at it let's add an inline helper ext4_want_directio_fallback() which simplifies the logic checks and inherently fixes condition on when to return -ENOTBLK which otherwise was always returning true for any write or directio in ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap. [1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04 Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>	2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)	b7987a7d69	ext4: Support setting FMODE_CAN_ATOMIC_WRITE FS needs to add the fmode capability in order to support atomic writes during file open (refer kiocb_set_rw_flags()). Set this capability on a regular file if ext4 can do atomic write. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>	2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)	43c696f9d0	ext4: Check for atomic writes support in write iter Let's validate the given constraints for atomic write request. Otherwise it will fail with -EINVAL. Currently atomic write is only supported on DIO, so for buffered-io it will return -EOPNOTSUPP. Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>	2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)	6dfc1c1d59	ext4: Add statx support for atomic writes This patch adds base support for atomic writes via statx getattr. On bs < ps systems, we can create FS with say bs of 16k. That means both atomic write min and max unit can be set to 16k for supporting atomic writes. Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz>	2024-11-05 16:20:40 -08:00
Al Viro	8152f82010	fdget(), more trivial conversions all failure exits prior to fdget() leave the scope, all matching fdput() are immediately followed by leaving the scope. [xfs_ioc_commit_range() chunk moved here as well] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-11-03 01:28:06 -05:00
Jan Kara	76486b1041	ext4: avoid remount errors with 'abort' mount option When we remount filesystem with 'abort' mount option while changing other mount options as well (as is LTP test doing), we can return error from the system call after commit `d3476f3dad` ("ext4: don't set SB_RDONLY after filesystem errors") because the application of mount option changes detects shutdown filesystem and refuses to do anything. The behavior of application of other mount options in presence of 'abort' mount option is currently rather arbitary as some mount option changes are handled before 'abort' and some after it. Move aborting of the filesystem to the end of remount handling so all requested changes are properly applied before the filesystem is shutdown to have a reasonably consistent behavior. Fixes: `d3476f3dad` ("ext4: don't set SB_RDONLY after filesystem errors") Reported-by: Jan Stancek <jstancek@redhat.com> Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Jan Stancek <jstancek@redhat.com> Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
Jeongjun Park	902cc179c9	ext4: supress data-race warnings in ext4_free_inodes_{count,set}() find_group_other() and find_group_orlov() read _lo, _hi with ext4_free_inodes_count without additional locking. This can cause data-race warning, but since the lock is held for most writes and free inodes value is generally not a problem even if it is incorrect, it is more appropriate to use READ_ONCE()/WRITE_ONCE() than to add locking. ================================================================== BUG: KCSAN: data-race in ext4_free_inodes_count / ext4_free_inodes_set write to 0xffff88810404300e of 2 bytes by task 6254 on cpu 1: ext4_free_inodes_set+0x1f/0x80 fs/ext4/super.c:405 __ext4_new_inode+0x15ca/0x2200 fs/ext4/ialloc.c:1216 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff88810404300e of 2 bytes by task 6257 on cpu 0: ext4_free_inodes_count+0x1c/0x80 fs/ext4/super.c:349 find_group_other fs/ext4/ialloc.c:594 [inline] __ext4_new_inode+0x6ec/0x2200 fs/ext4/ialloc.c:1017 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391 vfs_symlink+0xca/0x1d0 fs/namei.c:4615 do_symlinkat+0xe3/0x340 fs/namei.c:4641 __do_sys_symlinkat fs/namei.c:4657 [inline] __se_sys_symlinkat fs/namei.c:4654 [inline] __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: stable@vger.kernel.org Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20241003125337.47283-1-aha310510@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
Markus Elfring	d431a2cd28	ext4: Call ext4_journal_stop(handle) only once in ext4_dio_write_iter() An ext4_journal_stop(handle) call was immediately used after a return value check for a ext4_orphan_add() call in this function implementation. Thus call such a function only once instead directly before the check. This issue was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/cf895072-43cf-412c-bced-8268498ad13e@web.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-30 17:42:44 -04:00
André Almeida	3f5ad0d21d	ext4: Use generic_ci_validate_strict_name helper Use the helper function to check the requirements for casefold directories using strict encoding. Suggested-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-2-f443d5814194@igalia.com Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:36:53 +01:00
Pankaj Raghav	30dac24e14	fs/writeback: convert wbc_account_cgroup_owner to take a folio Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-28 13:26:54 +01:00
Christian Brauner	b40508ca5d	Merge patch series "timekeeping/fs: multigrain timestamp redux" Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:57 +02:00
Jeff Layton	d0382c698f	ext4: switch to multigrain timestamps Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. For ext4, we only need to enable the FS_MGTIME flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-10-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-10-10 10:20:53 +02:00
Baokun Li	6121258c2b	ext4: fix off by one issue in alloc_flex_gd() Wesley reported an issue: ================================================================== EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks ------------[ cut here ]------------ kernel BUG at fs/ext4/resize.c:324! CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27 RIP: 0010:ext4_resize_fs+0x1212/0x12d0 Call Trace: __ext4_ioctl+0x4e0/0x1800 ext4_ioctl+0x12/0x20 __x64_sys_ioctl+0x99/0xd0 x64_sys_call+0x1206/0x20d0 do_syscall_64+0x72/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e ================================================================== While reviewing the patch, Honza found that when adjusting resize_bg in alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than flexbg_size. The reproduction of the problem requires the following: o_group = flexbg_size * 2 * n; o_size = (o_group + 1) * group_size; n_group: [o_group + flexbg_size, o_group + flexbg_size * 2) o_size = (n_group + 1) * group_size; Take n=0,flexbg_size=16 as an example: last:15 \|o---------------\|--------------n-\| o_group:0 resize to n_group:30 The corresponding reproducer is: img=test.img rm -f $img truncate -s 600M $img mkfs.ext4 -F $img -b 1024 -G 16 8M dev=`losetup -f --show $img` mkdir -p /tmp/test mount $dev /tmp/test resize2fs $dev 248M Delete the problematic plus 1 to fix the issue, and add a WARN_ON_ONCE() to prevent the issue from happening again. [ Note: another reproucer which this commit fixes is: img=test.img rm -f $img truncate -s 25MiB $img mkfs.ext4 -b 4096 -E nodiscard,lazy_itable_init=0,lazy_journal_init=0 $img truncate -s 3GiB $img dev=`losetup -f --show $img` mkdir -p /tmp/test mount $dev /tmp/test resize2fs $dev 3G umount $dev losetup -d $dev -- TYT ] Reported-by: Wesley Hershberger <wesley.hershberger@canonical.com> Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2081231 Reported-by: Stéphane Graber <stgraber@stgraber.org> Closes: https://lore.kernel.org/all/20240925143325.518508-1-aleksandr.mikhalitsyn@canonical.com/ Tested-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Tested-by: Eric Sandeen <sandeen@redhat.com> Fixes: `665d3e0af4` ("ext4: reduce unnecessary memory allocation in alloc_flex_gd()") Cc: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240927133329.1015041-1-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-10-04 17:36:28 -04:00
Luis Henriques (SUSE)	04e6ce8f06	ext4: mark fc as ineligible using an handle in ext4_xattr_set() Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result in a fast-commit being done before the filesystem is effectively marked as ineligible. This patch moves the call to this function so that an handle can be used. If a transaction fails to start, then there's not point in trying to mark the filesystem as ineligible, and an error will eventually be returned to user-space. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-10-04 17:36:09 -04:00
Luis Henriques (SUSE)	faab35a037	ext4: use handle to mark fc as ineligible in __track_dentry_update() Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result in a fast-commit being done before the filesystem is effectively marked as ineligible. This patch fixes the calls to this function in __track_dentry_update() by adding an extra parameter to the callback used in ext4_fc_track_template(). Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-10-04 17:35:54 -04:00
Linus Torvalds	f8ffbc365f	struct fd layout change (and conversion to accessor helpers) -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZvDNmgAKCRBZ7Krx/gZQ 63zrAP9vI0rf55v27twiabe9LnI7aSx5ckoqXxFIFxyT3dOYpQD/bPmoApnWDD3d 592+iDgLsema/H/0/CqfqlaNtDNY8Q0= =HUl5 -----END PGP SIGNATURE----- Merge tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.	2024-09-23 09:35:36 -07:00
Linus Torvalds	056f8c437d	Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmbsGRcACgkQ8vlZVpUN gaP+pwgAop3LUpOFQ9dPRTR3+37AJI8adfabfLIDkEkoVA7lyYY/6Q8pcQ0rklq3 wE1WxrJ7MaE1GaFCwRIDIL6TP+uYRK0pPjqbFBxGakhDc+WXrTcALOWWofb7J7PL FLwP264lRRfKfpMHdK8bx6YHnEN8425PR+ZNXGVPsw+wjo72mmnq54w+ct1iOKiw dKfIrwwCGKlBsNdYHS/XsSx7MMK8e7nsKoSq0UtpJ4PqF11/asOtlYYODc4hd27U E3I3UDKuntmz+meAscDejOJqQk5FT184HIt/Y5JfetKU2zpUFj9IKqXDzMjijdaj vGn9RkTXfJdxMPm1ouF2R6KIRJollg== =V7+A -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling" * tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits) ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C ext4: check stripe size compatibility on remount as well ext4: fix i_data_sem unlock order in ext4_ind_migrate() ext4: remove the special buffer dirty handling in do_journal_get_write_access ext4: fix a potential assertion failure due to improperly dirtied buffer ext4: hoist ext4_block_write_begin and replace the __block_write_begin ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers ext4: dax: keep orphan list before truncate overflow allocated blocks ext4: fix error message when rejecting the default hash ext4: save unnecessary indentation in ext4_ext_create_new_leaf() ext4: make some fast commit functions reuse extents path ext4: refactor ext4_swap_extents() to reuse extents path ext4: get rid of ppath in convert_initialized_extent() ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() ext4: get rid of ppath in ext4_ext_convert_to_initialized() ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() ext4: get rid of ppath in ext4_split_convert_extents() ext4: get rid of ppath in ext4_split_extent() ext4: get rid of ppath in ext4_force_split_extent_at() ext4: get rid of ppath in ext4_split_extent_at() ...	2024-09-20 19:26:45 -07:00
Linus Torvalds	3352633ce6	vfs-6.12.file -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEwAAKCRCRxhvAZXjc osS0AQCgIpvey9oW5DMyMw6Bv0hFMRv95gbNQZfHy09iK+NMNAD9GALhb/4cMIVB 7YrZGXEz454lpgcs8AnrOVjVNfctOQg= =e9s9 -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...	2024-09-16 09:14:02 +02:00
Christian Brauner	4f05ee2f82	ext4: store cookie in private data Store the cookie to detect concurrent seeks on directories in file->private_data. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-11-6d3e4816aa7b@kernel.org Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-09-09 11:58:08 +02:00
Ojaswin Mujoo	ff2beee206	ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C Although we have checks to make sure s_stripe is a multiple of cluster size, in case we accidentally end up with a scenario where this is not the case, use EXT4_NUM_B2C() so that we don't end up with unexpected cases where EXT4_B2C(stripe) becomes 0. Also make the is_stripe_aligned check in regular_allocator a bit more robust while we are at it. This should ideally have no functional change unless we have a bug somewhere causing (stripe % cluster_size != 0) Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/e0c0a3b58a40935a1361f668851d041575861411.1725002410.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Ojaswin Mujoo	ee85e0938a	ext4: check stripe size compatibility on remount as well We disable stripe size in __ext4_fill_super if it is not a multiple of the cluster ratio however this check is missed when trying to remount. This can leave us with cases where stripe < cluster_ratio after remount:set making EXT4_B2C(sbi->s_stripe) become 0 that can cause some unforeseen bugs like divide by 0. Fix that by adding the check in remount path as well. Reported-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com Tested-by: syzbot+1ad8bac5af24d01e2cbd@syzkaller.appspotmail.com Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Fixes: `c3defd99d5` ("ext4: treat stripe in block unit") Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/3a493bb503c3598e25dcfbed2936bb2dff3fece7.1725002410.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Artem Sadovnikov	cc749e61c0	ext4: fix i_data_sem unlock order in ext4_ind_migrate() Fuzzing reports a possible deadlock in jbd2_log_wait_commit. This issue is triggered when an EXT4_IOC_MIGRATE ioctl is set to require synchronous updates because the file descriptor is opened with O_SYNC. This can lead to the jbd2_journal_stop() function calling jbd2_might_wait_for_commit(), potentially causing a deadlock if the EXT4_IOC_MIGRATE call races with a write(2) system call. This problem only arises when CONFIG_PROVE_LOCKING is enabled. In this case, the jbd2_might_wait_for_commit macro locks jbd2_handle in the jbd2_journal_stop function while i_data_sem is locked. This triggers lockdep because the jbd2_journal_start function might also lock the same jbd2_handle simultaneously. Found by Linux Verification Center (linuxtesting.org) with syzkaller. Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Co-developed-by: Mikhail Ukhin <mish.uxin2012@yandex.ru> Signed-off-by: Mikhail Ukhin <mish.uxin2012@yandex.ru> Signed-off-by: Artem Sadovnikov <ancowi69@gmail.com> Rule: add Link: https://lore.kernel.org/stable/20240404095000.5872-1-mish.uxin2012%40yandex.ru Link: https://patch.msgid.link/20240829152210.2754-1-ancowi69@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Shida Zhang	183aa1d3ba	ext4: remove the special buffer dirty handling in do_journal_get_write_access This kinda revert the commit 56d35a4cd13e("ext4: Fix dirtying of journalled buffers in data=journal mode") made by Jan 14 years ago, since the do_get_write_access() itself can deal with the extra unexpected buf dirting things in a proper way now. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-5-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Shida Zhang	cb3de5fc87	ext4: fix a potential assertion failure due to improperly dirtied buffer On an old kernel version(4.19, ext3, data=journal, pagesize=64k), an assertion failure will occasionally be triggered by the line below: ----------- jbd2_journal_commit_transaction { ... J_ASSERT_BH(bh, !buffer_dirty(bh)); /* * The buffer on BJ_Forget list and not jbddirty means ... } ----------- The same condition may also be applied to the lattest kernel version. When blocksize < pagesize and we truncate a file, there can be buffers in the mapping tail page beyond i_size. These buffers will be filed to transaction's BJ_Forget list by ext4_journalled_invalidatepage() during truncation. When the transaction doing truncate starts committing, we can grow the file again. This calls __block_write_begin() which allocates new blocks under these buffers in the tail page we go through the branch: if (buffer_new(bh)) { clean_bdev_bh_alias(bh); if (folio_test_uptodate(folio)) { clear_buffer_new(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); continue; } ... } Hence buffers on BJ_Forget list of the committing transaction get marked dirty and this triggers the jbd2 assertion. Teach ext4_block_write_begin() to properly handle files with data journalling by avoiding dirtying them directly. Instead of folio_zero_new_buffers() we use ext4_journalled_zero_new_buffers() which takes care of handling journalling. We also don't need to mark new uptodate buffers as dirty in ext4_block_write_begin(). That will be either done either by block_commit_write() in case of success or by folio_zero_new_buffers() in case of failure. Reported-by: Baolin Liu <liubaolin@kylinos.cn> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-4-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Shida Zhang	6b730a4050	ext4: hoist ext4_block_write_begin and replace the __block_write_begin Using __block_write_begin() make it inconvenient to journal the user data dirty process. We can't tell the block layer maintainer, ‘Hey, we want to trace the dirty user data in ext4, can we add some special code for ext4 in __block_write_begin?’:P So use ext4_block_write_begin() instead. The two functions are basically doing the same thing except for the fscrypt related code. Remove the unnecessary #ifdef since fscrypt_inode_uses_fs_layer_crypto() returns false (and it's known at compile time) when !CONFIG_FS_ENCRYPTION. And hoist the ext4_block_write_begin so that it can be used in other files. Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-3-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
Shida Zhang	3910b513fc	ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers For new uptodate buffers we also need to call write_end_fn() to persist the uptodate content, similarly as folio_zero_new_buffers() does it. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240830053739.3588573-2-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:17 -04:00
yangerkun	59efe53e38	ext4: dax: keep orphan list before truncate overflow allocated blocks Any extending write for ext4 requires the inode to be placed on the orphan list before the actual write. In addition, the inode can be actually removed from the orphan list only after all writes are completed. Otherwise we'd leave allocated blocks beyond i_disksize if we could not copy all the data into allocated block and e2fsck would complain. Currently, direct IO and buffered IO comply with this logic(buffered IO will truncate all overflow allocated blocks that has not been written successfully, and direct IO will truncate all allocated blocks when error occurs). However, dax write break this since dax write will remove the inode from the orphan list by calling ext4_handle_inode_extension unconditionally during extending write. We add a argument to help determine does we do a fully write, and for the case not fully write, we leave the inode on the orphan list, and the latter ext4_inode_extension_cleanup will help us truncate the overflow allocated blocks, and then remove the inode from the orphan list. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240829110222.126685-1-yangerkun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:16 -04:00
Gabriel Krisman Bertazi	a2187431c3	ext4: fix error message when rejecting the default hash Commit `985b67cd86` ("ext4: filesystems without casefold feature cannot be mounted with siphash") properly rejects volumes where s_def_hash_version is set to DX_HASH_SIPHASH, but the check and the error message should not look into casefold setup - a filesystem should never have DX_HASH_SIPHASH as the default hash. Fix it and, since we are there, move the check to ext4_hash_info_init. Fixes:985b67cd8639 ("ext4: filesystems without casefold feature cannot be mounted with siphash") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://patch.msgid.link/87jzg1en6j.fsf_-_@mailhost.krisman.be Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:16 -04:00
Baokun Li	5f48d4d9d8	ext4: save unnecessary indentation in ext4_ext_create_new_leaf() Save an indentation level in ext4_ext_create_new_leaf() by removing unnecessary 'else'. Besides, the variable 'ee_block' is declared to avoid line breaks. No functional changes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-26-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:16 -04:00
Baokun Li	2352e3e461	ext4: make some fast commit functions reuse extents path The ext4_find_extent() can update the extent path so that it does not have to allocate and free the path repeatedly, thus reducing the consumption of memory allocation and freeing in the following functions: ext4_ext_clear_bb ext4_ext_replay_set_iblocks ext4_fc_replay_add_range ext4_fc_set_bitmaps_and_counters No functional changes. Note that ext4_find_extent() does not support error pointers, so in this case set path to NULL first. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-25-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:14:12 -04:00
Baokun Li	a2c613b8c4	ext4: refactor ext4_swap_extents() to reuse extents path The ext4_find_extent() can update the extent path so it doesn't have to allocate and free path repeatedly, thus reducing the consumption of memory allocation and freeing in ext4_swap_extents(). Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-24-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	4191eefef9	ext4: get rid of ppath in convert_initialized_extent() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in convert_initialized_extent(), the following is done here: * Free the extents path when an error is encountered. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-23-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	2ec2e10434	ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_handle_unwritten_extents(), the following is done here: * Free the extents path when an error is encountered. * The 'allocated' is changed from passing a value to passing an address. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-22-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	33c14b8bd8	ext4: get rid of ppath in ext4_ext_convert_to_initialized() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_convert_to_initialized(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * The 'allocated' is changed from passing a value to passing an address. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-21-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	8d5ad7b08f	ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_convert_unwritten_extents_endio(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-20-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	225057b1af	ext4: get rid of ppath in ext4_split_convert_extents() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_convert_extents(), the following is done here: * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-19-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:18 -04:00
Baokun Li	f74cde0456	ext4: get rid of ppath in ext4_split_extent() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_extent(), the following is done here: * The 'allocated' is changed from passing a value to passing an address. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-18-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	f07be1c367	ext4: get rid of ppath in ext4_force_split_extent_at() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_force_split_extent_at(), the following is done here: * Free the extents path when an error is encountered. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-17-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	1de82b1b60	ext4: get rid of ppath in ext4_split_extent_at() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_split_extent_at(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * Teach ext4_ext_show_leaf() to skip error pointer. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-16-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	f7d1331f16	ext4: get rid of ppath in ext4_ext_insert_extent() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_insert_extent(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. * Free path when npath is used, free npath when it is not used. * The got_allocated_blocks label in ext4_ext_map_blocks() does not update err now, so err is updated to 0 if the err returned by ext4_ext_search_right() is greater than 0 and is about to enter got_allocated_blocks. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-15-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	a000bc8678	ext4: get rid of ppath in ext4_ext_create_new_leaf() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. To get rid of the ppath in ext4_ext_create_new_leaf(), the following is done here: * Free the extents path when an error is encountered. * Its caller needs to update ppath if it uses ppath. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-14-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	6b854d5527	ext4: get rid of ppath in get_ext_path() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. After getting rid of ppath in get_ext_path(), its caller may pass an error pointer to ext4_free_ext_path(), so it needs to teach ext4_free_ext_path() and ext4_ext_drop_refs() to skip the error pointer. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-13-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	0be4c0c2f1	ext4: get rid of ppath in ext4_find_extent() The use of path and ppath is now very confusing, so to make the code more readable, pass path between functions uniformly, and get rid of ppath. Getting rid of ppath in ext4_find_extent() requires its caller to update ppath. These ppaths will also be dropped later. No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-12-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	369c944ed1	ext4: propagate errors from ext4_find_extent() in ext4_insert_range() Even though ext4_find_extent() returns an error, ext4_insert_range() still returns 0. This may confuse the user as to why fallocate returns success, but the contents of the file are not as expected. So propagate the error returned by ext4_find_extent() to avoid inconsistencies. Fixes: `331573febb` ("ext4: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-11-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	6c2b3246cd	ext4: add new ext4_ext_path_brelse() helper Add ext4_ext_path_brelse() helper function to reduce duplicate code and ensure that path->p_bh is set to NULL after it is released. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-10-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	dcaa6c3113	ext4: fix double brelse() the buffer of the extents path In ext4_ext_try_to_merge_up(), set path[1].p_bh to NULL after it has been released, otherwise it may be released twice. An example of what triggers this is as follows: split2 map split1 \|--------\|-------\|--------\| ext4_ext_map_blocks ext4_ext_handle_unwritten_extents ext4_split_convert_extents // path->p_depth == 0 ext4_split_extent // 1. do split1 ext4_split_extent_at \|ext4_ext_insert_extent \| ext4_ext_create_new_leaf \| ext4_ext_grow_indepth \| le16_add_cpu(&neh->eh_depth, 1) \| ext4_find_extent \| // return -ENOMEM \|// get error and try zeroout \|path = ext4_find_extent \| path->p_depth = 1 \|ext4_ext_try_to_merge \| ext4_ext_try_to_merge_up \| path->p_depth = 0 \| brelse(path[1].p_bh) ---> not set to NULL here \|// zeroout success // 2. update path ext4_find_extent // 3. do split2 ext4_split_extent_at ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth le16_add_cpu(&neh->eh_depth, 1) ext4_find_extent path[0].p_bh = NULL; path->p_depth = 1 read_extent_tree_block ---> return err // path[1].p_bh is still the old value ext4_free_ext_path ext4_ext_drop_refs // path->p_depth == 1 brelse(path[1].p_bh) ---> brelse a buffer twice Finally got the following WARRNING when removing the buffer from lru: ============================================ VFS: brelse: Trying to free free buffer WARNING: CPU: 2 PID: 72 at fs/buffer.c:1241 __brelse+0x58/0x90 CPU: 2 PID: 72 Comm: kworker/u19:1 Not tainted 6.9.0-dirty #716 RIP: 0010:__brelse+0x58/0x90 Call Trace: <TASK> __find_get_block+0x6e7/0x810 bdev_getblk+0x2b/0x480 __ext4_get_inode_loc+0x48a/0x1240 ext4_get_inode_loc+0xb2/0x150 ext4_reserve_inode_write+0xb7/0x230 __ext4_mark_inode_dirty+0x144/0x6a0 ext4_ext_insert_extent+0x9c8/0x3230 ext4_ext_map_blocks+0xf45/0x2dc0 ext4_map_blocks+0x724/0x1700 ext4_do_writepages+0x12d6/0x2a70 [...] ============================================ Fixes: `ecb94f5fdf` ("ext4: collapse a single extent tree block into the inode if possible") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-9-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	5c0f4cc84d	ext4: drop ppath from ext4_ext_replay_update_ex() to avoid double-free When calling ext4_force_split_extent_at() in ext4_ext_replay_update_ex(), the 'ppath' is updated but it is the 'path' that is freed, thus potentially triggering a double-free in the following process: ext4_ext_replay_update_ex ppath = path ext4_force_split_extent_at(&ppath) ext4_split_extent_at ext4_ext_insert_extent ext4_ext_create_new_leaf ext4_ext_grow_indepth ext4_find_extent if (depth > path[0].p_maxdepth) kfree(path) ---> path First freed *orig_path = path = NULL ---> null ppath kfree(path) ---> path double-free !!! So drop the unnecessary ppath and use path directly to avoid this problem. And use ext4_find_extent() directly to update path, avoiding unnecessary memory allocation and freeing. Also, propagate the error returned by ext4_find_extent() instead of using strange error codes. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-8-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:17 -04:00
Baokun Li	a164f3a432	ext4: aovid use-after-free in ext4_ext_insert_extent() As Ojaswin mentioned in Link, in ext4_ext_insert_extent(), if the path is reallocated in ext4_ext_create_new_leaf(), we'll use the stale path and cause UAF. Below is a sample trace with dummy values: ext4_ext_insert_extent path = ppath = 2000 ext4_ext_create_new_leaf(ppath) ext4_find_extent(ppath) path = ppath = 2000 if (depth > path[0].p_maxdepth) kfree(path = 2000); ppath = path = NULL; path = kcalloc() = 3000 ppath = 3000; return path; /* here path is still 2000, UAF! / eh = path[depth].p_hdr ================================================================== BUG: KASAN: slab-use-after-free in ext4_ext_insert_extent+0x26d4/0x3330 Read of size 8 at addr ffff8881027bf7d0 by task kworker/u36:1/179 CPU: 3 UID: 0 PID: 179 Comm: kworker/u6:1 Not tainted 6.11.0-rc2-dirty #866 Call Trace: <TASK> ext4_ext_insert_extent+0x26d4/0x3330 ext4_ext_map_blocks+0xe22/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 [...] Allocated by task 179: ext4_find_extent+0x81c/0x1f70 ext4_ext_map_blocks+0x146/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 ext4_writepages+0x26d/0x4e0 do_writepages+0x175/0x700 [...] Freed by task 179: kfree+0xcb/0x240 ext4_find_extent+0x7c0/0x1f70 ext4_ext_insert_extent+0xa26/0x3330 ext4_ext_map_blocks+0xe22/0x2d40 ext4_map_blocks+0x71e/0x1700 ext4_do_writepages+0x1290/0x2800 ext4_writepages+0x26d/0x4e0 do_writepages+0x175/0x700 [...] ================================================================== So use ppath to update the path to avoid the above problem. Reported-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Closes: https://lore.kernel.org/r/ZqyL6rmtwl6N4MWR@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com Fixes: `10809df84a` ("ext4: teach ext4_ext_find_extent() to realloc path if necessary") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-7-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Baokun Li	5b4b2dcace	ext4: update orig_path in ext4_find_extent() In ext4_find_extent(), if the path is not big enough, we free it and set orig_path to NULL. But after reallocating and successfully initializing the path, we don't update orig_path, in which case the caller gets a valid path but a NULL ppath, and this may cause a NULL pointer dereference or a path memory leak. For example: ext4_split_extent path = ppath = 2000 ext4_find_extent if (depth > path[0].p_maxdepth) kfree(path = 2000); orig_path = path = NULL; path = kcalloc() = 3000 ext4_split_extent_at(ppath = NULL) path = ppath; ex = path[depth].p_ext; // NULL pointer dereference! ================================================================== BUG: kernel NULL pointer dereference, address: 0000000000000010 CPU: 6 UID: 0 PID: 576 Comm: fsstress Not tainted 6.11.0-rc2-dirty #847 RIP: 0010:ext4_split_extent_at+0x6d/0x560 Call Trace: <TASK> ext4_split_extent.isra.0+0xcb/0x1b0 ext4_ext_convert_to_initialized+0x168/0x6c0 ext4_ext_handle_unwritten_extents+0x325/0x4d0 ext4_ext_map_blocks+0x520/0xdb0 ext4_map_blocks+0x2b0/0x690 ext4_iomap_begin+0x20e/0x2c0 [...] ================================================================== Therefore, orig_path is updated when the extent lookup succeeds, so that the caller can safely use path or ppath. Fixes: `10809df84a` ("ext4: teach ext4_ext_find_extent() to realloc path if necessary") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240822023545.1994557-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Baokun Li	4e2524ba2c	ext4: avoid use-after-free in ext4_ext_show_leaf() In ext4_find_extent(), path may be freed by error or be reallocated, so using a previously saved ppath may have been freed and thus may trigger use-after-free, as follows: ext4_split_extent path = ppath; ext4_split_extent_at(ppath) path = ext4_find_extent(ppath) ext4_split_extent_at(ppath) // ext4_find_extent fails to free path // but zeroout succeeds ext4_ext_show_leaf(inode, path) eh = path[depth].p_hdr // path use-after-free !!! Similar to ext4_split_extent_at(), we use ppath directly as an input to ext4_ext_show_leaf(). Fix a spelling error by the way. Same problem in ext4_ext_handle_unwritten_extents(). Since 'path' is only used in ext4_ext_show_leaf(), remove 'path' and use ppath directly. This issue is triggered only when EXT_DEBUG is defined and therefore does not affect functionality. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-5-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Baokun Li	c26ab35702	ext4: fix slab-use-after-free in ext4_split_extent_at() We hit the following use-after-free: ================================================================== BUG: KASAN: slab-use-after-free in ext4_split_extent_at+0xba8/0xcc0 Read of size 2 at addr ffff88810548ed08 by task kworker/u20:0/40 CPU: 0 PID: 40 Comm: kworker/u20:0 Not tainted 6.9.0-dirty #724 Call Trace: <TASK> kasan_report+0x93/0xc0 ext4_split_extent_at+0xba8/0xcc0 ext4_split_extent.isra.0+0x18f/0x500 ext4_split_convert_extents+0x275/0x750 ext4_ext_handle_unwritten_extents+0x73e/0x1580 ext4_ext_map_blocks+0xe20/0x2dc0 ext4_map_blocks+0x724/0x1700 ext4_do_writepages+0x12d6/0x2a70 [...] Allocated by task 40: __kmalloc_noprof+0x1ac/0x480 ext4_find_extent+0xf3b/0x1e70 ext4_ext_map_blocks+0x188/0x2dc0 ext4_map_blocks+0x724/0x1700 ext4_do_writepages+0x12d6/0x2a70 [...] Freed by task 40: kfree+0xf1/0x2b0 ext4_find_extent+0xa71/0x1e70 ext4_ext_insert_extent+0xa22/0x3260 ext4_split_extent_at+0x3ef/0xcc0 ext4_split_extent.isra.0+0x18f/0x500 ext4_split_convert_extents+0x275/0x750 ext4_ext_handle_unwritten_extents+0x73e/0x1580 ext4_ext_map_blocks+0xe20/0x2dc0 ext4_map_blocks+0x724/0x1700 ext4_do_writepages+0x12d6/0x2a70 [...] ================================================================== The flow of issue triggering is as follows: ext4_split_extent_at path = ppath ext4_ext_insert_extent(ppath) ext4_ext_create_new_leaf(ppath) ext4_find_extent(orig_path) path = orig_path read_extent_tree_block // return -ENOMEM or -EIO ext4_free_ext_path(path) kfree(path) orig_path = NULL a. If err is -ENOMEM: ext4_ext_dirty(path + path->p_depth) // path use-after-free !!! b. If err is -EIO and we have EXT_DEBUG defined: ext4_ext_show_leaf(path) eh = path[depth].p_hdr // path also use-after-free !!! So when trying to zeroout or fix the extent length, call ext4_find_extent() to update the path. In addition we use ppath directly as an ext4_ext_show_leaf() input to avoid possible use-after-free when EXT_DEBUG is defined, and to avoid unnecessary path updates. Fixes: `dfe5080939` ("ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Baokun Li	3e8a584c82	ext4: prevent partial update of the extents path In ext4_ext_rm_idx() and ext4_ext_correct_indexes(), there is no proper rollback of already executed updates when updating a level of the extents path fails, so we may get an inconsistent extents tree, which may trigger some bad things in errors=continue mode. Hence clear the verified bit of modified extents buffers if the tree fails to be updated in ext4_ext_rm_idx() or ext4_ext_correct_indexes(), which forces the extents buffers to be checked in ext4_valid_extent_entries(), ensuring that the extents tree is consistent. Signed-off-by: zhanchengbin <zhanchengbin1@huawei.com> Link: https://lore.kernel.org/r/20230213080514.535568-3-zhanchengbin1@huawei.com/ Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Baokun Li	edfa71dbe8	ext4: refactor ext4_ext_rm_idx() to index 'path' As suggested by Honza in Link，modify ext4_ext_rm_idx() to leave 'path' alone and just index it like ext4_ext_correct_indexes() does it. This facilitates adding error handling later. No functional changes. Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/20230216130305.nrbtd42tppxhbynn@quack3/ Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/20240822023545.1994557-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Thadeu Lima de Souza Cascardo	c6b72f5d82	ext4: avoid OOB when system.data xattr changes underneath the filesystem When looking up for an entry in an inlined directory, if e_value_offs is changed underneath the filesystem by some change in the block device, it will lead to an out-of-bounds access that KASAN detects as an UAF. EXT4-fs (loop0): mounted filesystem 00000000-0000-0000-0000-000000000000 r/w without journal. Quota mode: none. loop0: detected capacity change from 2048 to 2047 ================================================================== BUG: KASAN: use-after-free in ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 Read of size 1 at addr ffff88803e91130f by task syz-executor269/5103 CPU: 0 UID: 0 PID: 5103 Comm: syz-executor269 Not tainted 6.11.0-rc4-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 ext4_search_dir+0xf2/0x1c0 fs/ext4/namei.c:1500 ext4_find_inline_entry+0x4be/0x5e0 fs/ext4/inline.c:1697 __ext4_find_entry+0x2b4/0x1b30 fs/ext4/namei.c:1573 ext4_lookup_entry fs/ext4/namei.c:1727 [inline] ext4_lookup+0x15f/0x750 fs/ext4/namei.c:1795 lookup_one_qstr_excl+0x11f/0x260 fs/namei.c:1633 filename_create+0x297/0x540 fs/namei.c:3980 do_symlinkat+0xf9/0x3a0 fs/namei.c:4587 __do_sys_symlinkat fs/namei.c:4610 [inline] __se_sys_symlinkat fs/namei.c:4607 [inline] __x64_sys_symlinkat+0x95/0xb0 fs/namei.c:4607 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f3e73ced469 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff4d40c258 EFLAGS: 00000246 ORIG_RAX: 000000000000010a RAX: ffffffffffffffda RBX: 0032656c69662f2e RCX: 00007f3e73ced469 RDX: 0000000020000200 RSI: 00000000ffffff9c RDI: 00000000200001c0 RBP: 0000000000000000 R08: 00007fff4d40c290 R09: 00007fff4d40c290 R10: 0023706f6f6c2f76 R11: 0000000000000246 R12: 00007fff4d40c27c R13: 0000000000000003 R14: 431bde82d7b634db R15: 00007fff4d40c2b0 </TASK> Calling ext4_xattr_ibody_find right after reading the inode with ext4_get_inode_loc will lead to a check of the validity of the xattrs, avoiding this problem. Reported-by: syzbot+0c2508114d912a54ee79@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0c2508114d912a54ee79 Fixes: `e8e948e780` ("ext4: let ext4_find_entry handle inline data") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-5-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Thadeu Lima de Souza Cascardo	51e14e78b5	ext4: explicitly exit when ext4_find_inline_entry returns an error __ext4_find_entry currently ignores the return of ext4_find_inline_entry, except for returning the bh or NULL when has_inline_data is 1. Even though has_inline_data is set to 1 before calling ext4_find_inline_entry and would only be set to 0 when that function returns NULL, check for an encoded error return explicitly in order to exit. That makes the code more readable, not requiring that one assumes the cases when has_inline_data is 1. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-4-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Thadeu Lima de Souza Cascardo	4d231b91a9	ext4: return error on ext4_find_inline_entry In case of errors when reading an inode from disk or traversing inline directory entries, return an error-encoded ERR_PTR instead of returning NULL. ext4_find_inline_entry only caller, __ext4_find_entry already returns such encoded errors. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-3-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Thadeu Lima de Souza Cascardo	cd69f8f9de	ext4: ext4_search_dir should return a proper error ext4_search_dir currently returns -1 in case of a failure, while it returns 0 when the name is not found. In such failure cases, it should return an error code instead. This becomes even more important when ext4_find_inline_entry returns an error code as well in the next commit. -EFSCORRUPTED seems appropriate as such error code as these failures would be caused by unexpected record lengths and is in line with other instances of ext4_check_dir_entry failures. In the case of ext4_dx_find_entry, the current use of ERR_BAD_DX_DIR was left as is to reduce the risk of regressions. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Link: https://patch.msgid.link/20240821152324.3621860-2-cascardo@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Kemeng Shi	7d2b488818	ext4: check buffer_verified in advance to avoid unneeded ext4_get_group_info() Check buffer_verified in advance to avoid unneeded ext4_get_group_info(). This could be a simple cleanup as compiler may handle this. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Kemeng Shi	7523a7ef09	ext4: remove unneeded NULL check of buffer_head in ext4_mark_inode_used() If gdp from ext4_get_group_desc() is not NULL, then returned group_desc_bh won't be NULL either. Remove check of group_desc_bh and only check returned gdp from ext4_get_group_desc() like how other callers do. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:16 -04:00
Kemeng Shi	66eafbde7d	ext4: move checksum length calculation of inode bitmap into ext4_inode_bitmap_csum_[verify/set]() functions There are some little improve: 1. remove repeat code to calculate checksum length of inode bitmap 2. remove unnecessary checksum length calculation if checksum is not enabled. 3. use more efficient bit shift operation instead of div opreation. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:15 -04:00
Kemeng Shi	f7c69be505	ext4: remove dead check in __ext4_new_inode() If we can't grab any inode, the prvious find_inode_bit() will set ino to be >= EXT4_INODES_PER_GROUP(sb). So the check of need to repeat in the same group is not needed. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:15 -04:00
Kemeng Shi	bb0a12c343	ext4: avoid negative min_clusters in find_group_orlov() min_clusters is signed integer and will be converted to unsigned integer when compared with unsigned number stats.free_clusters. If min_clusters is negative, it will be converted to a huge unsigned value in which case all groups may not meet the actual desired free clusters. Set negative min_clusters to 0 to avoid unexpected behavior. Fixes: `ac27a0ec11` ("[PATCH] ext4: initial copy of files from ext3") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:15 -04:00
Kemeng Shi	227d31b921	ext4: avoid potential buffer_head leak in __ext4_new_inode() If a group is marked EXT4_GROUP_INFO_IBITMAP_CORRUPT after it's inode bitmap buffer_head was successfully verified, then __ext4_new_inode() will get a valid inode_bitmap_bh of a corrupted group from ext4_read_inode_bitmap() in which case inode_bitmap_bh misses a release. Hnadle "IS_ERR(inode_bitmap_bh)" and group corruption separately like how ext4_free_inode() does to avoid buffer_head leak. Fixes: `9008a58e5d` ("ext4: make the bitmap read routines return real error codes") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:15 -04:00
Kemeng Shi	5e5b2a56c5	ext4: avoid buffer_head leak in ext4_mark_inode_used() Release inode_bitmap_bh from ext4_read_inode_bitmap() in ext4_mark_inode_used() to avoid buffer_head leak. By the way, remove unneeded goto for invalid ino when inode_bitmap_bh is NULL. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240820132234.2759926-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-03 22:12:15 -04:00
yangerkun	20cee68f5b	ext4: clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT even mount with discard Commit `3d56b8d2c7` ("ext4: Speed up FITRIM by recording flags in ext4_group_info") speed up fstrim by skipping trim trimmed group. We also has the chance to clear trimmed once there exists some block free for this group(mount without discard), and the next trim for this group will work well too. For mount with discard, we will issue dicard when we free blocks, so leave trimmed flag keep alive to skip useless trim trigger from userspace seems reasonable. But for some case like ext4 build on dm-thinpool(ext4 blocksize 4K, pool blocksize 128K), discard from ext4 maybe unaligned for dm thinpool, and thinpool will just finish this discard(see process_discard_bio when begein equals to end) without actually process discard. For this case, trim from userspace can really help us to free some thinpool block. So convert to clear trimmed flag for all case no matter mounted with discard or not. Fixes: `3d56b8d2c7` ("ext4: Speed up FITRIM by recording flags in ext4_group_info") Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240817085510.2084444-1-yangerkun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 16:11:24 -04:00
Zhang Yi	2046657e64	ext4: drop all delonly descriptions When counting reserved clusters, delayed type is always equal to delonly type now, hence drop all delonly descriptions in parameters and comments. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-13-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:15 -04:00
Zhang Yi	b224b18497	ext4: drop ext4_es_is_delonly() Since we don't add delayed flag in unwritten extents, so there is no difference between ext4_es_is_delayed() and ext4_es_is_delonly(), just drop ext4_es_is_delonly(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-12-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	ce09036ea4	ext4: make extent status types exclusive Since we don't add delayed flag in unwritten extents, all of the four extent status types EXTENT_STATUS_WRITTEN, EXTENT_STATUS_UNWRITTEN, EXTENT_STATUS_DELAYED and EXTENT_STATUS_HOLE are exclusive now, add assertion when storing pblock before inserting extent into status tree and add comment to the status definition. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	3b4ba269ab	ext4: drop unused ext4_es_store_status() The helper ext4_es_store_status() is unused now, just drop it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	15996a8485	ext4: use ext4_map_query_blocks() in ext4_map_blocks() The blocks map querying logic in ext4_map_blocks() are the same as ext4_map_query_blocks(), so switch to directly use it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	6e124d5b4b	ext4: drop ext4_es_delayed_clu() Since we move ext4_da_update_reserve_space() to ext4_es_insert_extent(), no one uses ext4_es_delayed_clu() and __es_delayed_clu(), just drop them. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	c543e24296	ext4: update delalloc data reserve spcae in ext4_es_insert_extent() Now that we update data reserved space for delalloc after allocating new blocks in ext4_{ind\|ext}_map_blocks(), and if bigalloc feature is enabled, we also need to query the extents_status tree to calculate the exact reserved clusters. This is complicated now and it appears that it's better to do this job in ext4_es_insert_extent(), because __es_remove_extent() have already count delalloc blocks when removing delalloc extents and __revise_pending() return new adding pending count, we could update the reserved blocks easily in ext4_es_insert_extent(). We direct reduce the reserved cluster count when replacing a delalloc extent. However, thers are two special cases need to concern about the quota claiming when doing direct block allocation (e.g. from fallocate). A), fallocate a range that covers a delalloc extent but start with non-delayed allocated blocks, e.g. a hole. hhhhhhh+ddddddd+ddddddd ^^^^^^^^^^^^^^^^^^^^^^^ fallocate this range Current ext4_map_blocks() can't always trim the extent since it may release i_data_sem before calling ext4_map_create_blocks() and raced by another delayed allocation. Hence the EXT4_GET_BLOCKS_DELALLOC_RESERVE may not set even when we are replacing a delalloc extent, without this flag set, the quota has already been claimed by ext4_mb_new_blocks(), so we should release the quota reservations instead of claim them again. B), bigalloc feature is enabled, fallocate a range that contains non-delayed allocated blocks. \|< one cluster >\| hhhhhhh+hhhhhhh+hhhhhhh+ddddddd ^^^^^^^ fallocate this range This case is similar to above case, the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag is also not set. Hence we should release the quota reservations if we replace a delalloc extent but without EXT4_GET_BLOCKS_DELALLOC_RESERVE set. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	f3baf33b9c	ext4: passing block allocation information to ext4_es_insert_extent() Just pass the block allocation flag to ext4_es_insert_extent() when we replacing a current extent after an actually block allocation or extent status conversion, this flag will be used by later changes. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	fccd632670	ext4: let __revise_pending() return newly inserted pendings Let __insert_pending() return 1 after successfully inserting a new pending cluster, and also let __revise_pending() to return the number of of newly inserted pendings. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240813123452.2824659-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:14 -04:00
Zhang Yi	eba8c368c8	ext4: don't set EXTENT_STATUS_DELAYED on allocated blocks Currently, we release delayed allocation reservation when removing delayed extent from extent status tree (which also happens when overwriting one extent with another one). When we allocated unwritten extent under some delayed allocated extent, we don't need the reservation anymore and hence we don't need to preserve the EXT4_MAP_DELAYED status bit. Allocating the new extent blocks will properly release the reservation. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:13 -04:00
Zhang Yi	8b8252884f	ext4: optimize the EXT4_GET_BLOCKS_DELALLOC_RESERVE flag set When doing block allocation, magic EXT4_GET_BLOCKS_DELALLOC_RESERVE means the allocating range covers a range of delayed allocated clusters, the blocks and quotas have already been reserved in ext4_da_map_blocks(), we should update the reserved space and don't need to claim them again. At the moment, we only set this magic in mpage_map_one_extent() when allocating a range of delayed allocated clusters in the write back path, it makes things complicated since we have to notice and deal with the case of allocating non-delayed allocated clusters separately in ext4_ext_map_blocks(). For example, it we fallocate some blocks that have been delayed allocated, free space would be claimed again in ext4_mb_new_blocks() (this is wrong exactily), and we can't claim quota space again, we have to release the quota reservations made for that previously delayed allocated clusters. Move the position thats set the EXT4_GET_BLOCKS_DELALLOC_RESERVE to where we actually do block allocation, it could simplify above handling a lot, it means that we always set this magic once the allocation range covers delalloc blocks, no need to take care of the allocation path. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240813123452.2824659-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:13 -04:00
Zhang Yi	130078d020	ext4: factor out ext4_map_create_blocks() to allocate new blocks Factor out a common helper ext4_map_create_blocks() from ext4_map_blocks() to do a real blocks allocation, no logic changes. [ Note: this first patch of a ten patch series named "v3: simplify the counting and management of delalloc reserved blocks". The link to the v1 and v2 patch series are below. -- TYT ] Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240802115120.362902-1-yi.zhang@huaweicloud.com # v2 of patch series Link: https://patch.msgid.link/20240601034149.2169771-1-yi.zhang@huaweicloud.com # v1 of the patch series Link: https://patch.msgid.link/20240813123452.2824659-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-09-02 15:26:13 -04:00
Zhihao Cheng	dda898d7ff	ext4: dax: fix overflowing extents beyond inode size when partially writing The dax_iomap_rw() does two things in each iteration: map written blocks and copy user data to blocks. If the process is killed by user(See signal handling in dax_iomap_iter()), the copied data will be returned and added on inode size, which means that the length of written extents may exceed the inode size, then fsck will fail. An example is given as: dd if=/dev/urandom of=file bs=4M count=1 dax_iomap_rw iomap_iter // round 1 ext4_iomap_begin ext4_iomap_alloc // allocate 0~2M extents(written flag) dax_iomap_iter // copy 2M data iomap_iter // round 2 iomap_iter_advance iter->pos += iter->processed // iter->pos = 2M ext4_iomap_begin ext4_iomap_alloc // allocate 2~4M extents(written flag) dax_iomap_iter fatal_signal_pending done = iter->pos - iocb->ki_pos // done = 2M ext4_handle_inode_extension ext4_update_inode_size // inode size = 2M fsck reports: Inode 13, i_size is 2097152, should be 4194304. Fix? Fix the problem by truncating extents if the written length is smaller than expected. Fixes: `776722e85d` ("ext4: DAX iomap write support") CC: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=219136 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://patch.msgid.link/20240809121532.2105494-1-chengzhihao@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 23:56:49 -04:00
Jan Kara	d3476f3dad	ext4: don't set SB_RDONLY after filesystem errors When the filesystem is mounted with errors=remount-ro, we were setting SB_RDONLY flag to stop all filesystem modifications. We knew this misses proper locking (sb->s_umount) and does not go through proper filesystem remount procedure but it has been the way this worked since early ext2 days and it was good enough for catastrophic situation damage mitigation. Recently, syzbot has found a way (see link) to trigger warnings in filesystem freezing because the code got confused by SB_RDONLY changing under its hands. Since these days we set EXT4_FLAGS_SHUTDOWN on the superblock which is enough to stop all filesystem modifications, modifying SB_RDONLY shouldn't be needed. So stop doing that. Link: https://lore.kernel.org/all/000000000000b90a8e061e21d12f@google.com Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://patch.msgid.link/20240805201241.27286-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 23:53:20 -04:00
Wojciech Gładysz	d1bc560e9a	ext4: nested locking for xattr inode Add nested locking with I_MUTEX_XATTR subclass to avoid lockdep warning while handling xattr inode on file open syscall at ext4_xattr_inode_iget. Backtrace EXT4-fs (loop0): Ignoring removed oldalloc option ====================================================== WARNING: possible circular locking dependency detected 5.10.0-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor543/2794 is trying to acquire lock: ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline] ffff8880215e1a48 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}, at: ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 but task is already holding lock: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&ei->i_data_sem/3){++++}-{3:3}: lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 ext4_update_i_disksize fs/ext4/ext4.h:3267 [inline] ext4_xattr_inode_write fs/ext4/xattr.c:1390 [inline] ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1538 [inline] ext4_xattr_set_entry+0x331a/0x3d80 fs/ext4/xattr.c:1662 ext4_xattr_ibody_set+0x124/0x390 fs/ext4/xattr.c:2228 ext4_xattr_set_handle+0xc27/0x14e0 fs/ext4/xattr.c:2385 ext4_xattr_set+0x219/0x390 fs/ext4/xattr.c:2498 ext4_xattr_user_set+0xc9/0xf0 fs/ext4/xattr_user.c:40 __vfs_setxattr+0x404/0x450 fs/xattr.c:177 __vfs_setxattr_noperm+0x11d/0x4f0 fs/xattr.c:208 __vfs_setxattr_locked+0x1f9/0x210 fs/xattr.c:266 vfs_setxattr+0x112/0x2c0 fs/xattr.c:283 setxattr+0x1db/0x3e0 fs/xattr.c:548 path_setxattr+0x15a/0x240 fs/xattr.c:567 __do_sys_setxattr fs/xattr.c:582 [inline] __se_sys_setxattr fs/xattr.c:578 [inline] __x64_sys_setxattr+0xc5/0xe0 fs/xattr.c:578 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb -> #0 (&ea_inode->i_rwsem#7/1){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:2988 [inline] check_prevs_add kernel/locking/lockdep.c:3113 [inline] validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 inode_lock include/linux/fs.h:782 [inline] ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline] ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline] ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline] __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562 notify_change+0xbb6/0xe60 fs/attr.c:435 do_truncate+0x1de/0x2c0 fs/open.c:64 handle_truncate fs/namei.c:2970 [inline] do_open fs/namei.c:3311 [inline] path_openat+0x29f3/0x3290 fs/namei.c:3425 do_filp_open+0x20b/0x450 fs/namei.c:3452 do_sys_openat2+0x124/0x460 fs/open.c:1207 do_sys_open fs/open.c:1223 [inline] __do_sys_open fs/open.c:1231 [inline] __se_sys_open fs/open.c:1227 [inline] __x64_sys_open+0x221/0x270 fs/open.c:1227 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_data_sem/3); lock(&ea_inode->i_rwsem#7/1); lock(&ei->i_data_sem/3); lock(&ea_inode->i_rwsem#7/1); * DEADLOCK * 5 locks held by syz-executor543/2794: #0: ffff888026fbc448 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x4a/0x2a0 fs/namespace.c:365 #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: inode_lock include/linux/fs.h:782 [inline] #1: ffff8880215e3488 (&sb->s_type->i_mutex_key#7){++++}-{3:3}, at: do_truncate+0x1cf/0x2c0 fs/open.c:62 #2: ffff8880215e3310 (&ei->i_mmap_sem){++++}-{3:3}, at: ext4_setattr+0xec4/0x19c0 fs/ext4/inode.c:5519 #3: ffff8880215e3278 (&ei->i_data_sem/3){++++}-{3:3}, at: ext4_setattr+0x136d/0x19c0 fs/ext4/inode.c:5559 #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_write_trylock_xattr fs/ext4/xattr.h:162 [inline] #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: ext4_try_to_expand_extra_isize fs/ext4/inode.c:5938 [inline] #4: ffff8880215e30c8 (&ei->xattr_sem){++++}-{3:3}, at: __ext4_mark_inode_dirty+0x4fb/0x810 fs/ext4/inode.c:6018 stack backtrace: CPU: 1 PID: 2794 Comm: syz-executor543 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x177/0x211 lib/dump_stack.c:118 print_circular_bug+0x146/0x1b0 kernel/locking/lockdep.c:2002 check_noncircular+0x2cc/0x390 kernel/locking/lockdep.c:2123 check_prev_add kernel/locking/lockdep.c:2988 [inline] check_prevs_add kernel/locking/lockdep.c:3113 [inline] validate_chain+0x1695/0x58f0 kernel/locking/lockdep.c:3729 __lock_acquire+0x12fd/0x20d0 kernel/locking/lockdep.c:4955 lock_acquire+0x197/0x480 kernel/locking/lockdep.c:5566 down_write+0x93/0x180 kernel/locking/rwsem.c:1564 inode_lock include/linux/fs.h:782 [inline] ext4_xattr_inode_iget+0x42a/0x5c0 fs/ext4/xattr.c:425 ext4_xattr_inode_get+0x138/0x410 fs/ext4/xattr.c:485 ext4_xattr_move_to_block fs/ext4/xattr.c:2580 [inline] ext4_xattr_make_inode_space fs/ext4/xattr.c:2682 [inline] ext4_expand_extra_isize_ea+0xe70/0x1bb0 fs/ext4/xattr.c:2774 __ext4_expand_extra_isize+0x304/0x3f0 fs/ext4/inode.c:5898 ext4_try_to_expand_extra_isize fs/ext4/inode.c:5941 [inline] __ext4_mark_inode_dirty+0x591/0x810 fs/ext4/inode.c:6018 ext4_setattr+0x1400/0x19c0 fs/ext4/inode.c:5562 notify_change+0xbb6/0xe60 fs/attr.c:435 do_truncate+0x1de/0x2c0 fs/open.c:64 handle_truncate fs/namei.c:2970 [inline] do_open fs/namei.c:3311 [inline] path_openat+0x29f3/0x3290 fs/namei.c:3425 do_filp_open+0x20b/0x450 fs/namei.c:3452 do_sys_openat2+0x124/0x460 fs/open.c:1207 do_sys_open fs/open.c:1223 [inline] __do_sys_open fs/open.c:1231 [inline] __se_sys_open fs/open.c:1227 [inline] __x64_sys_open+0x221/0x270 fs/open.c:1227 do_syscall_64+0x6d/0xa0 arch/x86/entry/common.c:62 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7f0cde4ea229 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 21 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd81d1c978 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 RAX: ffffffffffffffda RBX: 0030656c69662f30 RCX: 00007f0cde4ea229 RDX: 0000000000000089 RSI: 00000000000a0a00 RDI: 00000000200001c0 RBP: 2f30656c69662f2e R08: 0000000000208000 R09: 0000000000208000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd81d1c9c0 R13: 00007ffd81d1ca00 R14: 0000000000080000 R15: 0000000000000003 EXT4-fs error (device loop0): ext4_expand_extra_isize_ea:2730: inode #13: comm syz-executor543: corrupted in-inode xattr Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240801143827.19135-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 23:52:02 -04:00
Thorsten Blum	01cdf03b13	ext4: annotate struct ext4_xattr_inode_array with __counted_by() Add the __counted_by compiler attribute to the flexible array member inodes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Remove the now obsolete comment on the count field. In ext4_expand_inode_array(), use struct_size() instead of offsetof() and remove the local variable count. Increment the count field before adding a new inode to the inodes array. Compile-tested only. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://patch.msgid.link/20240730220200.410939-3-thorsten.blum@toblux.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 23:40:06 -04:00
Luis Henriques (SUSE)	ebc4b2c1ac	ext4: fix incorrect tid assumption in ext4_fc_mark_ineligible() Function jbd2_journal_shrink_checkpoint_list() assumes that '0' is not a valid value for transaction IDs, which is incorrect. Furthermore, the sbi->s_fc_ineligible_tid handling also makes the same assumption by being initialised to '0'. Fortunately, the sb flag EXT4_MF_FC_INELIGIBLE can be used to check whether sbi->s_fc_ineligible_tid has been previously set instead of comparing it with '0'. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-5-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 23:40:00 -04:00
Luis Henriques (SUSE)	dd589b0f14	ext4: fix incorrect tid assumption in ext4_wait_for_tail_page_commit() Function ext4_wait_for_tail_page_commit() assumes that '0' is not a valid value for transaction IDs, which is incorrect. Don't assume that and invoke jbd2_log_wait_commit() if the journal had a committing transaction instead. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240724161119.13448-2-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 23:39:35 -04:00
Matthew Wilcox (Oracle)	3e3a693551	ext4: tidy the BH loop in mext_page_mkuptodate() This for loop is somewhat hard to read; turn it into a normal BH do-while loop. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20240718223005.568869-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 21:47:04 -04:00
Matthew Wilcox (Oracle)	a40759fb16	ext4: remove array of buffer_heads from mext_page_mkuptodate() Iterate the folio's list of buffer_heads twice instead of keeping an array of pointers. This solves a too-large-array-for-stack problem on architectures with a ridiculoously large PAGE_SIZE and prepares ext4 to support larger folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20240718223005.568869-3-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 21:47:03 -04:00
Matthew Wilcox (Oracle)	368a83cebb	ext4: pipeline buffer reads in mext_page_mkuptodate() Instead of synchronously reading one buffer at a time, submit reads as we walk the buffers in the first loop, then wait for them in the second loop. This should be significantly more efficient, particularly on HDDs, but I have not measured. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20240718223005.568869-2-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 21:47:03 -04:00
Matthew Wilcox (Oracle)	e37c9e173b	ext4: reduce stack usage in ext4_mpage_readpages() This function is very similar to do_mpage_readpage() and a similar approach to that taken in commit `12ac5a65cb` will work. As in do_mpage_readpage(), we only use this array for checking block contiguity and we can do that more efficiently with a little arithmetic. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://patch.msgid.link/20240718223005.568869-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 21:47:03 -04:00
Luis Henriques (SUSE)	23dfdb5658	ext4: fix access to uninitialised lock in fc replay path The following kernel trace can be triggered with fstest generic/629 when executed against a filesystem with fast-commit feature enabled: INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. CPU: 0 PID: 866 Comm: mount Not tainted 6.10.0+ #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x66/0x90 register_lock_class+0x759/0x7d0 __lock_acquire+0x85/0x2630 ? __find_get_block+0xb4/0x380 lock_acquire+0xd1/0x2d0 ? __ext4_journal_get_write_access+0xd5/0x160 _raw_spin_lock+0x33/0x40 ? __ext4_journal_get_write_access+0xd5/0x160 __ext4_journal_get_write_access+0xd5/0x160 ext4_reserve_inode_write+0x61/0xb0 __ext4_mark_inode_dirty+0x79/0x270 ? ext4_ext_replay_set_iblocks+0x2f8/0x450 ext4_ext_replay_set_iblocks+0x330/0x450 ext4_fc_replay+0x14c8/0x1540 ? jread+0x88/0x2e0 ? rcu_is_watching+0x11/0x40 do_one_pass+0x447/0xd00 jbd2_journal_recover+0x139/0x1b0 jbd2_journal_load+0x96/0x390 ext4_load_and_init_journal+0x253/0xd40 ext4_fill_super+0x2cc6/0x3180 ... In the replay path there's an attempt to lock sbi->s_bdev_wb_lock in function ext4_check_bdev_write_error(). Unfortunately, at this point this spinlock has not been initialized yet. Moving it's initialization to an earlier point in __ext4_fill_super() fixes this splat. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Link: https://patch.msgid.link/20240718094356.7863-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 21:21:20 -04:00
Luis Henriques (SUSE)	6db3c1575a	ext4: fix fast commit inode enqueueing during a full journal commit When a full journal commit is on-going, any fast commit has to be enqueued into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing is done only once, i.e. if an inode is already queued in a previous fast commit entry it won't be enqueued again. However, if a full commit starts _after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to be done into FC_Q_STAGING. And this is not being done in function ext4_fc_track_template(). This patch fixes the issue by re-enqueuing an inode into the STAGING queue during the fast commit clean-up callback when doing a full commit. However, to prevent a race with a fast-commit, the clean-up callback has to be called with the journal locked. This bug was found using fstest generic/047. This test creates several 32k bytes files, sync'ing each of them after it's creation, and then shutting down the filesystem. Some data may be loss in this operation; for example a file may have it's size truncated to zero. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240717172220.14201-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 21:21:10 -04:00
Xiaxi Shen	0ce160c5bd	ext4: fix timer use-after-free on failed mount Syzbot has found an ODEBUG bug in ext4_fill_super The del_timer_sync function cancels the s_err_report timer, which reminds about filesystem errors daily. We should guarantee the timer is no longer active before kfree(sbi). When filesystem mounting fails, the flow goes to failed_mount3, where an error occurs when ext4_stop_mmpd is called, causing a read I/O failure. This triggers the ext4_handle_error function that ultimately re-arms the timer, leaving the s_err_report timer active before kfree(sbi) is called. Fix the issue by canceling the s_err_report timer after calling ext4_stop_mmpd. Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com> Reported-and-tested-by: syzbot+59e0101c430934bc9a36@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=59e0101c430934bc9a36 Link: https://patch.msgid.link/20240715043336.98097-1-shenxiaxi26@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 21:20:57 -04:00
Markus Elfring	bd8daa7717	ext4: use seq_putc() in two functions Single characters (line breaks) should be put into a sequence. Thus use the corresponding function “seq_putc”. This issue was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Link: https://patch.msgid.link/076974ab-4da3-4176-89dc-0514e020c276@web.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-26 21:20:57 -04:00
Edward Adam Davis	1a00a393d6	ext4: no need to continue when the number of entries is 1 Fixes: `ac27a0ec11` ("[PATCH] ext4: initial copy of files from ext3") Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-and-tested-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Link: https://patch.msgid.link/tencent_BE7AEE6C7C2D216CB8949CE8E6EE7ECC2C0A@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 21:20:48 -04:00
yao.ly	70dd7b573a	ext4: correct encrypted dentry name hash when not casefolded EXT4_DIRENT_HASH and EXT4_DIRENT_MINOR_HASH will access struct ext4_dir_entry_hash followed ext4_dir_entry. But there is no ext4_dir_entry_hash followed when inode is encrypted and not casefolded Signed-off-by: yao.ly <yao.ly@linux.alibaba.com> Link: https://patch.msgid.link/1719816219-128287-1-git-send-email-yao.ly@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org	2024-08-26 21:20:25 -04:00
Kemeng Shi	5071010ac3	ext4: correct comment of h_checksum Checksum of xattr block is always crc32c(uuid+blknum+xattrblock), see ext4_xattr_block_csum_set for detail. Remove incorrect comment that "id = inum if refcount=1". Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240606125508.1459893-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 22:56:30 -04:00
Kemeng Shi	4b14737ce9	ext4: correct comment of ext4_xattr_block_cache_insert There is no return value from ext4_xattr_block_cache_insert, just correct it's comment about return value. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240606125508.1459893-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 22:56:30 -04:00
Kemeng Shi	6ceeb2d8fd	ext4: correct comment of ext4_xattr_cmp The ext4_xattr_cmp never returns negative error number. Correct possible return value in ext4_xattr_cmp's comment. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://patch.msgid.link/20240606125508.1459893-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 22:56:30 -04:00
carrion bent	f67fbacd92	ext4: fix macro definition error of EXT4_DIRENT_HASH and EXT4_DIRENT_MINOR_HASH The macro parameter 'entry' of EXT4_DIRENT_HASH and EXT4_DIRENT_MINOR_HASH was not used, but rather the variable 'de' was directly used, which may be a local variable inside a function that calls the macros. Fortunately, all callers have passed in 'de' so far, so this bug didn't have an effect. Signed-off-by: carrion bent <carrionbent@linux.alibaba.com> Link: https://patch.msgid.link/1717652596-58760-1-git-send-email-carrionbent@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 21:38:59 -04:00
Lizhi Xu	985b67cd86	ext4: filesystems without casefold feature cannot be mounted with siphash When mounting the ext4 filesystem, if the default hash version is set to DX_HASH_SIPHASH but the casefold feature is not set, exit the mounting. Reported-by: syzbot+340581ba9dceb7e06fb3@syzkaller.appspotmail.com Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Link: https://patch.msgid.link/20240605012335.44086-1-lizhi.xu@windriver.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 21:37:18 -04:00
Junchao Sun	a3c3eecc7c	ext4: adjust the layout of the ext4_inode_info structure to save memory Using pahole, we can see that there are some padding holes in the current ext4_inode_info structure. Adjusting the layout of ext4_inode_info can reduce these holes, resulting in the size of the structure decreasing from 2424 bytes to 2408 bytes. Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240603131524.324224-1-sunjunchao2870@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-08-20 21:37:00 -04:00
Al Viro	1da91ea87a	introduce fd_file(), convert all accessors to it. For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-12 22:00:43 -04:00
Matthew Wilcox (Oracle)	9f04609f74	buffer: Convert __block_write_begin() to take a folio Almost all callers have a folio now, so change __block_write_begin() to take a folio and remove a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-07 11:33:36 +02:00
Matthew Wilcox (Oracle)	1da86618bd	fs: Convert aops->write_begin to take a folio Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-07 11:33:21 +02:00
Matthew Wilcox (Oracle)	a225800f32	fs: Convert aops->write_end to take a folio Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-07 11:32:02 +02:00
Matthew Wilcox (Oracle)	97edbc02b2	buffer: Convert block_write_end() to take a folio All callers now have a folio, so pass it in instead of converting from a folio to a page and back to a folio again. Saves a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-07 11:31:59 +02:00
Linus Torvalds	51ed42a8a1	Many cleanups and bug fixes in ext4, especially for the fast commit feature. Also some performance improvements; in particular, improving IOPS and throughput on fast devices running Async Direct I/O by up to 20% by optimizing jbd2_transaction_committed(). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmaYiqsACgkQ8vlZVpUN gaOWpQf/d6Y9WGyjeC1jOc+vIBxLgL+X0kbzYkkjGTSIZ7mZJS9X4NMMEtqayJ4f 1zGobcGENc05l4LVxf3uMbDj1aGlHeI9X4GLGaP5s5NcaAl4HKjQ3aFs3MuiJHPj Ol2CebXJx+NKt1lkD8PSPGgaTb5zg+SeZifI+OZ1RpkcKmGnkSNa5NkUNAaBh6dl 5LLXTc2p9NcCwAwDAQSiAJCV35bAZpcp6fwLLaPQ6Eok9HxGcJuYXW2Fict4rbtV mXeogXVIo2bkMcfh6tDchDBrFvORYIA7uBVmaG1LgAMrtEnYxnxnEntD0h6j/bzF Fl4jjQfd8o2uYto/4eo+iY6Z0haxyQ== =rcOo -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many cleanups and bug fixes in ext4, especially for the fast commit feature. Also some performance improvements; in particular, improving IOPS and throughput on fast devices running Async Direct I/O by up to 20% by optimizing jbd2_transaction_committed()" * tag 'ext4_for_linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: make sure the first directory block is not a hole ext4: check dot and dotdot of dx_root before making dir indexed ext4: sanity check for NULL pointer after ext4_force_shutdown jbd2: increase maximum transaction size jbd2: drop pointless shrinker batch initialization jbd2: avoid infinite transaction commit loop jbd2: precompute number of transaction descriptor blocks jbd2: make jbd2_journal_get_max_txn_bufs() internal jbd2: avoid mount failed when commit block is partial submitted ext4: avoid writing unitialized memory to disk in EA inodes ext4: don't track ranges in fast_commit if inode has inlined data ext4: fix possible tid_t sequence overflows ext4: use ext4_update_inode_fsync_trans() helper in inode creation ext4: add missing MODULE_DESCRIPTION() jbd2: add missing MODULE_DESCRIPTION() ext4: use memtostr_pad() for s_volume_name jbd2: speed up jbd2_transaction_committed() ext4: make ext4_da_map_blocks() buffer_head unaware ext4: make ext4_insert_delayed_block() insert multi-blocks ext4: factor out a helper to check the cluster allocation state ...	2024-07-18 17:03:42 -07:00
Linus Torvalds	b8fc1bd73a	vfs-6.11.mount.api -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZpEGjAAKCRCRxhvAZXjc okXfAP4tFUYszUsSqYdsgy9UvXw3Dr5zOIzQmN++NdjGkbU5fgEAs2ystqEfJgr3 v7XvGbu65CvL4/slNhBZOU4yekGx5Qc= =C4QD -----END PGP SIGNATURE----- Merge tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount API updates from Christian Brauner: - Add a generic helper to parse uid and gid mount options. Currently we open-code the same logic in various filesystems which is error prone, especially since the verification of uid and gid mount options is a sensitive operation in the face of idmappings. Add a generic helper and convert all filesystems over to it. Make sure that filesystems that are mountable in unprivileged containers verify that the specified uid and gid can be represented in the owning namespace of the filesystem. - Convert hostfs to the new mount api. * tag 'vfs-6.11.mount.api' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fuse: Convert to new uid/gid option parsing helpers fuse: verify {g,u}id mount options correctly fat: Convert to new uid/gid option parsing helpers fat: Convert to new mount api fat: move debug into fat_mount_options vboxsf: Convert to new uid/gid option parsing helpers tracefs: Convert to new uid/gid option parsing helpers smb: client: Convert to new uid/gid option parsing helpers tmpfs: Convert to new uid/gid option parsing helpers ntfs3: Convert to new uid/gid option parsing helpers isofs: Convert to new uid/gid option parsing helpers hugetlbfs: Convert to new uid/gid option parsing helpers ext4: Convert to new uid/gid option parsing helpers exfat: Convert to new uid/gid option parsing helpers efivarfs: Convert to new uid/gid option parsing helpers debugfs: Convert to new uid/gid option parsing helpers autofs: Convert to new uid/gid option parsing helpers fs_parse: add uid & gid option option parsing helpers hostfs: Add const qualifier to host_root in hostfs_fill_super() hostfs: convert hostfs to use the new mount API	2024-07-15 11:31:32 -07:00
Baokun Li	f9ca51596b	ext4: make sure the first directory block is not a hole The syzbot constructs a directory that has no dirblock but is non-inline, i.e. the first directory block is a hole. And no errors are reported when creating files in this directory in the following flow. ext4_mknod ... ext4_add_entry // Read block 0 ext4_read_dirblock(dir, block, DIRENT) bh = ext4_bread(NULL, inode, block, 0) if (!bh && (type == INDEX \|\| type == DIRENT_HTREE)) // The first directory block is a hole // But type == DIRENT, so no error is reported. After that, we get a directory block without '.' and '..' but with a valid dentry. This may cause some code that relies on dot or dotdot (such as make_indexed_dir()) to crash. Therefore when ext4_read_dirblock() finds that the first directory block is a hole report that the filesystem is corrupted and return an error to avoid loading corrupted data from disk causing something bad. Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0 Fixes: `4e19d6b65f` ("ext4: allow directory holes") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240702132349.2600605-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-10 23:25:12 -04:00
Baokun Li	50ea741def	ext4: check dot and dotdot of dx_root before making dir indexed Syzbot reports a issue as follows: ============================================ BUG: unable to handle page fault for address: ffffed11022e24fe PGD 23ffee067 P4D 23ffee067 PUD 0 Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 0 PID: 5079 Comm: syz-executor306 Not tainted 6.10.0-rc5-g55027e689933 #0 Call Trace: <TASK> make_indexed_dir+0xdaf/0x13c0 fs/ext4/namei.c:2341 ext4_add_entry+0x222a/0x25d0 fs/ext4/namei.c:2451 ext4_rename fs/ext4/namei.c:3936 [inline] ext4_rename2+0x26e5/0x4370 fs/ext4/namei.c:4214 [...] ============================================ The immediate cause of this problem is that there is only one valid dentry for the block to be split during do_split, so split==0 results in out of bounds accesses to the map triggering the issue. do_split unsigned split dx_make_map count = 1 split = count/2 = 0; continued = hash2 == map[split - 1].hash; ---> map[4294967295] The maximum length of a filename is 255 and the minimum block size is 1024, so it is always guaranteed that the number of entries is greater than or equal to 2 when do_split() is called. But syzbot's crafted image has no dot and dotdot in dir, and the dentry distribution in dirblock is as follows: bus dentry1 hole dentry2 free \|xx--\|xx-------------\|...............\|xx-------------\|...............\| 0 12 (8+248)=256 268 256 524 (8+256)=264 788 236 1024 So when renaming dentry1 increases its name_len length by 1, neither hole nor free is sufficient to hold the new dentry, and make_indexed_dir() is called. In make_indexed_dir() it is assumed that the first two entries of the dirblock must be dot and dotdot, so bus and dentry1 are left in dx_root because they are treated as dot and dotdot, and only dentry2 is moved to the new leaf block. That's why count is equal to 1. Therefore add the ext4_check_dx_root() helper function to add more sanity checks to dot and dotdot before starting the conversion to avoid the above issue. Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0 Fixes: `ac27a0ec11` ("[PATCH] ext4: initial copy of files from ext3") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240702132349.2600605-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-10 23:25:12 -04:00
Wojciech Gładysz	83f4414b8f	ext4: sanity check for NULL pointer after ext4_force_shutdown Test case: 2 threads write short inline data to a file. In ext4_page_mkwrite the resulting inline data is converted. Handling ext4_grp_locked_error with description "block bitmap and bg descriptor inconsistent: X vs Y free clusters" calls ext4_force_shutdown. The conversion clears EXT4_STATE_MAY_INLINE_DATA but fails for ext4_destroy_inline_data_nolock and ext4_mark_iloc_dirty due to ext4_forced_shutdown. The restoration of inline data fails for the same reason not setting EXT4_STATE_MAY_INLINE_DATA. Without the flag set a regular process path in ext4_da_write_end follows trying to dereference page folio private pointer that has not been set. The fix calls early return with -EIO error shall the pointer to private be NULL. Sample crash report: Unable to handle kernel paging request at virtual address dfff800000000004 KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027] Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [dfff800000000004] address between user and kernel address ranges Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 20274 Comm: syz-executor185 Not tainted 6.9.0-rc7-syzkaller-gfda5695d692c #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 lr : __block_commit_write+0x3c/0x2b0 fs/buffer.c:2160 sp : ffff8000a1957600 x29: ffff8000a1957610 x28: dfff800000000000 x27: ffff0000e30e34b0 x26: 0000000000000000 x25: dfff800000000000 x24: dfff800000000000 x23: fffffdffc397c9e0 x22: 0000000000000020 x21: 0000000000000020 x20: 0000000000000040 x19: fffffdffc397c9c0 x18: 1fffe000367bd196 x17: ffff80008eead000 x16: ffff80008ae89e3c x15: 00000000200000c0 x14: 1fffe0001cbe4e04 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : 0000000000000000 x8 : 0000000000000004 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffdffc397c9c0 x4 : 0000000000000020 x3 : 0000000000000020 x2 : 0000000000000040 x1 : 0000000000000020 x0 : fffffdffc397c9c0 Call trace: __block_commit_write+0x64/0x2b0 fs/buffer.c:2167 block_write_end+0xb4/0x104 fs/buffer.c:2253 ext4_da_do_write_end fs/ext4/inode.c:2955 [inline] ext4_da_write_end+0x2c4/0xa40 fs/ext4/inode.c:3028 generic_perform_write+0x394/0x588 mm/filemap.c:3985 ext4_buffered_write_iter+0x2c0/0x4ec fs/ext4/file.c:299 ext4_file_write_iter+0x188/0x1780 call_write_iter include/linux/fs.h:2110 [inline] new_sync_write fs/read_write.c:497 [inline] vfs_write+0x968/0xc3c fs/read_write.c:590 ksys_write+0x15c/0x26c fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __arm64_sys_write+0x7c/0x90 fs/read_write.c:652 __invoke_syscall arch/arm64/kernel/syscall.c:34 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:48 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:133 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:152 el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Code: 97f85911 f94002da 91008356 d343fec8 (38796908) ---[ end trace 0000000000000000 ]--- ---------------- Code disassembly (best guess): 0: 97f85911 bl 0xffffffffffe16444 4: f94002da ldr x26, [x22] 8: 91008356 add x22, x26, #0x20 c: d343fec8 lsr x8, x22, #3 * 10: 38796908 ldrb w8, [x8, x25] <-- trapping instruction Reported-by: syzbot+18df508cf00a0598d9a6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18df508cf00a0598d9a6 Link: https://lore.kernel.org/all/000000000000f19a1406109eb5c5@google.com/T/ Signed-off-by: Wojciech Gładysz <wojciech.gladysz@infogain.com> Link: https://patch.msgid.link/20240703070112.10235-1-wojciech.gladysz@infogain.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-08 23:59:37 -04:00
Jan Kara	65121eff3e	ext4: avoid writing unitialized memory to disk in EA inodes If the extended attribute size is not a multiple of block size, the last block in the EA inode will have uninitialized tail which will get written to disk. We will never expose the data to userspace but still this is not a good practice so just zero out the tail of the block as it isn't going to cause a noticeable performance overhead. Fixes: `e50e5129f3` ("ext4: xattr-in-inode support") Reported-by: syzbot+9c1fe13fcb51574b249b@syzkaller.appspotmail.com Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240613150234.25176-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-08 23:59:37 -04:00
Luis Henriques (SUSE)	7882b0187b	ext4: don't track ranges in fast_commit if inode has inlined data When fast-commit needs to track ranges, it has to handle inodes that have inlined data in a different way because ext4_fc_write_inode_data(), in the actual commit path, will attempt to map the required blocks for the range. However, inodes that have inlined data will have it's data stored in inode->i_block and, eventually, in the extended attribute space. Unfortunately, because fast commit doesn't currently support extended attributes, the solution is to mark this commit as ineligible. Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1039883 Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Tested-by: Ben Hutchings <benh@debian.org> Fixes: `9725958bb7` ("ext4: fast commit may miss tracking unwritten range during ftruncate") Link: https://patch.msgid.link/20240618144312.17786-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-08 23:59:37 -04:00
Luis Henriques (SUSE)	63469662cc	ext4: fix possible tid_t sequence overflows In the fast commit code there are a few places where tid_t variables are being compared without taking into account the fact that these sequence numbers may wrap. Fix this issue by using the helper functions tid_gt() and tid_geq(). Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://patch.msgid.link/20240529092030.9557-3-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-08 23:59:35 -04:00
Luis Henriques (SUSE)	2d4d6bda0f	ext4: use ext4_update_inode_fsync_trans() helper in inode creation Call helper function ext4_update_inode_fsync_trans() instead of open coding it in __ext4_new_inode(). This helper checks both that the handle is valid and that it hasn't been aborted due to some fatal error in the journalling layer, using is_handle_aborted(). Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Link: https://patch.msgid.link/20240527161447.21434-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-05 16:48:54 -04:00
Jeff Johnson	7378e8991a	ext4: add missing MODULE_DESCRIPTION() Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/ext4/ext4-inode-test.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240527-md-fs-ext4-v1-1-07aad5936bb1@quicinc.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-05 16:07:24 -04:00
Kees Cook	be27cd6446	ext4: use memtostr_pad() for s_volume_name As with the other strings in struct ext4_super_block, s_volume_name is not NUL terminated. The other strings were marked in commit `072ebb3bff` ("ext4: add nonstring annotations to ext4.h"). Using strscpy() isn't the right replacement for strncpy(); it should use memtostr_pad() instead. Reported-by: syzbot+50835f73143cc2905b9e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000019f4c00619192c05@google.com/ Fixes: `744a56389f` ("ext4: replace deprecated strncpy with alternatives") Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://patch.msgid.link/20240523225408.work.904-kees@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-07-05 13:14:51 -04:00
Eric Sandeen	6b5732b5ca	ext4: Convert to new uid/gid option parsing helpers Convert to new uid/gid option parsing helpers Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/a84be40d-5110-4dac-83b1-0ea8e043f0fd@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-02 06:21:18 +02:00
Zhang Yi	8262fe9a90	ext4: make ext4_da_map_blocks() buffer_head unaware After calling the ext4_da_map_blocks(), a delalloc extent state could be identified through the EXT4_MAP_DELAYED flag in map. So factor out buffer_head related handles in ext4_da_map_blocks(), make this function buffer_head unaware and becomes a common helper, and also update the stale function commtents, preparing for the iomap da write path in the future. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-11-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:50 -04:00
Zhang Yi	1850d76c1b	ext4: make ext4_insert_delayed_block() insert multi-blocks Rename ext4_insert_delayed_block() to ext4_insert_delayed_blocks(), pass length parameter to make it insert multiple delalloc blocks at a time. For non-bigalloc case, just reserve len blocks and insert delalloc extent. For bigalloc case, we can ensure that the clusters in the middle of a extent must be unallocated, we only need to check whether the start and end clusters are delayed/allocated. We should subtract the space for the start and/or end block(s) if they are allocated. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-10-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:50 -04:00
Zhang Yi	49bf6ab4d3	ext4: factor out a helper to check the cluster allocation state Factor out a common helper ext4_clu_alloc_state(), check whether the cluster containing a delalloc block to be added has been allocated or has delalloc reservation, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-9-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:50 -04:00
Zhang Yi	0d66b23d79	ext4: make ext4_da_reserve_space() reserve multi-clusters Add 'nr_resv' parameter to ext4_da_reserve_space(), which indicates the number of clusters wants to reserve, make it reserve multiple clusters at a time. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:50 -04:00
Zhang Yi	12eba993b9	ext4: make ext4_es_insert_delayed_block() insert multi-blocks Rename ext4_es_insert_delayed_block() to ext4_es_insert_delayed_extent() and pass length parameter to make it insert multiple delalloc blocks at a time. For the case of bigalloc, split the allocated parameter to lclu_allocated and end_allocated. lclu_allocated indicates the allocation state of the cluster which is containing the lblk, end_allocated indicates the allocation state of the extent end, clusters in the middle of delay allocated extent must be unallocated. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Zhang Yi	bb6b18057f	ext4: drop iblock parameter The start block of the delalloc extent to be inserted is equal to map->m_lblk, just drop the duplicate iblock input parameter. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20240517124005.347221-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Zhang Yi	14a210c110	ext4: trim delalloc extent In ext4_da_map_blocks(), we could find four kind of extents in the extent status tree: hole, unwritten, written and delayed extent. Now we only trim the map len if we found an unwritten extent or a written extent. This is okay now since map->m_len is always set to one and we always insert one delayed block at a time. But this will become isn't okay for other two cases if ext4_insert_delayed_block() and ext4_da_map_blocks() support inserting multiple map->len blocks later. 1. If we found a hole in the extent status tree which es->es_len is shorter than the length we want to write, we should trim the map->m_len to prevent adding extra delay more blocks than we expected. For example, assume we write data [A, C) to a file that contains a hole extent [A, B) and a written extent [B, D) in the cache. A B C D before da write: ...hhhhhh\|wwwwww.... Then we will get extent [A, B), we should trim map->m_len to B-A before inserting new delalloc blocks, if not, the range [B, C) will be duplicated. 2. If we found a delayed extent in the extent status tree which es->es_len is shorter than the length we want to write, we should trim the map->m_len to es->es_len and return directly since the front part of this map has been delayed, we can't insert the delalloc extent that contains the latter part in this round, we should return the delayed length and the caller should increase the position and call ext4_da_map_blocks() again. For example, assume we write data [A, C) to a file that contains a delayed extent [A, B) in the cache. A B C before da write: ...dddddd\|hhh.... Then we will get delayed extent [A, B), we should also trim map->m_len to B-A and return, if not, we will incorrectly assume that the write is complete and won't insert [B, C). So we need to always trim the map->m_len if the found es->es_len in the extent status tree is shorter than the map->m_len, prearing for inserting a extent with multiple delalloc blocks. This patch only does a pre-fix, the handle is crude and ext4_da_map_blocks() deserve a cleanup, we will do that later. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Zhang Yi	b37c907073	ext4: warn if delalloc counters are not zero on inactive The per-inode i_reserved_data_blocks count the reserved delalloc blocks in a regular file, it should be zero when destroying the file. The per-fs s_dirtyclusters_counter count all reserved delalloc blocks in a filesystem, it also should be zero when umounting the filesystem. Now we have only an error message if the i_reserved_data_blocks is not zero, which is unable to be simply captured, so add WARN_ON_ONCE to make it more visable. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Zhang Yi	0ea6560abb	ext4: check the extent status again before inserting delalloc block ext4_da_map_blocks looks up for any extent entry in the extent status tree (w/o i_data_sem) and then the looks up for any ondisk extent mapping (with i_data_sem in read mode). If it finds a hole in the extent status tree or if it couldn't find any entry at all, it then takes the i_data_sem in write mode to add a da entry into the extent status tree. This can actually race with page mkwrite & fallocate path. Note that this is ok between 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the folio lock 2. ext4 buffered write path v/s ext4 fallocate because of the inode lock. But this can race between ext4_page_mkwrite() & ext4 fallocate path ext4_page_mkwrite() ext4_fallocate() block_page_mkwrite() ext4_da_map_blocks() //find hole in extent status tree ext4_alloc_file_blocks() ext4_map_blocks() //allocate block and unwritten extent ext4_insert_delayed_block() ext4_da_reserve_space() //reserve one more block ext4_es_insert_delayed_block() //drop unwritten extent and add delayed extent by mistake Then, the delalloc extent is wrong until writeback and the extra reserved block can't be released any more and it triggers below warning: EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared! Fix the problem by looking up extent status tree again while the i_data_sem is held in write mode. If it still can't find any entry, then we insert a new da entry into the extent status tree. Cc: stable@vger.kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20240517124005.347221-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Zhang Yi	8e4e5cdf2f	ext4: factor out a common helper to query extent map Factor out a new common helper ext4_map_query_blocks() from the ext4_da_map_blocks(), it query and return the extent map status on the inode's extent path, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20240517124005.347221-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 18:04:49 -04:00
Luis Henriques (SUSE)	907c3fe532	ext4: fix infinite loop when replaying fast_commit When doing fast_commit replay an infinite loop may occur due to an uninitialized extent_status struct. ext4_ext_determine_insert_hole() does not detect the replay and calls ext4_es_find_extent_range(), which will return immediately without initializing the 'es' variable. Because 'es' contains garbage, an integer overflow may happen causing an infinite loop in this function, easily reproducible using fstest generic/039. This commit fixes this issue by unconditionally initializing the structure in function ext4_es_find_extent_range(). Thanks to Zhang Yi, for figuring out the real problem! Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20240515082857.32730-1-luis.henriques@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 10:26:28 -04:00
Xiaxi Shen	8dc9c3da79	ext4: fix uninitialized variable in ext4_inlinedir_to_tree Syzbot has found an uninit-value bug in ext4_inlinedir_to_tree This error happens because ext4_inlinedir_to_tree does not handle the case when ext4fs_dirhash returns an error This can be avoided by checking the return value of ext4fs_dirhash and propagating the error, similar to how it's done with ext4_htree_store_dirent Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com> Reported-and-tested-by: syzbot+eaba5abe296837a640c0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=eaba5abe296837a640c0 Link: https://patch.msgid.link/20240501033017.220000-1-shenxiaxi26@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 10:08:36 -04:00
Li zeming	57802c73bf	ext4: block_validity: Remove unnecessary ‘NULL’ values from new_node new_node is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20240402022300.25858-1-zeming@nfschina.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-06-27 09:34:00 -04:00
Gabriel Krisman Bertazi	d98c822232	ext4: Move CONFIG_UNICODE defguards into the code flow Instead of a bunch of ifdefs, make the unicode built checks part of the code flow where possible, as requested by Torvalds. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [eugen.hristev@collabora.com: port to 6.10-rc1] Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Link: https://lore.kernel.org/r/20240606073353.47130-7-eugen.hristev@collabora.com Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-07 17:00:45 +02:00
Gabriel Krisman Bertazi	d76b92f61f	ext4: Reuse generic_ci_match for ci comparisons Instead of reimplementing ext4_match_ci, use the new libfs helper. It also adds a comment explaining why fname->cf_name.name must be checked prior to the encryption hash optimization, because that tripped me before. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Link: https://lore.kernel.org/r/20240606073353.47130-5-eugen.hristev@collabora.com Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-07 17:00:44 +02:00
Gabriel Krisman Bertazi	f776f02a2c	ext4: Simplify the handling of cached casefolded names Keeping it as qstr avoids the unnecessary conversion in ext4_match Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> [eugen.hristev@collabora.com: port to 6.10-rc1] Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Link: https://lore.kernel.org/r/20240606073353.47130-2-eugen.hristev@collabora.com Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-06-07 17:00:43 +02:00
Linus Torvalds	38da32ee70	bd_inode series Replacement of bdev->bd_inode with sane(r) set of primitives. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs= =6LRp -----END PGP SIGNATURE----- Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number	2024-05-21 09:51:42 -07:00
Linus Torvalds	5ad8b6ad9a	getting rid of bogus set_blocksize() uses, switching it to struct file * and verifying that caller has device opened exclusively. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt KubxHVFsfW+eu6ASeaoMRB83w5OIzwk= =Liix -----END PGP SIGNATURE----- Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file ' and verifies that the caller has the device opened exclusively" tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()	2024-05-21 08:34:51 -07:00
Linus Torvalds	7991c92f4c	Ext4 patches for the 6.10-rc1 merge window: - more folio conversion patches - add support for FS_IOC_GETFSSYSFSPATH - mballoc cleaups and add more kunit tests - sysfs cleanups and bug fixes - miscellaneous bug fixes and cleanups -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmZIMBAACgkQ8vlZVpUN gaNvhQf9GAdBxpCLc3fc9mW8oP+okAqQ2etpz7Up5PRjX62P8o89QOXBUHSAZxat qOpKu5NaUBdz5mfdg/ptbCRdbsLxQTY670nSYhmseOCHZR/cw4jX1f+FUMj0VoFm tu/TR285W6A+i7zb1xOgyUsqN8jbQdm4ASmhVjV67oTLs+A6I8loL0wotlAl+K0U g8twZbnNfUaB0jrNyhEzr59bTFUgFMjt8Jv9aH3Oi4rjXGzmS5/xqPCK5Lhl+nCW gxIfRphwKlw9+c9osLYRrtRFrexFsQMCGmz2z9F4m7SplHI3A/SVaKSHaFeW/jQ0 gXP/S91zale6tSeu14gZLY2JqwvI0g== =XA7v -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: - more folio conversion patches - add support for FS_IOC_GETFSSYSFSPATH - mballoc cleaups and add more kunit tests - sysfs cleanups and bug fixes - miscellaneous bug fixes and cleanups * tag 'ext4_for_linus-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: fix error pointer dereference in ext4_mb_load_buddy_gfp() jbd2: add prefix 'jbd2' for 'shrink_type' jbd2: use shrink_type type instead of bool type for __jbd2_journal_clean_checkpoint_list() ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super() ext4: remove calls to to set/clear the folio error flag ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() jbd2: remove redundant assignement to variable err ext4: remove the redundant folio_wait_stable() ext4: fix potential unnitialized variable ext4: convert ac_buddy_page to ac_buddy_folio ext4: convert ac_bitmap_page to ac_bitmap_folio ext4: convert ext4_mb_init_cache() to take a folio ext4: convert bd_buddy_page to bd_buddy_folio ext4: convert bd_bitmap_page to bd_bitmap_folio ext4: open coding repeated check in next_linear_group ext4: use correct criteria name instead stale integer number in comment ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used ext4: keep "prefetch_grp" and "nr" consistent ...	2024-05-18 14:11:54 -07:00
Dan Carpenter	c6a6c9694a	ext4: fix error pointer dereference in ext4_mb_load_buddy_gfp() This code calls folio_put() on an error pointer which will lead to a crash. Check for both error pointers and NULL pointers before calling folio_put(). Fixes: `5eea586b47` ("ext4: convert bd_buddy_page to bd_buddy_folio") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/eaafa1d9-a61c-4af4-9f97-d3ad72c60200@moroto.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-17 11:24:38 -04:00
Linus Torvalds	1b0aabcc9a	vfs-6.10.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L 0+DNepg4R8PZOHH371eSSsLNRCUCkAs= =SVsU -----END PGP SIGNATURE----- Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_ bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...	2024-05-13 11:40:06 -07:00
Baokun Li	b4b4fda34e	ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super() In the following concurrency we will access the uninitialized rs->lock: ext4_fill_super ext4_register_sysfs // sysfs registered msg_ratelimit_interval_ms // Other processes modify rs->interval to // non-zero via msg_ratelimit_interval_ms ext4_orphan_cleanup ext4_msg(sb, KERN_INFO, "Errors on filesystem, " __ext4_msg ___ratelimit(&(EXT4_SB(sb)->s_msg_ratelimit_state) if (!rs->interval) // do nothing if interval is 0 return 1; raw_spin_trylock_irqsave(&rs->lock, flags) raw_spin_trylock(lock) _raw_spin_trylock __raw_spin_trylock spin_acquire(&lock->dep_map, 0, 1, _RET_IP_) lock_acquire __lock_acquire register_lock_class assign_lock_key dump_stack(); ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10); raw_spin_lock_init(&rs->lock); // init rs->lock here and get the following dump_stack: ========================================================= INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. CPU: 12 PID: 753 Comm: mount Tainted: G E 6.7.0-rc6-next-20231222 #504 [...] Call Trace: dump_stack_lvl+0xc5/0x170 dump_stack+0x18/0x30 register_lock_class+0x740/0x7c0 __lock_acquire+0x69/0x13a0 lock_acquire+0x120/0x450 _raw_spin_trylock+0x98/0xd0 ___ratelimit+0xf6/0x220 __ext4_msg+0x7f/0x160 [ext4] ext4_orphan_cleanup+0x665/0x740 [ext4] __ext4_fill_super+0x21ea/0x2b10 [ext4] ext4_fill_super+0x14d/0x360 [ext4] [...] ========================================================= Normally interval is 0 until s_msg_ratelimit_state is initialized, so ___ratelimit() does nothing. But registering sysfs precedes initializing rs->lock, so it is possible to change rs->interval to a non-zero value via the msg_ratelimit_interval_ms interface of sysfs while rs->lock is uninitialized, and then a call to ext4_msg triggers the problem by accessing an uninitialized rs->lock. Therefore register sysfs after all initializations are complete to avoid such problems. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240102133730.1098120-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-09 10:40:07 -04:00
Matthew Wilcox (Oracle)	ea4fd933ab	ext4: remove calls to to set/clear the folio error flag Nobody checks this flag on ext4 folios, stop setting and clearing it. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240420025029.2166544-11-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-09 00:23:51 -04:00
Baokun Li	dc1c4663bc	ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() In ext4_xattr_block_cache_find(), when ext4_sb_bread() returns an error, we will either continue to find the next ea block or return NULL to try to insert a new ea block. But whether ext4_sb_bread() returns -EIO or -ENOMEM, the next operation is most likely to fail with the same error. So propagate the error returned by ext4_sb_bread() to make ext4_xattr_block_set() fail to reduce pointless operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Baokun Li	0c0b4a49d3	ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() Syzbot reports a warning as follows: ============================================ WARNING: CPU: 0 PID: 5075 at fs/mbcache.c:419 mb_cache_destroy+0x224/0x290 Modules linked in: CPU: 0 PID: 5075 Comm: syz-executor199 Not tainted 6.9.0-rc6-gb947cc5bf6d7 RIP: 0010:mb_cache_destroy+0x224/0x290 fs/mbcache.c:419 Call Trace: <TASK> ext4_put_super+0x6d4/0xcd0 fs/ext4/super.c:1375 generic_shutdown_super+0x136/0x2d0 fs/super.c:641 kill_block_super+0x44/0x90 fs/super.c:1675 ext4_kill_sb+0x68/0xa0 fs/ext4/super.c:7327 [...] ============================================ This is because when finding an entry in ext4_xattr_block_cache_find(), if ext4_sb_bread() returns -ENOMEM, the ce's e_refcnt, which has already grown in the __entry_find(), won't be put away, and eventually trigger the above issue in mb_cache_destroy() due to reference count leakage. So call mb_cache_entry_put() on the -ENOMEM error branch as a quick fix. Reported-by: syzbot+dd43bd0f7474512edc47@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=dd43bd0f7474512edc47 Fixes: `fb265c9cb4` ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Zhang Yi	df0b5afc62	ext4: remove the redundant folio_wait_stable() __filemap_get_folio() with FGP_WRITEBEGIN parameter has already wait for stable folio, so remove the redundant folio_wait_stable() in ext4_da_write_begin(), it was left over from the commit `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write") that removed the retry getting page logic. Fixes: `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240419023005.2719050-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:48:04 -04:00
Dan Carpenter	3f4830abd2	ext4: fix potential unnitialized variable Smatch complains "err" can be uninitialized in the caller. fs/ext4/indirect.c:349 ext4_alloc_branch() error: uninitialized symbol 'err'. Set the error to zero on the success path. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/363a4673-0fb8-4adf-b4fb-90a499077276@moroto.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:44:40 -04:00
Matthew Wilcox (Oracle)	c84f1510fb	ext4: convert ac_buddy_page to ac_buddy_folio This just carries around the bd_buddy_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-6-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:17 -04:00
Matthew Wilcox (Oracle)	ccedf35b5d	ext4: convert ac_bitmap_page to ac_bitmap_folio This just carries around the bd_bitmap_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-5-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:14 -04:00
Matthew Wilcox (Oracle)	e1622a0d55	ext4: convert ext4_mb_init_cache() to take a folio All callers now have a folio, so convert this function from operating on a page to operating on a folio. The folio is assumed to be a single page. Signe-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:10 -04:00
Matthew Wilcox (Oracle)	5eea586b47	ext4: convert bd_buddy_page to bd_buddy_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-3-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:07 -04:00
Matthew Wilcox (Oracle)	99b150d84e	ext4: convert bd_bitmap_page to bd_bitmap_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-2-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:37:46 -04:00
Al Viro	224941e837	use ->bd_mapping instead of ->bd_inode->i_mapping Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Yu Kuai	559428a915	ext4: remove block_device_ejected() block_device_ejected() is added by commit `bdfe0cbd74` ("Revert "ext4: remove block_device_ejected"") in 2015. At that time 'bdi->wb' is destroyed synchronized from del_gendisk(), hence if ext4 is still mounted, and then mark_buffer_dirty() will reference destroyed 'wb'. However, such problem doesn't exist anymore: - commit `d03f6cdc1f` ("block: Dynamically allocate and refcount backing_dev_info") switch bdi to use refcounting; - commit `13eec2363e` ("fs: Get proper reference for s_bdi"), will grab additional reference of bdi while mounting, so that 'bdi->wb' will not be destroyed until generic_shutdown_super(). Hence remove this dead function block_device_ejected(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-7-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:28:27 -04:00
Kemeng Shi	da5704eef7	ext4: open coding repeated check in next_linear_group Open coding repeated check in next_linear_group. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:33 -04:00
Kemeng Shi	2caffb6a27	ext4: use correct criteria name instead stale integer number in comment Use correct criteria name instead stale integer number in comment Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	d1a3924e43	ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk In mb_mark_used, we will find free chunk and mark it inuse. For chunk in mid of passed range, we could simply mark whole chunk inuse. For chunk at end of range, we may need to mark a continuous bits at end of part of chunk inuse and keep rest part of chunk free. To only mark a part of chunk inuse, we firstly mark whole chunk inuse and then mark a continuous range at end of chunk free. Function mb_mark_used does several times of "mb_find_buddy; mb_clear_bit; ..." to mark a continuous range free which can be done by simply calling ext4_mb_mark_free_simple which free continuous bits in a more effective way. Just call ext4_mb_mark_free_simple in mb_mark_used to use existing and effective code to free continuous blocks in chunk at end of passed range. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	d0b88624f8	ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used Add test_mb_mark_used_cost to estimate cost of mb_mark_used Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240424061904.987525-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	9c97c34a99	ext4: keep "prefetch_grp" and "nr" consistent Keep "prefetch_grp" and "nr" consistent to avoid to call ext4_mb_prefetch_fini with non-prefetched groups. When we step into next criteria, "prefetch_grp" is set to prefetch start of new criteria while "nr" is number of the prefetched group in previous criteria. If previous criteria and next criteria are both inexpensive (< CR_GOAL_LEN_SLOW) and prefetch_ios reachs sbi->s_mb_prefetch_limit in previous criteria, "prefetch_grp" and "nr" will be inconsistent and may introduce unexpected cost to do ext4_mb_init_group for non-prefetched groups. Reset "nr" to 0 when we reset "prefetch_grp" to goal group to keep them consistent. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	a11adf7be9	ext4: implement filesystem specific alloc_inode in unit test We expect inode with ext4_info_info type as following: mbt_kunit_init mbt_mb_init ext4_mb_init ext4_mb_init_backend sbi->s_buddy_cache = new_inode(sb); EXT4_I(sbi->s_buddy_cache)->i_disksize = 0; Implement alloc_inode ionde with ext4_inode_info type to avoid out-of-bounds write. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322165518.8147-1-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:04:35 -04:00
Jan Kara	0a46ef2347	ext4: do not create EA inode under buffer lock ext4_xattr_set_entry() creates new EA inodes while holding buffer lock on the external xattr block. This is problematic as it nests all the allocation locking (which acquires locks on other buffers) under the buffer lock. This can even deadlock when the filesystem is corrupted and e.g. quota file is setup to contain xattr block as data block. Move the allocation of EA inode out of ext4_xattr_set_entry() into the callers. Reported-by: syzbot+a43d4f48b8397d0e41a9@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:02:24 -04:00
Jan Kara	4f3e6db3c3	Revert "ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()" This reverts commit `7f48212678`. We will need the special cleanup handling once we move allocation of EA inode outside of the buffer lock in the following patch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:02:08 -04:00
Justin Stitt	744a56389f	ext4: replace deprecated strncpy with alternatives strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. in file.c: s_last_mounted is marked as __nonstring meaning it does not need to be NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the string source to the byte array destination -- while also ensuring to pad with zeroes. in ioctl.c: We can drop the memset and size argument in favor of using the new 2-argument version of strscpy_pad() -- which was introduced with Commit `e6584c3964` ("string: Allow 2-argument strscpy()"). This guarantees NUL-termination and NUL-padding on the destination buffer -- which seems to be a requirement judging from this comment: \| static int ext4_ioctl_getlabel(struct ext4_sb_info sbi, char __user user_label) \| { \| char label[EXT4_LABEL_MAX + 1]; \| \| /* \| * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because \| * FSLABEL_MAX must include terminating null byte, while s_volume_name \| * does not have to. \| */ in super.c: s_first_error_func is marked as __nonstring meaning we can take the same approach as in file.c; just use strtomem_pad() Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:55:10 -04:00
Baokun Li	e19089dff5	ext4: clean up s_mb_rb_lock to fix build warnings with C=1 Running sparse (make C=1) on mballoc.c we get the following warning: fs/ext4/mballoc.c:3194:13: warning: context imbalance in 'ext4_mb_seq_structs_summary_start' - wrong count at exit This is because __acquires(&EXT4_SB(sb)->s_mb_rb_lock) was called in ext4_mb_seq_structs_summary_start(), but s_mb_rb_lock was removed in commit `83e80a6e35` ("ext4: use buckets for cr 1 block scan instead of rbtree"), so remove the __acquires to silence the warning. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	261341a932	ext4: set the type of max_zeroout to unsigned int to avoid overflow The max_zeroout is of type int and the s_extent_max_zeroout_kb is of type uint, and the s_extent_max_zeroout_kb can be freely modified via the sysfs interface. When the block size is 1024, max_zeroout may overflow, so declare it as unsigned int to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	9a9f3a9842	ext4: set type of ac_groups_linear_remaining to __u32 to avoid overflow Now ac_groups_linear_remaining is of type __u16 and s_mb_max_linear_groups is of type unsigned int, so an overflow occurs when setting a value above 65535 through the mb_max_linear_groups sysfs interface. Therefore, the type of ac_groups_linear_remaining is set to __u32 to avoid overflow. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	63bfe84105	ext4: add positive int attr pointer to avoid sysfs variables overflow The following variables controlled by the sysfs interface are of type int and are normally used in the range [0, INT_MAX], but are declared as attr_pointer_ui, and thus may be set to values that exceed INT_MAX and result in overflows to get negative values. err_ratelimit_burst msg_ratelimit_burst warning_ratelimit_burst err_ratelimit_interval_ms msg_ratelimit_interval_ms warning_ratelimit_interval_ms Therefore, we add attr_pointer_pi (aka positive int attr pointer) with a value range of 0-INT_MAX to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	b7b2a5799b	ext4: add new attr pointer attr_mb_order The s_mb_best_avail_max_trim_order is of type unsigned int, and has a range of values well beyond the normal use of the mb_order. Although the mballoc code is careful enough that large numbers don't matter there, but this can mislead the sysadmin into thinking that it's normal to set such values. Hence add a new attr_id attr_mb_order with values in the range [0, 64] to avoid storing garbage values and make us more resilient to surprises in the future. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	13df4d44a3	ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists() We can trigger a slab-out-of-bounds with the following commands: mkfs.ext4 -F /dev/$disk 10G mount /dev/$disk /tmp/test echo 2147483647 > /sys/fs/ext4/$disk/mb_group_prealloc echo test > /tmp/test/file && sync ================================================================== BUG: KASAN: slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] Read of size 8 at addr ffff888121b9d0f0 by task kworker/u2:0/11 CPU: 0 PID: 11 Comm: kworker/u2:0 Tainted: GL 6.7.0-next-20240118 #521 Call Trace: dump_stack_lvl+0x2c/0x50 kasan_report+0xb6/0xf0 ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] ext4_mb_regular_allocator+0x19e9/0x2370 [ext4] ext4_mb_new_blocks+0x88a/0x1370 [ext4] ext4_ext_map_blocks+0x14f7/0x2390 [ext4] ext4_map_blocks+0x569/0xea0 [ext4] ext4_do_writepages+0x10f6/0x1bc0 [ext4] [...] ================================================================== The flow of issue triggering is as follows: // Set s_mb_group_prealloc to 2147483647 via sysfs ext4_mb_new_blocks ext4_mb_normalize_request ext4_mb_normalize_group_request ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc ext4_mb_regular_allocator ext4_mb_choose_next_group ext4_mb_choose_next_group_best_avail mb_avg_fragment_size_order order = fls(len) - 2 = 29 ext4_mb_find_good_group_avg_frag_lists frag_list = &sbi->s_mb_avg_fragment_size[order] if (list_empty(frag_list)) // Trigger SOOB! At 4k block size, the length of the s_mb_avg_fragment_size list is 14, but an oversized s_mb_group_prealloc is set, causing slab-out-of-bounds to be triggered by an attempt to access an element at index 29. Add a new attr_id attr_clusters_in_group with values in the range [0, sbi->s_clusters_per_group] and declare mb_group_prealloc as that type to fix the issue. In addition avoid returning an order from mb_avg_fragment_size_order() greater than MB_NUM_ORDERS(sb) and reduce some useless loops. Fixes: `7e170922f0` ("ext4: Add allocation criteria 1.5 (CR1_5)") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240319113325.3110393-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	57341fe317	ext4: refactor out ext4_generic_attr_show() Refactor out the function ext4_generic_attr_show() to handle the reading of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	f536808adc	ext4: refactor out ext4_generic_attr_store() Refactor out the function ext4_generic_attr_store() to handle the setting of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	9e8e819f8f	ext4: avoid overflow when setting values via sysfs When setting values of type unsigned int through sysfs, we use kstrtoul() to parse it and then truncate part of it as the final set value, when the set value is greater than UINT_MAX, the set value will not match what we see because of the truncation. As follows: $ echo 4294967296 > /sys/fs/ext4/sda/mb_max_linear_groups $ cat /sys/fs/ext4/sda/mb_max_linear_groups 0 So we use kstrtouint() to parse the attr_pointer_ui type to avoid the inconsistency described above. In addition, a judgment is added to avoid setting s_resv_clusters less than 0. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Max Kellermann	c77194965d	Revert "ext4: apply umask if ACL support is disabled" This reverts commit `484fd6c1de`. The commit caused a regression because now the umask was applied to symlinks and the fix is unnecessary because the umask/O_TMPFILE bug has been fixed somewhere else already. Fixes: https://lore.kernel.org/lkml/28DSITL9912E1.2LSZUVTGTO52Q@mforney.org/ Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Michael Forney <mforney@mforney.org> Link: https://lore.kernel.org/r/20240315142956.2420360-1-max.kellermann@ionos.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 18:25:39 -04:00
Al Viro	ead083aeee	set_blocksize(): switch to passing struct file * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:39:44 -04:00
Kent Overstreet	fb092d4072	ext4: add support for FS_IOC_GETFSSYSFSPATH The new sysfs path ioctl lets us get the /sys/fs/ path for a given filesystem in a fs agnostic way, potentially nudging us towards standarizing some of our reporting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Link: https://lore.kernel.org/r/20240315035308.3563511-4-kent.overstreet@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 15:21:23 -04:00
Jan Kara	35a1f12f0c	ext4: avoid excessive credit estimate in ext4_tmpfile() A user with minimum journal size (1024 blocks these days) complained about the following error triggered by generic/697 test in ext4_tmpfile(): run fstests generic/697 at 2024-02-28 05:34:46 JBD2: vfstest wants too many credits credits:260 rsv_credits:0 max:256 EXT4-fs error (device loop0) in __ext4_new_inode:1083: error 28 Indeed the credit estimate in ext4_tmpfile() is huge. EXT4_MAXQUOTAS_INIT_BLOCKS() is 219, then 10 credits from ext4_tmpfile() itself and then ext4_xattr_credits_for_new_inode() adds more credits needed for security attributes and ACLs. Now the EXT4_MAXQUOTAS_INIT_BLOCKS() is in fact unnecessary because we've already initialized quotas with dquot_init() shortly before and so EXT4_MAXQUOTAS_TRANS_BLOCKS() is enough (which boils down to 3 credits). Fixes: `af51a2ac36` ("ext4: ->tmpfile() support") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Luis Henriques <lhenriques@suse.de> Tested-by: Disha Goel <disgoel@linux.ibm.com> Link: https://lore.kernel.org/r/20240307115320.28949-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 14:49:16 -04:00
Thorsten Blum	ea7d09ad7c	ext4: remove unneeded if checks before kfree kfree already checks if its argument is NULL. This fixes two Coccinelle/coccicheck warnings reported by ifnullfree.cocci. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20240317153638.2136-2-thorsten.blum@toblux.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 14:44:51 -04:00
Christoph Hellwig	a0c7cce824	ext4: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method Since commit `a2ad63daa8` ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> [RH: Rebased to upstream] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/e5797bb597219a49043e53e4e90aa494b97dc328.1709215665.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:53:32 -04:00
Ritesh Harjani (IBM)	53c17fe55a	ext4: Remove PAGE_MASK dependency on mpage_submit_folio This patch simply removes the PAGE_MASK dependency since mpage_submit_folio() is already converted to work with folio. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d6eadb090334ea49ceef4e643b371fabfcea328f.1709182251.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:50:44 -04:00
Ritesh Harjani (IBM)	c2a09f3d78	ext4: Fixes len calculation in mpage_journal_page_buffers Truncate operation can race with writeback, in which inode->i_size can get truncated and therefore size - folio_pos() can be negative. This fixes the len calculation. However this path doesn't get easily triggered even with data journaling. Cc: stable@kernel.org # v6.5 Fixes: `80be8c5cc9` ("Fixes: ext4: Make mpage_journal_page_buffers use folio") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/cff4953b5c9306aba71e944ab176a5d396b9a1b7.1709182250.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:50:44 -04:00
Christian Brauner	210a03c9d5	fs: claw back a few FMODE_* bits There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-07 13:49:02 +02:00
Christian Brauner	22650a9982	fs,block: yield devices early Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: `f3a608827d` ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-27 13:17:15 +01:00
Luis Henriques (SUSE)	7b30851a70	fs_parser: move fsparam_string_empty() helper into header Since both ext4 and overlayfs define the same macro to specify string parameters that may allow empty values, define it in an header file so that this helper can be shared. Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev> Link: https://lore.kernel.org/r/20240312104757.27333-1-luis.henriques@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-03-26 09:01:18 +01:00
Linus Torvalds	68bf6bfdcf	Ext4 bug fixes and cleanups for 6.9-rc1, plus some additional kunit tests. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmXydHkACgkQ8vlZVpUN gaPFcQf/e1DcEw7dITXoOJ16Sz3pI3ykFEae3aIp1C0DoBL6ncnx4NrKJlbKVmfG CvYwwaPIILps0W5gwRll0wG8G9wrx+QY+xx5elsFKlfLsiRmkvXEIFPELYvtblcG u6fXumpArtH2dbjsmxw+gxEuborl3aeOIWW62dVvarEpfdvFlEwMAfBYlJ/E4HKM z74CmR09sr51XuQZTKaUNioyS6qNR/HIBoelJ50Xt6qLZrpfyIxtU/wHbN1GAM5+ pBXCYxlBaiSJHb8p9R99DT5JqVD5zwrqWscbajEhOJo4QQQacJGGvIOHz6b6FMRV +dPnTBh79t8DAktqT6LAf83bmiRCWQ== =4/t9 -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Ext4 bug fixes and cleanups, plus some additional kunit tests" * tag 'ext4_for_linus-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: initialize sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter before use in kunit test ext4: hold group lock in ext4 kunit test ext4: alloc test super block from sget ext4: kunit: use dynamic inode allocation ext4: enable meta_bg only when new desc blocks are needed ext4: remove unused parameter biop in ext4_issue_discard() ext4: remove SLAB_MEM_SPREAD flag usage ext4: verify s_clusters_per_group even without bigalloc ext4: fix corruption during on-line resize ext4: don't report EOPNOTSUPP errors from discard ext4: drop duplicate ea_inode handling in ext4_xattr_block_set() ext4: fold quota accounting into ext4_xattr_inode_lookup_create() ext4: correct best extent lstart adjustment logic ext4: forbid commit inconsistent quota data when errors=remount-ro ext4: add a hint for block bitmap corrupt state in mb_groups ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks() ext4: improve error msg for ext4_mb_seq_groups_show ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show ext4: Add unit test for ext4_mb_mark_diskspace_used ext4: Add unit test for mb_free_blocks ...	2024-03-15 09:20:30 -07:00
Linus Torvalds	e5e038b7ae	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmXx5kwACgkQnJ2qBz9k QNmZowf/UlGJ1rmQFFhoodn3SyK48tQjOZ23Ygx6v9FZiLMuQ3b1k0kWKmwM4lZb mtRriCm+lPO9Yp/Sflz+jn8S51b/2bcTXiPV4w2Y4ZIun41wwggV7rWPnTCHhu94 rGEPu/SNSBdpxWGv43BKHSDl4XolsGbyusQKBbKZtftnrpIf0y2OnyEXSV91Vnlh KM/XxzacBD4/3r4KCljyEkORWlIIn2+gdZf58sKtxLKvnfCIxjB+BF1e0gOWgmNQ e/pVnzbAHO3wuavRlwnrtA+ekBYQiJq7T61yyYI8zpeSoLHmwvPoKSsZP+q4BTvV yrcVCbGp3uZlXHD93U3BOfdqS0xBmg== =84Q4 -----END PGP SIGNATURE----- Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, isofs, udf, and quota updates from Jan Kara: "A lot of material this time: - removal of a lot of GFP_NOFS usage from ext2, udf, quota (either it was legacy or replaced with scoped memalloc_nofs_() API) - removal of BUG_ONs in quota code - conversion of UDF to the new mount API - tightening quota on disk format verification - fix some potentially unsafe use of RCU pointers in quota code and annotate everything properly to make sparse happy - a few other small quota, ext2, udf, and isofs fixes" tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (26 commits) udf: remove SLAB_MEM_SPREAD flag usage quota: remove SLAB_MEM_SPREAD flag usage isofs: remove SLAB_MEM_SPREAD flag usage ext2: remove SLAB_MEM_SPREAD flag usage ext2: mark as deprecated udf: convert to new mount API udf: convert novrs to an option flag MAINTAINERS: add missing git address for ext2 entry quota: Detect loops in quota tree quota: Properly annotate i_dquot arrays with __rcu quota: Fix rcu annotations of inode dquot pointers isofs: handle CDs with bad root inode but good Joliet root directory udf: Avoid invalid LVID used on mount quota: Fix potential NULL pointer dereference quota: Drop GFP_NOFS instances under dquot->dq_lock and dqio_sem quota: Set nofs allocation context when acquiring dqio_sem ext2: Remove GFP_NOFS use in ext2_xattr_cache_insert() ext2: Drop GFP_NOFS use in ext2_get_blocks() ext2: Drop GFP_NOFS allocation from ext2_init_block_alloc_info() udf: Remove GFP_NOFS allocation in udf_expand_file_adinicb() ...	2024-03-13 14:30:58 -07:00
Linus Torvalds	f88c3fb81c	mm, slab: remove last vestiges of SLAB_MEM_SPREAD Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-03-12 20:32:19 -07:00
Linus Torvalds	0f1a876682	vfs-6.9.uuid -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem5LwAKCRCRxhvAZXjc onZsAQCjMNabNWAty2VBAQrNIpGkZ+AMA2DxEajPldaPiJH5zQEA9ea7feB3T47i NUrXXfMQ5DSop+k5Y65pPkEpbX4rhQo= =NZgd -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid()	2024-03-11 11:02:06 -07:00
Linus Torvalds	910202f00a	vfs-6.9.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l 9NdULtniW1reuCYkc8R7dYM8S+yAwAc= =Y1qX -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_() via struct bdev_handle bdev_file_open_by_() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...	2024-03-11 10:52:34 -07:00
Linus Torvalds	7ea65c89d8	vfs-6.9.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem3wQAKCRCRxhvAZXjc otRMAQDeo8qsuuIAcS2KUicKqZR5yMVvrY9r4sQzf7YRcJo5HQD+NQXkKwQuv1VO OUeScsic/+I+136AgdjWnlEYO5dp0go= =4WKU -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Misc features, cleanups, and fixes for vfs and individual filesystems. Features: - Support idmapped mounts for hugetlbfs. - Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug where the passed offset is ignored if the file is O_APPEND. The new flag allows a caller to enforce that the offset is honored to conform to posix even if the file was opened in append mode. - Move i_mmap_rwsem in struct address_space to avoid false sharing between i_mmap and i_mmap_rwsem. - Convert efs, qnx4, and coda to use the new mount api. - Add a generic is_dot_dotdot() helper that's used by various filesystems and the VFS code instead of open-coding it multiple times. - Recently we've added stable offsets which allows stable ordering when iterating directories exported through NFS on e.g., tmpfs filesystems. Originally an xarray was used for the offset map but that caused slab fragmentation issues over time. This switches the offset map to the maple tree which has a dense mode that handles this scenario a lot better. Includes tests. - Finally merge the case-insensitive improvement series Gabriel has been working on for a long time. This cleanly propagates case insensitive operations through ->s_d_op which in turn allows us to remove the quite ugly generic_set_encrypted_ci_d_ops() operations. It also improves performance by trying a case-sensitive comparison first and then fallback to case-insensitive lookup if that fails. This also fixes a bug where overlayfs would be able to be mounted over a case insensitive directory which would lead to all sort of odd behaviors. Cleanups: - Make file_dentry() a simple accessor now that ->d_real() is simplified because of the backing file work we did the last two cycles. - Use the dedicated file_mnt_idmap helper in ntfs3. - Use smp_load_acquire/store_release() in the i_size_read/write helpers and thus remove the hack to handle i_size reads in the filemap code. - The SLAB_MEM_SPREAD is a nop now. Remove it from various places in fs/ - It's no longer necessary to perform a second built-in initramfs unpack call because we retain the contents of the previous extraction. Remove it. - Now that we have removed various allocators kfree_rcu() always works with kmem caches and kmalloc(). So simplify various places that only use an rcu callback in order to handle the kmem cache case. - Convert the pipe code to use a lockdep comparison function instead of open-coding the nesting making lockdep validation easier. - Move code into fs-writeback.c that was located in a header but can be made static as it's only used in that one file. - Rewrite the alignment checking iterators for iovec and bvec to be easier to read, and also significantly more compact in terms of generated code. This saves 270 bytes of text on x86-64 (with clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also saves a bit of time for the same workload. - Switch various places to use KMEM_CACHE instead of kmem_cache_create(). - Use inode_set_ctime_to_ts() in inode_set_ctime_current() - Use kzalloc() in name_to_handle_at() to avoid kernel infoleak. - Various smaller cleanups for eventfds. Fixes: - Fix various comments and typos, and unneeded initializations. - Fix stack allocation hack for clang in the select code. - Improve dump_mapping() debug code on a best-effort basis. - Fix build errors in various selftests. - Avoid wrap-around instrumentation in various places. - Don't allow user namespaces without an idmapping to be used for idmapped mounts. - Fix sysv sb_read() call. - Fix fallback implementation of the get_name() export operation" * tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits) hugetlbfs: support idmapped mounts qnx4: convert qnx4 to use the new mount api fs: use inode_set_ctime_to_ts to set inode ctime to current time libfs: Drop generic_set_encrypted_ci_d_ops ubifs: Configure dentry operations at dentry-creation time f2fs: Configure dentry operations at dentry-creation time ext4: Configure dentry operations at dentry-creation time libfs: Add helper to choose dentry operations at mount-time libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops fscrypt: Drop d_revalidate once the key is added fscrypt: Drop d_revalidate for valid dentries during lookup fscrypt: Factor out a helper to configure the lookup dentry ovl: Always reject mounting over case-insensitive directories libfs: Attempt exact-match comparison first during casefolded lookup efs: remove SLAB_MEM_SPREAD flag usage jfs: remove SLAB_MEM_SPREAD flag usage minix: remove SLAB_MEM_SPREAD flag usage openpromfs: remove SLAB_MEM_SPREAD flag usage proc: remove SLAB_MEM_SPREAD flag usage qnx6: remove SLAB_MEM_SPREAD flag usage ...	2024-03-11 09:38:17 -07:00
Kemeng Shi	0ecae5410a	ext4: initialize sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter before use in kunit test Fix that sbi->s_freeclusters_counter and sbi->s_dirtyclusters_counter are used before initialization. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Kemeng Shi	ad943758e0	ext4: hold group lock in ext4 kunit test Although there is no concurrent block allocation/free in unit test, internal functions mb_mark_used and mb_free_blocks assert group lock is always held. Acquire group before calling mb_mark_used and mb_free_blocks in unit test to avoid the assertion. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Kemeng Shi	8ffc0cd24c	ext4: alloc test super block from sget This fix the oops in ext4 unit test which is cuased by NULL sb.s_user_ns as following: <4>[ 14.344565] map_id_range_down (kernel/user_namespace.c:318) <4>[ 14.345378] make_kuid (kernel/user_namespace.c:415) <4>[ 14.345998] inode_init_always (include/linux/fs.h:1375 fs/inode.c:174) <4>[ 14.346696] alloc_inode (fs/inode.c:268) <4>[ 14.347353] new_inode_pseudo (fs/inode.c:1007) <4>[ 14.348016] new_inode (fs/inode.c:1033) <4>[ 14.348644] ext4_mb_init (fs/ext4/mballoc.c:3404 fs/ext4/mballoc.c:3719) <4>[ 14.349312] mbt_kunit_init (fs/ext4/mballoc-test.c:57 fs/ext4/mballoc-test.c:314) <4>[ 14.349983] kunit_try_run_case (lib/kunit/test.c:388 lib/kunit/test.c:443) <4>[ 14.350696] kunit_generic_run_threadfn_adapter (lib/kunit/try-catch.c:30) <4>[ 14.351530] kthread (kernel/kthread.c:388) <4>[ 14.352168] ret_from_fork (arch/arm64/kernel/entry.S:861) <0>[ 14.353385] Code: 52808004 b8236ae7 72be5e44 b90004c4 (38e368a1) Alloc test super block from sget to properly initialize test super block to fix the issue. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240304163543.6700-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Arnd Bergmann	d60c53694c	ext4: kunit: use dynamic inode allocation Storing an inode structure on the stack pushes some functions over the warning limit for stack frame size: In file included from fs/ext4/mballoc.c:7039: fs/ext4/mballoc-test.c:506:13: error: stack frame size (1032) exceeds limit (1024) in 'test_mark_diskspace_used' [-Werror,-Wframe-larger-than] 506 \| static void test_mark_diskspace_used(struct kunit *test) \| ^ Use kunit_kzalloc() for all inodes. There may be a better way to do it by preallocating the inode, which would result in a larger rework. Fixes: `2b81493f8e` ("ext4: Add unit test for ext4_mb_mark_diskspace_used") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240227161548.2929881-1-arnd@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Srivathsa Dara	07be778c70	ext4: enable meta_bg only when new desc blocks are needed This patch addresses an issue observed when resize_inode is disabled and an online extension of a filesysyem is performed. When a filesystem is expanded to a size that does not require a addition of a new descriptor block, the meta_bg feature is being enabled even though no part of the filesystem uses this layout. This patch ensures that the meta_bg feature is only enabled if any of the added block groups utilize meta_bg layout. Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Link: https://lore.kernel.org/r/20240227131329.2608466-1-srivathsa.d.dara@oracle.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Wenchao Hao	0efcd739fc	ext4: remove unused parameter biop in ext4_issue_discard() all caller of ext4_issue_discard() would set biop to NULL since 'commit `55cdd0af2b` ("ext4: get discard out of jbd2 commit kthread contex")', it's unnecessary to keep this parameter any more. Signed-off-by: Wenchao Hao <haowenchao2@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240226081731.3224470-1-haowenchao2@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Chengming Zhou	708623737b	ext4: remove SLAB_MEM_SPREAD flag usage The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit `16a1d96835` ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Link: https://lore.kernel.org/r/20240224134822.829456-1-chengming.zhou@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Jan Kara	40da553f5d	ext4: verify s_clusters_per_group even without bigalloc Currently we ignore s_clusters_per_group field in the on-disk superblock if bigalloc feature is not enabled. However e2fsprogs don't even open the filesystem if s_clusters_per_group is invalid. This results in an odd state where kernel happily works with the filesystem while even e2fsck refuses to touch it. Verify that s_clusters_per_group is valid even if bigalloc feature is not enabled to make things consistent. Due to current e2fsprogs behavior it is unlikely there are filesystems out in the wild (except for intentionally fuzzed ones) with invalid s_clusters_per_group counts. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240219171033.22882-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Maximilian Heyne	a6b3bfe176	ext4: fix corruption during on-line resize We observed a corruption during on-line resize of a file system that is larger than 16 TiB with 4k block size. With having more then 2^32 blocks resize_inode is turned off by default by mke2fs. The issue can be reproduced on a smaller file system for convenience by explicitly turning off resize_inode. An on-line resize across an 8 GiB boundary (the size of a meta block group in this setup) then leads to a corruption: dev=/dev/<some_dev> # should be >= 16 GiB mkdir -p /corruption /sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 221 - 215)) mount -t ext4 $dev /corruption dd if=/dev/zero bs=4096 of=/corruption/test count=$((2221 - 42*15)) sha1sum /corruption/test # 79d2658b39dcfd77274e435b0934028adafaab11 /corruption/test /sbin/resize2fs $dev $((22*21)) # drop page cache to force reload the block from disk echo 1 > /proc/sys/vm/drop_caches sha1sum /corruption/test # 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3 /corruption/test 2^21 = 2^152^6 equals 8 GiB whereof 2^15 is the number of blocks per block group and 2^6 are the number of block groups that make a meta block group. The last checksum might be different depending on how the file is laid out across the physical blocks. The actual corruption occurs at physical block 63*2^15 = 2064384 which would be the location of the backup of the meta block group's block descriptor. During the on-line resize the file system will be converted to meta_bg starting at s_first_meta_bg which is 2 in the example - meaning all block groups after 16 GiB. However, in ext4_flex_group_add we might add block groups that are not part of the first meta block group yet. In the reproducer we achieved this by substracting the size of a whole block group from the point where the meta block group would start. This must be considered when updating the backup block group descriptors to follow the non-meta_bg layout. The fix is to add a test whether the group to add is already part of the meta block group or not. Fixes: `01f795f9e0` ("ext4: add online resizing support for meta_bg and 64-bit file systems") Cc: <stable@vger.kernel.org> Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com> Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Jan Kara	fa60629380	ext4: don't report EOPNOTSUPP errors from discard When ext4 is mounted without journal, with discard mount option, and on a device not supporting trim, we print error for each and every freed extent. This is not only useless but actively harmful. Instead ignore the EOPNOTSUPP error. Trim is only advisory anyway and when the filesystem has journal we silently ignore trim error as well. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240213101601.17463-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Jan Kara	7f48212678	ext4: drop duplicate ea_inode handling in ext4_xattr_block_set() ext4_xattr_block_set() drops ea_inode reference in two places. Handling it just under the 'cleanup' label is enough so drop the second occurence. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240209112107.10585-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-03-07 13:32:54 -05:00
Gabriel Krisman Bertazi	04aa5f4eba	ext4: Configure dentry operations at dentry-creation time This was already the case for case-insensitive before commit `bb9cd9106b` ("fscrypt: Have filesystems handle their d_ops"), but it was changed to set at lookup-time to facilitate the integration with fscrypt. But it's a problem because dentries that don't get created through ->lookup() won't have any visibility of the operations. Since fscrypt now also supports configuring dentry operations at creation-time, do it for any encrypted and/or casefold volume, simplifying the implementation across these features. Acked-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20240221171412.10710-8-krisman@suse.de Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>	2024-02-27 16:55:34 -05:00
Linus Torvalds	66a97c2ec9	We still have some races in filesystem methods when exposed to RCU pathwalk. This series is a result of code audit (the second round of it) and it should deal with most of that stuff. Exceptions: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate(). Up to maintainers (a note for NTFS folks - when documentation says that a method may not block, it does imply that blocking allocations are to be avoided. Really). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdroDAAKCRBZ7Krx/gZQ 60dKAQCzp6rYr3ye4nylho9Rzu8LEpH04TuNf3C6JuyUaNHxHwEAvNLatZsyFnmV Ksp2Rg/IlKPNtQgYJ8xPxv9DFmNe8gI= =47Un -----END PGP SIGNATURE----- Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull RCU pathwalk fixes from Al Viro: "We still have some races in filesystem methods when exposed to RCU pathwalk. This series is a result of code audit (the second round of it) and it should deal with most of that stuff. Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate(). Up to maintainers (a note for NTFS folks - when documentation says that a method may not block, it does imply that blocking allocations are to be avoided. Really)" [ More explanations for people who aren't familiar with the vagaries of RCU path walking: most of it is hidden from filesystems, but if a filesystem actively participates in the low-level path walking it needs to make sure the fields involved in that walk are RCU-safe. That "actively participate in low-level path walking" includes things like having its own ->d_hash()/->d_compare() routines, or by having its own directory permission function that doesn't just use the common helpers. Having a ->d_revalidate() function will also have this issue. Note that instead of making everything RCU safe you can also choose to abort the RCU pathwalk if your operation cannot be done safely under RCU, but that obviously comes with a performance penalty. One common pattern is to allow the simple cases under RCU, and abort only if you need to do something more complicated. So not everything needs to be RCU-safe, and things like the inode etc that the VFS itself maintains obviously already are. But these fixes tend to be about properly RCU-delaying things like ->s_fs_info that are maintained by the filesystem and that got potentially released too early. - Linus ] * tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ext4_get_link(): fix breakage in RCU mode cifs_get_link(): bail out in unsafe case fuse: fix UAF in rcu pathwalks procfs: make freeing proc_fs_info rcu-delayed procfs: move dropping pde and pid from ->evict_inode() to ->free_inode() nfs: fix UAF on pathwalk running into umount nfs: make nfs_set_verifier() safe for use in RCU pathwalk afs: fix __afs_break_callback() / afs_drop_open_mmap() race hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper affs: free affs_sb_info with kfree_rcu() rcu pathwalk: prevent bogus hard errors from may_lookup() fs/super.c: don't drop ->s_user_ns until we free struct super_block itself	2024-02-25 09:29:05 -08:00
Christian Brauner	61ead71476	ext4: port block device access to file Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-21-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-25 12:05:26 +01:00
Al Viro	9fa8e282c2	ext4_get_link(): fix breakage in RCU mode 1) errors from ext4_getblk() should not be propagated to caller unless we are really sure that we would've gotten the same error in non-RCU pathwalk. 2) we leak buffer_heads if ext4_getblk() is successful, but bh is not uptodate. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-02-25 02:10:32 -05:00
Jan Kara	8208c41c43	ext4: fold quota accounting into ext4_xattr_inode_lookup_create() When allocating EA inode, quota accounting is done just before ext4_xattr_inode_lookup_create(). Logically these two operations belong together so just fold quota accounting into ext4_xattr_inode_lookup_create(). We also make ext4_xattr_inode_lookup_create() return the looked up / created inode to convert the function to a more standard calling convention. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240209112107.10585-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 23:44:50 -05:00
Baokun Li	4fbf8bc733	ext4: correct best extent lstart adjustment logic When yangerkun review commit `93cdf49f6e` ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best extent did not completely cover the original request after adjusting the best extent lstart in ext4_mb_new_inode_pa() as follows: original request: 2/10(8) normalized request: 0/64(64) best extent: 0/9(9) When we check if best ex can be kept at start of goal, ac_o_ex.fe_logical is 2 less than the adjusted best extent logical end 9, so we think the adjustment is done. But obviously 0/9(9) doesn't cover 2/10(8), so we should determine here if the original request logical end is less than or equal to the adjusted best extent logical end. In addition, add a comment stating when adjusted best_ex will not cover the original request, and remove the duplicate assertion because adjusting lstart makes no change to b_ex.fe_len. Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com Fixes: `93cdf49f6e` ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Cc: <stable@kernel.org> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:46:39 -05:00
Ye Bin	d8b945fa47	ext4: forbid commit inconsistent quota data when errors=remount-ro There's issue as follows When do IO fault injection test: Quota error (device dm-3): find_block_dqentry: Quota for id 101 referenced but not present Quota error (device dm-3): qtree_read_dquot: Can't read quota structure for id 101 Quota error (device dm-3): do_check_range: Getting block 2021161007 out of range 1-186 Quota error (device dm-3): qtree_read_dquot: Can't read quota structure for id 661 Now, ext4_write_dquot()/ext4_acquire_dquot()/ext4_release_dquot() may commit inconsistent quota data even if process failed. This may lead to filesystem corruption. To ensure filesystem consistent when errors=remount-ro there is need to call ext4_handle_error() to abort journal. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240119062908.3598806-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:40:57 -05:00
Zhang Yi	68ee261fb1	ext4: add a hint for block bitmap corrupt state in mb_groups If one group is marked as block bitmap corrupted, its free blocks cannot be used and its free count is also deducted from the global sbi->s_freeclusters_counter. User might be confused about the absent free space because we can't query the information about corrupted block groups except unreliable error messages in syslog. So add a hint to show block bitmap corrupted groups in mb_groups. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240119061154.1525781-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:39:38 -05:00
Cheng Nie	547e64bda9	ext4: fix the comment of ext4_map_blocks()/ext4_ext_map_blocks() this comment of ext4_map_blocks()/ext4_ext_map_blocks() need update after commit c21770573319("ext4: Define a new set of flags for ext4_get_blocks()"). Signed-off-by: Cheng Nie <niecheng1@uniontech.com> Link: https://lore.kernel.org/r/20240118062511.28276-1-niecheng1@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:21 -05:00
yangerkun	4b55d3431c	ext4: improve error msg for ext4_mb_seq_groups_show While cat mb_groups for a mounted ext4 image which has some corrupted group, the string return to userspace was just "I/O error" which confuse me a lot. Improve it with ext4_decode_error. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240118042557.380058-2-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:21 -05:00
yangerkun	250448802c	ext4: remove unused buddy_loaded in ext4_mb_seq_groups_show We can just first call ext4_mb_unload_buddy, then copy information from ext4_group_info. So remove this unused value. Signed-off-by: yangerkun <yangerkun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240118042557.380058-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:21 -05:00
Kemeng Shi	2b81493f8e	ext4: Add unit test for ext4_mb_mark_diskspace_used Add unit test for ext4_mb_mark_diskspace_used Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240103104900.464789-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:20 -05:00
Kemeng Shi	b7098e1fa7	ext4: Add unit test for mb_free_blocks Add unit test for mb_free_blocks. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240103104900.464789-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:20 -05:00
Kemeng Shi	ac96b56a2f	ext4: Add unit test for mb_mark_used Add unit test for mb_mark_used Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240103104900.464789-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:20 -05:00
Kemeng Shi	67d2a11b22	ext4: Add unit test of ext4_mb_generate_buddy Add unit test of ext4_mb_generate_buddy Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240103104900.464789-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:20 -05:00
Kemeng Shi	6c5e0c9c21	ext4: Add unit test for test_free_blocks_simple Add unit test for test_free_blocks_simple. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240103104900.464789-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-21 22:33:20 -05:00
Kent Overstreet	a4af51ce22	fs: super_set_uuid() Some weird old filesytems have UUID-like things that we wish to expose as UUIDs, but are smaller; add a length field so that the new FS_IOC_(GET\|SET)UUID ioctls can handle them in generic code. And add a helper super_set_uuid(), for setting nonstandard length uuids. Helper is now required for the new FS_IOC_GETUUID ioctl; if super_set_uuid() hasn't been called, the ioctl won't be supported. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240207025624.1019754-2-kent.overstreet@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-08 21:19:59 +01:00
Jan Kara	ccb49011bb	quota: Properly annotate i_dquot arrays with __rcu Dquots pointed to from i_dquot arrays in inodes are protected by dquot_srcu. Annotate them as such and change .get_dquots callback to return properly annotated pointer to make sparse happy. Fixes: `b9ba6f94b2` ("quota: remove dqptr_sem") Signed-off-by: Jan Kara <jack@suse.cz>	2024-02-08 12:04:59 +01:00
Linus Torvalds	3f24fcdacd	Miscellaneous bug fixes and cleanups in ext4's multi-block allocator and extent handling code. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmW/G4YACgkQ8vlZVpUN gaPTpwf/c/Fk27GV8ge9PQtR6gmir/lyw2qkvK3Z+12aEsblZRmyvElyZWjAuNQG bciQyltabIPOA4XxfsZOrdgYI42n0rTTFG7bmI0lr+BJM/HRw0tGGMN91FZla0FP EXv/AiHKCqlT5OFZbD+8n1TzfdOgWotjug1VLteXve3YKjkDgt5IQm/0Gx9hKBld IR8SrQlD/rYe+VPvaHz5G4u09Ne5pUE5fDj3xE23wxfU5cloVzuVRCSOGWUCTnCW T0v6sHeKrmiLC8tIOZkBjer4nXC0MOu0p5+geAjwOArc9VJ1Lh2eAkH+GgDOVprx ahdl2qmbIbacBYECIeQ/+1i78+O1yw== =CmYr -----END PGP SIGNATURE----- Merge tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous bug fixes and cleanups in ext4's multi-block allocator and extent handling code" * tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits) ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type ext4: make ext4_map_blocks() distinguish delalloc only extent ext4: add a hole extent entry in cache after punch ext4: correct the hole length returned by ext4_map_blocks() ext4: convert to exclusive lock while inserting delalloc extents ext4: refactor ext4_da_map_blocks() ext4: remove 'needed' in trace_ext4_discard_preallocations ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations ext4: remove unused return value of ext4_mb_release_group_pa ext4: remove unused return value of ext4_mb_release_inode_pa ext4: remove unused return value of ext4_mb_release ext4: remove unused ext4_allocation_context::ac_groups_considered ext4: remove unneeded return value of ext4_mb_release_context ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*() ext4: remove unused return value of __mb_check_buddy ext4: mark the group block bitmap as corrupted before reporting an error ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal() ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found() ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks() ...	2024-02-04 07:33:01 +00:00
Zhang Yi	ec9d669eba	ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type Since ext4_map_blocks() can recognize a delayed allocated only extent, make ext4_set_iomap() can also recognize it, and remove the useless separate check in ext4_iomap_begin_report(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:59:21 -05:00
Zhang Yi	874eaba96d	ext4: make ext4_map_blocks() distinguish delalloc only extent Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a delayed allocated only (not unwritten) one, and making ext4_map_blocks() can distinguish it, no longer mixing it with holes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:59:21 -05:00
Zhang Yi	9f1118223a	ext4: add a hole extent entry in cache after punch In order to cache hole extents in the extent status tree and keep the hole length as long as possible, re-add a hole entry to the cache just after punching a hole. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:59:21 -05:00
Zhang Yi	6430dea07e	ext4: correct the hole length returned by ext4_map_blocks() In ext4_map_blocks(), if we can't find a range of mapping in the extents cache, we are calling ext4_ext_map_blocks() to search the real path and ext4_ext_determine_hole() to determine the hole range. But if the querying range was partially or completely overlaped by a delalloc extent, we can't find it in the real extent path, so the returned hole length could be incorrect. Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc extent, but it searches start from the expanded hole_start, doesn't start from the querying range, so the delalloc extent found could not be the one that overlaped the querying range, plus, it also didn't adjust the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle delalloc and insert adjusted hole extent in ext4_ext_determine_hole(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:47:02 -05:00
Zhang Yi	acf795dc16	ext4: convert to exclusive lock while inserting delalloc extents ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem when inserting delalloc extents, it could be raced by another querying path of ext4_map_blocks() without i_rwsem, .e.g buffered read path. Suppose we buffered read a file containing just a hole, and without any cached extents tree, then it is raced by another delayed buffered write to the same area or the near area belongs to the same hole, and the new delalloc extent could be overwritten to a hole extent. pread() pwrite() filemap_read_folio() ext4_mpage_readpages() ext4_map_blocks() down_read(i_data_sem) ext4_ext_determine_hole() //find hole ext4_ext_put_gap_in_cache() ext4_es_find_extent_range() //no delalloc extent ext4_da_map_blocks() down_read(i_data_sem) ext4_insert_delayed_block() //insert delalloc extent ext4_es_insert_extent() //overwrite delalloc extent to hole This race could lead to inconsistent delalloc extents tree and incorrect reserved space counter. Fix this by converting to hold i_data_sem in exclusive mode when adding a new delalloc extent in ext4_da_map_blocks(). Cc: stable@vger.kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:47:02 -05:00
Zhang Yi	3fcc2b887a	ext4: refactor ext4_da_map_blocks() Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary parameters and branches, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240127015825.1608160-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-02-01 23:47:02 -05:00
Kemeng Shi	f0e54b6087	ext4: remove 'needed' in trace_ext4_discard_preallocations As 'needed' to trace_ext4_discard_preallocations is always 0 which is meaningless. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-10-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	2ffd2a6ad1	ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations The "needed" controls the number of ext4_prealloc_space to discard in ext4_discard_preallocations. Function ext4_discard_preallocations is supposed to discard all non-used preallocated blocks when "needed" is 0 and now ext4_discard_preallocations is always called with "needed" = 0. Remove unnecessary parameter "needed" and remove all non-used preallocated spaces in ext4_discard_preallocations to simplify the code. Note: If count of non-used preallocated spaces could be more than UINT_MAX, there was a memory leak as some non-used preallocated spaces are left ununsed and this commit will fix it. Otherwise, there is no behavior change. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	20427949b9	ext4: remove unused return value of ext4_mb_release_group_pa Remove unused return value of ext4_mb_release_group_pa. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-8-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	820c280896	ext4: remove unused return value of ext4_mb_release_inode_pa Remove unused return value of ext4_mb_release_inode_pa Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	908177175a	ext4: remove unused return value of ext4_mb_release Remove unused return value of ext4_mb_release. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	97c32dbffc	ext4: remove unused ext4_allocation_context::ac_groups_considered Remove unused ext4_allocation_context::ac_groups_considered Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:45 -05:00
Kemeng Shi	11fd1a5d64	ext4: remove unneeded return value of ext4_mb_release_context Function ext4_mb_release_context always return 0 and the return value is never used. Just remove unneeded return value of ext4_mb_release_context. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:44 -05:00
Kemeng Shi	438a35e72d	ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_() Remove unused parameter ngroup in ext4_mb_choose_next_group_(). Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240105092102.496631-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:44 -05:00
Kemeng Shi	133de5a0d8	ext4: remove unused return value of __mb_check_buddy Remove unused return value of __mb_check_buddy. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240105092102.496631-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:52:44 -05:00
Baokun Li	c5f3a3821d	ext4: mark the group block bitmap as corrupted before reporting an error Otherwise unlocking the group in ext4_grp_locked_error may allow other processes to modify the core block bitmap that is known to be corrupt. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:50:24 -05:00
Baokun Li	832698373a	ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal() Places the logic for checking if the group's block bitmap is corrupt under the protection of the group lock to avoid allocating blocks from the group with a corrupted block bitmap. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:50:24 -05:00
Baokun Li	4530b3660d	ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found() Determine if the group block bitmap is corrupted before using ac_b_ex in ext4_mb_try_best_found() to avoid allocating blocks from a group with a corrupted block bitmap in the following concurrency and making the situation worse. ext4_mb_regular_allocator ext4_lock_group(sb, group) ext4_mb_good_group // check if the group bbitmap is corrupted ext4_mb_complex_scan_group // Scan group gets ac_b_ex but doesn't use it ext4_unlock_group(sb, group) ext4_mark_group_bitmap_corrupted(group) // The block bitmap was corrupted during // the group unlock gap. ext4_mb_try_best_found ext4_lock_group(ac->ac_sb, group) ext4_mb_use_best_found mb_mark_used // Allocating blocks in block bitmap corrupted group Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:50:24 -05:00
Baokun Li	993bf0f4c3	ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt Determine if bb_fragments is 0 instead of determining bb_free to eliminate the risk of dividing by zero when the block bitmap is corrupted. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:50:24 -05:00
Baokun Li	2331fd4a49	ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks() After updating bb_free in mb_free_blocks, it is possible to return without updating bb_fragments because the block being freed is found to have already been freed, which leads to inconsistency between bb_free and bb_fragments. Since the group may be unlocked in ext4_grp_locked_error(), this can lead to problems such as dividing by zero when calculating the average fragment length. Hence move the update of bb_free to after the block double-free check guarantees that the corresponding statistics are updated only after the core block bitmap is modified. Fixes: `eabe0444df` ("ext4: speed-up releasing blocks on commit") CC: <stable@vger.kernel.org> # 3.10 Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240104142040.2835097-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-01-18 10:50:24 -05:00

... 3 4 5 6 7 ...

5406 Commits