mirror_ubuntu-kernels

mirror of https://git.proxmox.com/git/mirror_ubuntu-kernels.git synced 2025-11-18 17:38:00 +00:00

Author	SHA1	Message	Date
Youling Tang	36aa49d33e	bcachefs: Change destroy_inode to free_inode The vfs[1] documentation describes free_inode as follows: ``` free_inode this method is called from RCU callback. If you use call_rcu() in ->destroy_inode to free ‘struct inode’ memory, then it’s better to release memory in this method. ``` free_inode will be called by the RCU callback, so it might be better to move the inode free operation to destroy_inode. Similar to commit `ae6b47b565` ("fs/ntfs3: Change destroy_inode to free_inode"). Link: [1]: https://www.kernel.org/doc/html/latest/filesystems/vfs.html Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	c8bda9f20a	bcachefs: Simplify resuming of journal position Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	83c38e3ef8	bcachefs: check inode backpointer in bch2_lookup() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	4da1713a8d	bcachefs: check for inodes that should have backpointers in fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	45150765d3	bcachefs: bch_member.last_journal_bucket On recovery from clean shutdown we don't typically read the journal, but we still want to avoid overwriting existing entries in the journal for list_journal debugging. Thus, add some fields to the member info section so we can remember where we left off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	c749541353	bcachefs: uninline set_btree_iter_dontneed() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Hongbo Li	0af0b963b5	bcachefs: eliminate the uninitialized compilation warning in bch2_reconstruct_snapshots When compiling the bcachefs-tools, the following compilation warning is reported: libbcachefs/snapshot.c: In function ‘bch2_reconstruct_snapshots’: libbcachefs/snapshot.c:915:19: warning: ‘tree_id’ may be used uninitialized in this function [-Wmaybe-uninitialized] 915 \| snapshot->v.tree = cpu_to_le32(tree_id); libbcachefs/snapshot.c:903:6: note: ‘tree_id’ was declared here 903 \| u32 tree_id; \| ^~~~~~~ This is a false alert, because @tree_id is changed in bch2_snapshot_tree_create after it returns 0. And if this function returns other value, @tree_id wouldn't be used. Thus there should be nothing wrong in logical. Although the report itself is a false alert, we can still make it more explicit by setting the initial value of @tree_id to 0 (an invalid tree ID). Fixes: `a292be3b68` ("bcachefs: Reconstruct missing snapshot nodes") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	56522d7276	bcachefs: fix btree_path_clone() ip_allocated Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Nathan Chancellor	8bb0eddbbc	bcachefs: Fix format specifiers in bch2_btree_key_cache_to_text() When building for a 32-bit target, for which 'size_t' is 'unsigned int', there are two warnings around mismatched format specifiers and argument types: In file included from fs/bcachefs/vstructs.h:5, from fs/bcachefs/bcachefs_format.h:79, from fs/bcachefs/bcachefs.h:207, from fs/bcachefs/btree_key_cache.c:3: fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text': fs/bcachefs/btree_key_cache.c:1046:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] 1046 \| prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf' 192 \| #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__) \| ^~~~~~~~~~~ fs/bcachefs/btree_key_cache.c:1046:47: note: format string is defined here 1046 \| prt_printf(out, "nonpcpu freelist:\t%lu\r\n", bc->nr_freed_nonpcpu); \| ~~^ \| \| \| long unsigned int \| %u fs/bcachefs/btree_key_cache.c:1047:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=] 1047 \| prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu); \| ^~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} fs/bcachefs/util.h:192:63: note: in definition of macro 'prt_printf' 192 \| #define prt_printf(_out, ...) bch2_prt_printf(_out, __VA_ARGS__) \| ^~~~~~~~~~~ fs/bcachefs/btree_key_cache.c:1047:44: note: format string is defined here 1047 \| prt_printf(out, "pcpu freelist:\t%lu\r\n", bc->nr_freed_pcpu); \| ~~^ \| \| \| long unsigned int \| %u cc1: all warnings being treated as error Use the proper 'size_t' specifier, '%zu', to clear up the warnings for these platforms. Fixes: f2d47ec26af5 ("bcachefs: Btree key cache instrumentation") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Nathan Chancellor	2d288745eb	bcachefs: Fix type of flags parameter for some ->trigger() implementations When building with clang's -Wincompatible-function-pointer-types-strict (a warning designed to catch potential kCFI failures at build time), there are several warnings along the lines of: fs/bcachefs/bkey_methods.c:118:2: error: incompatible function pointer types initializing 'int ()(struct btree_trans , enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, enum btree_iter_update_trigger_flags)' with an expression of type 'int (struct btree_trans *, enum btree_id, unsigned int, struct bkey_s_c, struct bkey_s, unsigned int)' [-Werror,-Wincompatible-function-pointer-types-strict] 118 \| BCH_BKEY_TYPES() \| ^~~~~~~~~~~~~~~~ fs/bcachefs/bcachefs_format.h:394:2: note: expanded from macro 'BCH_BKEY_TYPES' 394 \| x(inode, 8) \ \| ^~~~~~~~~~~~~~~~~~~~~~~~~~ fs/bcachefs/bkey_methods.c:117:41: note: expanded from macro 'x' 117 \| #define x(name, nr) [KEY_TYPE_##name] = bch2_bkey_ops_##name, \| ^~~~~~~~~~~~~~~~~~~~ <scratch space>:277:1: note: expanded from here 277 \| bch2_bkey_ops_inode \| ^~~~~~~~~~~~~~~~~~~ fs/bcachefs/inode.h:26:13: note: expanded from macro 'bch2_bkey_ops_inode' 26 \| .trigger = bch2_trigger_inode, \ \| ^~~~~~~~~~~~~~~~~~ There are several functions that did not have their flags parameter converted to 'enum btree_iter_update_trigger_flags' in the recent unification, which will cause kCFI failures at runtime because the types, while ABI compatible (hence no warning from the non-strict version of this warning), do not match exactly. Fix up these functions (as well as a few other obvious functions that should have it, even if there are no warnings currently) to resolve the warnings and potential kCFI runtime failures. Fixes: 31e4ef3280c8 ("bcachefs: iter/update/trigger/str_hash flag cleanup") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	24b27975a9	bcachefs: Kill gc_init_recurse() This unifies the online and offline btree gc passes; we're not yet running it online. We now iterate over one level of the btree at a time - the same as check_extents_to_backpointers(); this ordering preserves order of keys regardless of btree splits and merges, which will be important when we re-enable online gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:21 -04:00
Kent Overstreet	c451986bf4	bcachefs: do reflink_p repair from BTREE_TRIGGER_check_repair Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f40d13f94d	bcachefs: Run bch2_check_fix_ptrs() via triggers Currently, the reflink_p gc trigger does repair as well - turning a reflink_p key into an error key if the reflink_v it points to doesn't exist. This won't work with online check/repair, because the repair path once online will be subject to transaction restarts, but BTREE_TRIGGER_gc is not idempotant - we can't run it multiple times if we get a transaction restart. So we need to split these paths; to do so this patch calls check_fix_ptrs() by a new general path - a new trigger type, BTREE_TRIGGER_check_repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	930e1a92d6	bcachefs: kill gc looping for bucket gens looping when we change a bucket gen is not ideal - it means we risk failing if we'd go into an infinite loop, and it's better to make forward progress even if fsck doesn't fix everything. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	70e3e039cf	bcachefs: bch2_bucket_ref_update() If we hit an inconsistency when updating allocation information, we don't want to fail the update if it's for a deletion - only if it's for a new key. Rename check_bucket_ref() -> bucket_ref_update() so we can centralize the logic to do this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	9cc455d1bc	bcachefs: Consolidate mark_stripe_bucket() and trans_mark_stripe_bucket() This eliminates some duplicated logic, and the gc path now handles stripe updates and deletions - we need this since soon we're bringing back runtime gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	d930764650	bcachefs: mark_stripe_bucket cleanup Start to work on unifying mark_stripe_bucket() and trans_mark_stripe_bucket(); first, clean up all the unnecessary and gratuitious differences. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	c4e8db2b5d	bcachefs: bucket_data_type_mismatch() We're working on potentially unifying bch2_check_bucket_ref() and bch2_check_fix_ptrs() - or at least eliminating gratuitious differences. Most immediately, there's a bunch of cleanups to be done regarding BCH_DATA_stripe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	b769590f33	bcachefs: Clean up inode alloc There's no need to be using new_inode(); we can skip all that indirection and make the code easier to follow. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f04158290d	bcachefs: journal seq blacklist gc no longer has to walk btree Since btree_ptr_v2, we no longer require the journal seq blacklist table for skipping blacklisted bsets (btree node entries); the pointer to a given node indicates how much data is present. Therefore there's no longer any need for journal seq blacklist gc to walk the btree - we can prune entries older than journal last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	e7f63c67fc	bcachefs: plumb data_type into bch2_bucket_alloc_trans() prep work for making the allocator try to keep btree nodes within the existing member info btree allocated bitmap Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	018b32a63f	bcachefs: Add btree_allocated_bitmap to member_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	5147b9ae76	bcachefs: Btree key cache instrumentation It turns out the btree key cache shrinker wasn't actually reclaiming anything, prior to the previous patch. This adds instrumentation so that if we have further issues we can see what's going on. Specifically, sysfs internal/btree_key_cache is greatly expanded with new counters, and the SRCU sequence numbers of the first 10 entries on each pending freelist, and we also add trigger_btree_key_cache_shrink for testing without having to prune all the system caches. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Matthew Wilcox (Oracle)	e4f2c4dfee	bcachefs: Remove calls to folio_set_error Common code doesn't test the error flag, so we don't need to set it in bcachefs. We can use folio_end_read() to combine the setting (or not) of the uptodate flag and clearing the lock flag. Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Brian Foster <bfoster@redhat.com> Cc: linux-bcachefs@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	103304021e	bcachefs: Move gc of bucket.oldest_gen to workqueue This is a nice cleanup - and we've also been having problems with kthread creation in the mount path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	b25fd02ab4	bcachefs: fix flag printing in journal_buf_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	aef7eecb57	bcachefs: Sync journal when we complete a recovery pass Make things easier when we're debugging long fsck runs - persist the work that successful recovery passes did. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	f7643bc974	bcachefs: make btree read errors silent during scan Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	5a2d15213d	bcachefs: Rip bch2_snapshot_equiv() out of fsck Originally, when deleting snapshots we didn't collapse redundant snapshot nodes; thus, the notion of a class of equivalent snapshot nodes leaked into fsck. Now we do, so snapshot ID equivalence classes are purely local to snapshot deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	9de40d77f0	bcachefs: Check for writing btree_ptr_v2.sectors_written == 0 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	60f2b1bcf5	bcachefs: Add asserts to bch2_dev_btree_bitmap_marked_sectors() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:20 -04:00
Kent Overstreet	427e1bb838	bcachefs: fs_alloc_debug_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	feb255537d	bcachefs: assert that online_reserved == 0 on shutdown Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	fd104e2967	bcachefs: bch2_trans_verify_not_unlocked() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	e590e4e222	bcachefs: bch2_btree_path_can_relock() With the new assertions, we shouldn't be holding locks when trans->locked is false, thus, we shouldn't use relock when we just want to check if we can relock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	650db8a87c	bcachefs: trans->locked Add a field for tracking whether a transaction object holds btree locks, and assertions to verify state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	e2e568bd97	bcachefs: bch2_btree_root_alloc_fake_trans() We're starting to be more strict about transaction locked state, and multiple transactions in a task. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	ca563dccb2	bcachefs: bch2_trans_unlock() must always be followed by relock() or begin() We're about to add new asserts for btree_trans locking consistency, and part of that requires that aren't using the btree_trans while it's unlocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	4984faff5d	bcachefs: Use bch2_btree_path_upgrade() in key cache traverse Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	5d8c9d9428	bcachefs: bch2_btree_path_upgrade() checks nodes_locked, not uptodate In the key cache fill path, we use path_upgrade() on a path that isn't uptodate yet but should be locked. This change makes bch2_btree_path_upgrade() slightly looser so we can use it in key cache upgrade, instead of the __ version. Also, make the related assert - that path->uptodate implies nodes_locked - slightly clearer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	f2d9823f46	bcachefs: maintain lock invariants in btree_iter_next_node() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	449ceafb49	bcachefs: bch2_trans_commit_flags_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	b7f10636d5	bcachefs: prefer drop_locks_do() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	91b5d97fdf	bcachefs: get_unlocked_mut_path -> bch2_path_get_unlocked_mut Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Lukas Bulwahn	d434c2398f	bcachefs: fix typo in reference to BCACHEFS_DEBUG Commit `ec9cc18fc2` ("bcachefs: Add checks for invalid snapshot IDs") intends to check the sanity of a snapshot and panic when BCACHEFS_DEBUG is set, but that conditional has a typo. Fix the typo to refer to the actual existing Kconfig symbol. This was found with ./scripts/checkkconfigsymbols.py. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Ricardo B. Marliere	af3b39b4c6	bcachefs: chardev: make bch_chardev_class constant Since commit `43a7206b09` ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the bch_chardev_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Also, correctly clean up after failing paths in bch2_chardev_init(). Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	2f724563fc	bcachefs: member helper cleanups Some renaming for better consistency bch2_member_exists -> bch2_member_alive bch2_dev_exists -> bch2_member_exists bch2_dev_exsits2 -> bch2_dev_exists bch_dev_locked -> bch2_dev_locked bch_dev_bkey_exists -> bch2_dev_bkey_exists new helper - bch2_dev_safe Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	d155272b6e	bcachefs: bucket_valid() cut out a branch from doing it the obvious way Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	923ed0ae5e	bcachefs: bch2_trans_relock_fail() - factor out slowpath Factor out slowpath into a separate helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	0c0cbfdb84	bcachefs: bch2_dir_emit() - drop_locks_do() conversion Add a new helper that calls dir_emit() and updates ctx->pos on success; this lets us convert bch2_readdir() to drop_locks_do(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:19 -04:00
Kent Overstreet	65bd442397	bcachefs: bch2_btree_insert_trans() no longer specifies BTREE_ITER_cached Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	5dd8c60e1e	bcachefs: iter/update/trigger/str_hash flag cleanup Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	bf5f6a689b	bcachefs: __BTREE_ITER_ALL_SNAPSHOTS -> BTREE_ITER_SNAPSHOT_FIELD Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	c281db0fa5	bcachefs: mark_superblock cleanup Consolidate mark_superblock() and trans_mark_superblock(), like we did with the other trigger paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	ba665494fb	bcachefs: gc_btree_init_recurse() uses gc_mark_node() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	d1adfe4e7e	bcachefs: move root node topo checks to node_check_topology() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	b982d645a4	bcachefs: move topology repair kick to gc_btrees() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	58dda9c10e	bcachefs: kill metadata only gc Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	d1b213a00d	bcachefs: Finish converting reconstruct_alloc to errors_silent with errors_silent, reconstruct_alloc no longer requires fsck and fix_errors to work Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	68e142405c	bcachefs: bch2_gc() is now private to btree_gc.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	665e8b3239	bcachefs: for_each_btree_key_continue() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	a21107eeb1	bcachefs: kill for_each_btree_key_old() Dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kuan-Wei Chiu	0ddb5f0854	bcachefs: Optimize eytzinger0_sort() with bottom-up heapsort This optimization reduces the average number of comparisons required from 2nlog2(n) - 3n + o(n) to nlog2(n) + 0.37n + o(n). When n is sufficiently large, it results in approximately 50% fewer comparisons. Currently, eytzinger0_sort employs the textbook version of heapsort, where during the heapify process, each level requires two comparisons to determine the maximum among three elements. In contrast, the bottom-up heapsort, during heapify, only compares two children at each level until reaching a leaf node. Then, it backtracks from the leaf node to find the correct position. Since heapify typically continues until very close to the leaf node, the standard heapify requires about 2log2(n) comparisons, while the bottom-up variant only needs log2(n) comparisons. The experimental data presented below is based on an array generated by get_random_u32(). \| N \| comparisons(old) \| comparisons(new) \| time(old) \| time(new) \| \|-------\|------------------\|------------------\|-----------\|-----------\| \| 10000 \| 235381 \| 136615 \| 25545 us \| 20366 us \| \| 20000 \| 510694 \| 293425 \| 31336 us \| 18312 us \| \| 30000 \| 800384 \| 457412 \| 35042 us \| 27386 us \| \| 40000 \| 1101617 \| 626831 \| 48779 us \| 38253 us \| \| 50000 \| 1409762 \| 799637 \| 62238 us \| 46950 us \| \| 60000 \| 1721191 \| 974521 \| 75588 us \| 58367 us \| \| 70000 \| 2038536 \| 1152171 \| 90823 us \| 68778 us \| \| 80000 \| 2362958 \| 1333472 \| 104165 us \| 78625 us \| \| 90000 \| 2690900 \| 1516065 \| 116111 us \| 89573 us \| \| 100000\| 3019413 \| `1699879` \| 133638 us \| 100998 us \| Refs: BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small) Ingo Wegener Theoretical Computer Science, 118(1); Pages 81-98, 13 September 1993 https://doi.org/10.1016/0304-3975(93)90364-Y Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	be31bf439c	bcachefs: When traversing to interior nodes, propagate result to paths to same leaf node Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	4dcd90b6d1	bcachefs: Don't read journal just for fsck reading the journal can take a decent amount of time compared to the rest of fsck, let's only read it when required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	19391b9294	bcachefs: allow for custom action in fsck error messages Be more explicit to the user about what we're doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	497c982f05	bcachefs: New assertion for writing to the journal after shutdown Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	00589cadb1	bcachefs: bch2_btree_path_to_text() Long form version of bch2_btree_path_to_text() - useful in error messages and tracepoints. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	5577881455	bcachefs: add btree_node_merging_disabled debug param Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:18 -04:00
Kent Overstreet	ac01928b8e	bcachefs: bch2_hash_lookup() now returns bkey_s_c small cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	6ab71b4a8e	bcachefs: bch2_journal_keys_dump() debug helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	9089376f70	bcachefs: bch2_btree_node_header_to_text() better btree node read path error messages Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	7423330e30	bcachefs: prt_printf() now respects \r\n\t Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	2dcb605e86	bcachefs: printbufs: prt_printf() now handles \t\r\n Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	acce32a51e	bcachefs: printbuf improvements - fix assorted (harmless) off-by-one errors - we were inconsistent on whether out->pos stays <= out->size on overflow; now it does, and printbuf.overflow exists to indicate if a printbuf has overflowed - factor out printbuf_advance_pos() - printbuf_nul_terminate_reserved(); use this to reduce the number of printbuf_make_room() calls Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	62606398d5	bcachefs: Run upgrade/downgrade even in -o nochanges mode We need to be able to test these paths in dry run mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	6d82869185	bcachefs: Better write_super() error messages When a superblock write is silently dropped or it's been modified by another process we need to know which device it was. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 17:29:17 -04:00
Kent Overstreet	74768337de	bcachefs: Fix xattr_to_text() unsafety Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:57:19 -04:00
Kent Overstreet	61692c7812	bcachefs: bch2_bkey_format_field_overflows() Fix another shift-by-64 by factoring out a common helper for bch2_bkey_format_invalid() and bformat_needs_redo() (where it was already fixed). Reported-by: syzbot+9833a1d29d4a44361e2c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:57:19 -04:00
Kent Overstreet	5dfd3746b6	bcachefs: Fix needs_whiteout BUG_ON() in bkey_sort() Btree nodes are log structured; thus, we need to emit whiteouts when we're deleting a key that's been written out to disk. k->needs_whiteout tracks whether a key will need a whiteout when it's deleted, and this requires some careful handling; e.g. the key we're deleting may not have been written out to disk, but it may have overwritten a key that was - thus we need to carry this flag around on overwrites. Invariants: There may be multiple key for the same position in a given node (because of overwrites), but only one of them will be a live (non deleted) key, and only one key for a given position will have the needs_whiteout flag set. Additionally, we don't want to carry around whiteouts that need to be written in the main searchable part of a btree node - btree_iter_peek() will have to skip past them, and this can lead to an O(n^2) issues when doing sequential deletions (e.g. inode rm/truncate). So there's a separate region in the btree node buffer for unwritten whiteouts; these are merge sorted with the rest of the keys we're writing in the btree node write path. The unwritten whiteouts was a later optimization that bch2_sort_keys() didn't take into account; the unwritten whiteouts area means that we never have deleted keys with needs_whiteout set in the main searchable part of a btree node. That means we can simplify and optimize some sort paths, and eliminate an assertion that syzbot found: - Unless we're in the btree node write path, it's always ok to drop whiteouts when sorting - When sorting for a btree node write, we drop the whiteout if it's not from the unwritten whiteouts area, or if it's overwritten by a real key at the same position. This completely eliminates some tricky logic for propagating the needs_whiteout flag: syzbot was able to hit the assertion that checked that there shouldn't be more than one key at the same pos with needs_whiteout set, likely due to a combination of flipping on needs_whiteout on all written keys (they need whiteouts if overwritten), combined with not always dropping unneeded whiteouts, and the tricky logic in the sort path for preserving needs_whiteout that wasn't really needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:56:09 -04:00
Kent Overstreet	5ad1f33c29	bcachefs: Fix sb_clean_validate endianness conversion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-08 14:56:09 -04:00
Linus Torvalds	45db3ab700	five ksmbd server fixes, all also for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmY6/AAACgkQiiy9cAdy T1FJIgv/ZUwoOodrFevFTrRFtQLS/ssRxgX69FEIWXpzHqeU8olsC8P2ywM974ba ATsiLmfpdreBilnW5DCHFLtJwPb1py2KzqlwYYbh7sdU3d+mGGLX6r1ucn9tKNl3 fWNCHUe8Dz3qRaKkpNFmzS3sXaekr/ZT3SsoJyYg/d8Z7fqXsTy7auo2pVXRiYFp TacIaGDc2Tw7fyf6Dt9o9YuSCOmGXaj9pUlTHrW17/cYXDMsQD+UcaNu93uuyZjo i6xvN1npZDec3x2j6a0YV159fWfag4hR7GxtwBEg0Ltzm3XSL5v0ljtFpeNGlehg u0TV5Tcfx8pEtcfaFdHbNXC5ih2vDMN2Yts0K8WGAWbcECs+XlvCJnYyvHGFVequ pCZuUGcrXM+0EqYnVTBMdY7lk3We8HbeZsbGjQA23MG9Bd537sBEdGpsA7ya43nJ kFK/ky8PjQ+BFpweGKL27fNULXZTSu+1D+IP+XgqksxKM5LYzWkvLAyVdUy+aNdA 6+MqIZIs =Ee/V -----END PGP SIGNATURE----- Merge tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "Five ksmbd server fixes, all also for stable - Three fixes related to SMB3 leases (fixes two xfstests, and a locking issue) - Unitialized variable fix - Socket creation fix when bindv6only is set" * tag '6.9-rc7-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: do not grant v2 lease if parent lease key and epoch are not set ksmbd: use rwsem instead of rwlock for lease break ksmbd: avoid to send duplicate lease break notifications ksmbd: off ipv6only for both ipv4/ipv6 binding ksmbd: fix uninitialized symbol 'share' in smb2_tree_connect()	2024-05-08 10:39:53 -07:00
Linus Torvalds	065a057a31	fuse fixes for 6.9 final -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZjsr0QAKCRDh3BK/laaZ PLrpAP9Y1Kz3gSSH1wqDJ9+XzQZdm4dSInMP2Pe47BvSGG2YlAEAwmccoyIoiM58 qvHPETImNxIRTAVZdiBM3W4S3hnzCwc= =SPoy -----END PGP SIGNATURE----- Merge tag 'fuse-fixes-6.9-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "Two one-liner fixes for issues introduced in -rc1" * tag 'fuse-fixes-6.9-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: virtiofs: include a newline in sysfs tag fuse: verify zero padding in fuse_backing_map	2024-05-08 10:33:55 -07:00
Linus Torvalds	fe35bf27a1	Description for this pull request: - Fix xfstests generic/013 test failure with dirsync mount option. - Initialize the reserved fields of deleted file and stream extension dentries to zero. -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmY7WcoWHGxpbmtpbmpl b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCMxTD/9+qFI6cEfe06Xt6RswN/RDMWrZ ZDzUjT7VATLSyjoiaeyJeCaK9/PCrJuX9+vNybq6W0TqfHzIYDmFn7Wg6HjQrZAJ 0XhiaqVwlQ2/UY4yiv7glJRKFsdgJdo3XhFfTWzV5Eaaj65QFHPjlQMo3tOrZzp9 HsO4+DwIFah2uvehKF8numJBXSZ7uoOELHnlL05A3xSmLAxY+HeueqbkQubv1r11 mIIfvmcdxnXlzdpgs1c+a0KXVg/4/0F+SZKYP+JL5x1N2xpc4y0cWsQgrfXY+7Id fPx6CoRYkchfUFGf/LlX/LKchMO/EuK3q3Q17+zoKfgJgdPbp8TkDpfur9iUOxgy 16wyq/iIPKWEFsMYLtqYN/dlNJ+fmVUVDF457VLNYYEFdDQbp8/VosGn4ct0CBQe E1uzwJlv/iUlBNFX679dNxDewAiBtIat2wyAChCauLK6a1bzHCIDpGUlS88ggBAd OLFvQgzRKILqd8fibb2VV46V/CY3R8SmVCzDBixPFmCJtNZas9crd3UXp1xNvPGA LHDnASkpUHSMQoQN0yfMGfvRosQD7wlJYw1mhMlDq35Z2IJg2HKKSESf2axOc5Z0 25AxNZ8xfgjBNiFfDQI0mClliXnz9GTRGt4LqBVS+YHjdbPYqCHNsvJDbR0r1ZM7 OzYIaxTVoTKtYsurgw== =zS+L -----END PGP SIGNATURE----- Merge tag 'exfat-for-6.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Fix xfstests generic/013 test failure with dirsync mount option - Initialize the reserved fields of deleted file and stream extension dentries to zero * tag 'exfat-for-6.9-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: zero the reserved fields of file and stream extension dentries exfat: fix timing of synchronizing bitmap and inode	2024-05-08 10:30:13 -07:00
Mateusz Guzik	7f016edaa0	fscrypt: try to avoid refing parent dentry in fscrypt_file_open Merely checking if the directory is encrypted happens for every open when using ext4, at the moment refing and unrefing the parent, costing 2 atomics and serializing opens of different files. The most common case of encryption not being used can be checked for with RCU instead. Sample result from open1_processes -t 20 ("Separate file open/close") from will-it-scale on Sapphire Rapids (ops/s): before: 12539898 after: 25575494 (+103%) v2: - add a comment justifying rcu usage, submitted by Eric Biggers - whack spurious IS_ENCRYPTED check from the refed case Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240508081400.422212-1-mjguzik@gmail.com Signed-off-by: Eric Biggers <ebiggers@google.com>	2024-05-08 10:28:58 -07:00
Linus Torvalds	f5fcbc8b43	bcachefs fixes for 6.9 - Various syzbot fixes; mainly small gaps in validation - Fix an integer overflow in fiemap() which was preventing filefrag from returning the full list of extents - Fix a refcounting bug on the device refcount, turned up by new assertions in the development branch - Fix a device removal/readd bug; write_super() was repeatedly dropping and retaking bch_dev->io_ref references -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmY6Qz8ACgkQE6szbY3K bnZLRA/9F5dEcNF8mSuVZqJNgzzFAXgL59GZuRMMz5ECJQRTXyHB2c3N2MG6Htxg bJuyT47icTWibIqUrNJubkCaCl9MV9sl3uPRfxF9tVbiOYQg5WhE0UUhprJq8zfi YZ+wlAdPQhPHBgieycF9LIiIzjEGZcYpg8NgCFtdaU9Rxk3aBYyBuD051imvMBqH x0JEibtrIp26u6FScuH5FH5Ro+ysXgw8HZdM0j/9I9WIiYgpya6EbbVqeuXzL3Di scj4vQwA1YoVDw9eUUgXNJq+xD9m6YqJv395imDDWN7sFQm+jGNCossvj0qUKi8m 7QVup6zaO7yNYFJy84/iZCnSC/C7zs1iFJUM6gidRRArkjr7Qw8KAPtIXGRVUM2M 9ogY6Af5u8ie7qVV1NhcULIhCjiOSINUw9uJGYUwv+XtcCfjZb7maBwfvtnFa9VV kQXeoJ/dqVXqCpvnqjQbVej+I8SXCc/s9EPXD2+SHkHzDKAuJkWKzPkQGL3kTSRT 8FPfusF0NDYLTJPOh4MdzuK79YGRQvrPaRv/JyhSAyWsUubACkwmLCyZQUNVAV/f 6WaFoEYCv4coASQNsVnmISPlsoKbLwOtEZDBr14uY9CArKSsCW26QJOKyg4B7tF8 J2DU6sIy+Tzq+TiTkWV5IE/ibQijOIB2/06+5KcM7npsFJldHWs= =rlea -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefs Pull bcachefs fixes from Kent Overstreet: - Various syzbot fixes; mainly small gaps in validation - Fix an integer overflow in fiemap() which was preventing filefrag from returning the full list of extents - Fix a refcounting bug on the device refcount, turned up by new assertions in the development branch - Fix a device removal/readd bug; write_super() was repeatedly dropping and retaking bch_dev->io_ref references * tag 'bcachefs-2024-05-07.2' of https://evilpiepirate.org/git/bcachefs: bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async() bcachefs: Fix race in bch2_write_super() bcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAX bcachefs: Add missing skcipher_request_set_callback() call bcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode() bcachefs: Fix shift-by-64 in bformat_needs_redo() bcachefs: Guard against unknown k.k->type in __bkey_invalid() bcachefs: Add missing validation for superblock section clean bcachefs: Fix assert in bch2_alloc_v4_invalid() bcachefs: fix overflow in fiemap bcachefs: Add a better limit for maximum number of buckets bcachefs: Fix lifetime issue in device iterator helpers bcachefs: Fix bch2_dev_lookup() refcounting bcachefs: Initialize bch_write_op->failed in inline data path bcachefs: Fix refcount put in sb_field_resize error path bcachefs: Inodes need extra padding for varint_decode_fast() bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit() bcachefs: bucket_pos_to_bp_noerror() bcachefs: don't free error pointers bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf()	2024-05-08 10:23:18 -07:00
Allen Pais	4bbf9c3b53	fs/coredump: Enable dynamic configuration of max file note size Introduce the capability to dynamically configure the maximum file note size for ELF core dumps via sysctl. Why is this being done? We have observed that during a crash when there are more than 65k mmaps in memory, the existing fixed limit on the size of the ELF notes section becomes a bottleneck. The notes section quickly reaches its capacity, leading to incomplete memory segment information in the resulting coredump. This truncation compromises the utility of the coredumps, as crucial information about the memory state at the time of the crash might be omitted. This enhancement removes the previous static limit of 4MB, allowing system administrators to adjust the size based on system-specific requirements or constraints. Eg: $ sysctl -a \| grep core_file_note_size_limit kernel.core_file_note_size_limit = 4194304 $ sysctl -n kernel.core_file_note_size_limit 4194304 $echo 519304 > /proc/sys/kernel/core_file_note_size_limit $sysctl -n kernel.core_file_note_size_limit 519304 Attempting to write beyond the ceiling value of 16MB $echo 17194304 > /proc/sys/kernel/core_file_note_size_limit bash: echo: write error: Invalid argument Signed-off-by: Vijay Nag <nagvijay@microsoft.com> Signed-off-by: Allen Pais <apais@linux.microsoft.com> Link: https://lore.kernel.org/r/20240506193700.7884-1-apais@linux.microsoft.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-05-08 09:53:00 -07:00
Ryusuke Konishi	91d743a9c8	nilfs2: make superblock data array index computation sparse friendly Upon running sparse, "warning: dubious: x & !y" is output at an array index calculation within nilfs_load_super_block(). The calculation is not wrong, but to eliminate the sparse warning, replace it with an equivalent calculation. Also, add a comment to make it easier to understand what the unintuitive array index calculation is doing and whether it's correct. Link: https://lkml.kernel.org/r/20240430080019.4242-3-konishi.ryusuke@gmail.com Fixes: `e339ad31f5` ("nilfs2: introduce secondary super block") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Matthew Wilcox (Oracle)	bbf45b7e68	squashfs: remove calls to set the folio error flag Nobody checks the error flag on squashfs folios, so stop setting it. Link: https://lkml.kernel.org/r/20240420025029.2166544-24-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Matthew Wilcox (Oracle)	675f02e5e6	squashfs: convert squashfs_symlink_read_folio to use folio APIs Remove use of page APIs, return the errno instead of 0, switch from kmap_atomic to kmap_local and use folio_end_read() to unify the two exit paths. Link: https://lkml.kernel.org/r/20240420025029.2166544-23-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:28 -07:00
Colin Ian King	f492fb3656	ocfs2: remove redundant assignment to variable status Variable status is being assigned and error code that is never read, it is being assigned inside of a do-while loop. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/ocfs2/dlm/dlmdomain.c:1530:2: warning: Value stored to 'status' is never read [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20240423223018.1573213-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:27 -07:00
Eric Sandeen	36defdd9d7	nilfs2: convert to use the new mount API Convert nilfs2 to use the new mount API. [sandeen@redhat.com: v2] Link: https://lkml.kernel.org/r/33d078a7-9072-4d8e-a3a9-dec23d4191da@redhat.com Link: https://lkml.kernel.org/r/20240425190526.10905-1-konishi.ryusuke@gmail.com [konishi.ryusuke: fixed missing SB_RDONLY flag repair in nilfs_reconfigure] Link: https://lkml.kernel.org/r/33d078a7-9072-4d8e-a3a9-dec23d4191da@redhat.com Link: https://lkml.kernel.org/r/20240424182716.6024-1-konishi.ryusuke@gmail.com Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-08 08:41:27 -07:00
Gao Xiang	d69189428d	erofs: clean up z_erofs_load_full_lcluster() Only four lcluster types here, remove redundant code. No real logic changes. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240508123357.3266173-1-hsiangkao@linux.alibaba.com	2024-05-08 20:36:43 +08:00
Hongzhen Luo	1872df8dcd	erofs: derive fsid from on-disk UUID for .statfs() if possible Use the superblock's UUID to generate the fsid when it's non-null. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240409113022.74720-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:51 +08:00
Chunhai Guo	0f6273ab46	erofs: add a reserved buffer pool for lz4 decompression This adds a special global buffer pool (in the end) for reserved pages. Using a reserved pool for LZ4 decompression significantly reduces the time spent on extra temporary page allocation for the extreme cases in low memory scenarios. The table below shows the reduction in time spent on page allocation for LZ4 decompression when using a reserved pool. The results were obtained from multi-app launch benchmarks on ARM64 Android devices running the 5.15 kernel with an 8-core CPU and 8GB of memory. In the benchmark, we launched 16 frequently-used apps, and the camera app was the last one in each round. The data in the table is the average time of camera app for each round. After using the reserved pool, there was an average improvement of 150ms in the overall launch time of our camera app, which was obtained from the systrace log. +--------------+---------------+--------------+---------+ \| \| w/o page pool \| w/ page pool \| diff \| +--------------+---------------+--------------+---------+ \| Average (ms) \| 3434 \| 21 \| -99.38% \| +--------------+---------------+--------------+---------+ Based on the benchmark logs, 64 pages are sufficient for 95% of scenarios. This value can be adjusted with a module parameter `reserved_pages`. The default value is 0. This pool is currently only used for the LZ4 decompressor, but it can be applied to more decompressors if needed. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402131523.2703948-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:51 +08:00
Chunhai Guo	d6db47e571	erofs: do not use pagepool in z_erofs_gbuf_growsize() Let's use alloc_pages_bulk_array() for simplicity and get rid of unnecessary pagepool. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240402092757.2635257-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:50 +08:00
Chunhai Guo	f36f3010f6	erofs: rename per-CPU buffers to global buffer pool and make it configurable It will cost more time if compressed buffers are allocated on demand for low-latency algorithms (like lz4) so EROFS uses per-CPU buffers to keep compressed data if in-place decompression is unfulfilled. While it is kind of wasteful of memory for a device with hundreds of CPUs, and only a small number of CPUs concurrently decompress most of the time. This patch renames it as 'global buffer pool' and makes it configurable. This allows two or more CPUs to share a common buffer to reduce memory occupation. Suggested-by: Gao Xiang <xiang@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Link: https://lore.kernel.org/r/20240402100036.2673604-1-guochunhai@vivo.com Signed-off-by: Sandeep Dhavale <dhavale@google.com> Link: https://lore.kernel.org/r/20240408215231.3376659-1-dhavale@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:49 +08:00
Chunhai Guo	cacd5b04e2	erofs: rename utils.c to zutil.c Currently, utils.c is only useful if CONFIG_EROFS_FS_ZIP is on. So let's rename it to zutil.c as well as avoid its inclusion if CONFIG_EROFS_FS_ZIP is explicitly disabled. Signed-off-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240401135550.2550043-1-guochunhai@vivo.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-05-08 17:12:49 +08:00
Richard Fung	9fe2a036a2	fuse: Add initial support for fs-verity This adds support for the FS_IOC_ENABLE_VERITY and FS_IOC_MEASURE_VERITY ioctls. The FS_IOC_READ_VERITY_METADATA is missing but from the documentation, "This is a fairly specialized use case, and most fs-verity users won’t need this ioctl." Signed-off-by: Richard Fung <richardfung@google.com> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-05-08 09:31:21 +02:00
Brian Foster	96d88f65ad	virtiofs: include a newline in sysfs tag The internal tag string doesn't contain a newline. Append one when emitting the tag via sysfs. [Stefan] Orthogonal to the newline issue, sysfs_emit(buf, "%s", fs->tag) is needed to prevent format string injection. Signed-off-by: Brian Foster <bfoster@redhat.com> Fixes: `a8f62f50b4` ("virtiofs: export filesystem tags through sysfs") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-05-08 09:31:21 +02:00
Matthew Wilcox (Oracle)	413e8f014c	fuse: Convert fuse_readpages_end() to use folio_end_read() Nobody checks the error flag on fuse folios, so stop setting it. Optimise the (optional) setting of the uptodate flag and clearing of the lock flag by using folio_end_read(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-05-08 09:31:21 +02:00
Baokun Li	dc1c4663bc	ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find() In ext4_xattr_block_cache_find(), when ext4_sb_bread() returns an error, we will either continue to find the next ea block or return NULL to try to insert a new ea block. But whether ext4_sb_bread() returns -EIO or -ENOMEM, the next operation is most likely to fail with the same error. So propagate the error returned by ext4_sb_bread() to make ext4_xattr_block_set() fail to reduce pointless operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Baokun Li	0c0b4a49d3	ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find() Syzbot reports a warning as follows: ============================================ WARNING: CPU: 0 PID: 5075 at fs/mbcache.c:419 mb_cache_destroy+0x224/0x290 Modules linked in: CPU: 0 PID: 5075 Comm: syz-executor199 Not tainted 6.9.0-rc6-gb947cc5bf6d7 RIP: 0010:mb_cache_destroy+0x224/0x290 fs/mbcache.c:419 Call Trace: <TASK> ext4_put_super+0x6d4/0xcd0 fs/ext4/super.c:1375 generic_shutdown_super+0x136/0x2d0 fs/super.c:641 kill_block_super+0x44/0x90 fs/super.c:1675 ext4_kill_sb+0x68/0xa0 fs/ext4/super.c:7327 [...] ============================================ This is because when finding an entry in ext4_xattr_block_cache_find(), if ext4_sb_bread() returns -ENOMEM, the ce's e_refcnt, which has already grown in the __entry_find(), won't be put away, and eventually trigger the above issue in mb_cache_destroy() due to reference count leakage. So call mb_cache_entry_put() on the -ENOMEM error branch as a quick fix. Reported-by: syzbot+dd43bd0f7474512edc47@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=dd43bd0f7474512edc47 Fixes: `fb265c9cb4` ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240504075526.2254349-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:59:18 -04:00
Colin Ian King	8b57de1c5e	jbd2: remove redundant assignement to variable err The variable err is being assigned a value that is never read, it is being re-assigned inside the following while loop and also after the while loop. The assignment is redundant and can be removed. Cleans up clang scan build warning: fs/jbd2/commit.c:574:2: warning: Value stored to 'err' is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240410112803.232993-1-colin.i.king@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:52:20 -04:00
Zhang Yi	df0b5afc62	ext4: remove the redundant folio_wait_stable() __filemap_get_folio() with FGP_WRITEBEGIN parameter has already wait for stable folio, so remove the redundant folio_wait_stable() in ext4_da_write_begin(), it was left over from the commit `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write") that removed the retry getting page logic. Fixes: `cc883236b7` ("ext4: drop unnecessary journal handle in delalloc write") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240419023005.2719050-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:48:04 -04:00
Dan Carpenter	3f4830abd2	ext4: fix potential unnitialized variable Smatch complains "err" can be uninitialized in the caller. fs/ext4/indirect.c:349 ext4_alloc_branch() error: uninitialized symbol 'err'. Set the error to zero on the success path. Fixes: `8016e29f43` ("ext4: fast commit recovery path") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/363a4673-0fb8-4adf-b4fb-90a499077276@moroto.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:44:40 -04:00
Matthew Wilcox (Oracle)	c84f1510fb	ext4: convert ac_buddy_page to ac_buddy_folio This just carries around the bd_buddy_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-6-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:17 -04:00
Matthew Wilcox (Oracle)	ccedf35b5d	ext4: convert ac_bitmap_page to ac_bitmap_folio This just carries around the bd_bitmap_folio so should also be a folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-5-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:14 -04:00
Matthew Wilcox (Oracle)	e1622a0d55	ext4: convert ext4_mb_init_cache() to take a folio All callers now have a folio, so convert this function from operating on a page to operating on a folio. The folio is assumed to be a single page. Signe-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:10 -04:00
Matthew Wilcox (Oracle)	5eea586b47	ext4: convert bd_buddy_page to bd_buddy_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-3-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:38:07 -04:00
Matthew Wilcox (Oracle)	99b150d84e	ext4: convert bd_bitmap_page to bd_bitmap_folio There is no need to make this a multi-page folio, so leave all the infrastructure around it in pages. But since we're locking it, playing with its refcount and checking whether it's uptodate, it needs to move to the folio API. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240416172900.244637-2-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-07 15:37:46 -04:00
Dan Carpenter	0e39c9e524	btrfs: qgroup: fix initialization of auto inherit array The "i++" was accidentally left out so it just sets qgids[0] over and over. This can lead to unexpected problems, as the groups[1:] would be all 0, leading to later find_qgroup_rb() unable to find a qgroup and cause snapshot creation failure. Fixes: `5343cd9364` ("btrfs: qgroup: simple quota auto hierarchy for nested subvolumes") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:11 +02:00
Matthew Wilcox (Oracle)	bc00965dbf	btrfs: count super block write errors in device instead of tracking folio error state Currently the error status of super block write is tracked in page/folio status bit Error. For that we need to keep the reference for the whole duration of write and wait. Count the number of superblock writeback errors in the btrfs_device. That means we don't need the folio to stay around until it's waited for, and can avoid the extra call to folio_get/put. Also remove a mention of PageError in a comment as it's the last mention of the page Error state. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:11 +02:00
Matthew Wilcox (Oracle)	617fb10ea8	btrfs: use the folio iterator in btrfs_end_super_write() Iterate over folios instead of bvecs. Switch the order of unlock and put to be the usual order; we know this folio can't be put until it's been waited for, but that's fragile. Remove the calls to ClearPageUptodate / SetPageUptodate -- if PAGE_SIZE is larger than BTRFS_SUPER_INFO_SIZE, we'd be marking the entire folio uptodate without having actually initialised all the bytes in the page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Matthew Wilcox (Oracle)	f93ee0df51	btrfs: convert super block writes to folio in write_dev_supers() This is a direct conversion from pages to folios, assuming single page folio. Also removes some calls to obsolete APIs and some hidden calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Matthew Wilcox (Oracle)	c94b7349b8	btrfs: convert super block writes to folio in wait_dev_supers() This is a direct conversion from pages to folios, assuming single page folio. Also removes a few calls to compound_head() and calls to obsolete APIs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Thorsten Blum	58a774ca16	btrfs: remove duplicate included header from fs.h Remove duplicate included header file linux/blkdev.h . Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	6b0a63a4fa	btrfs: add a cached state to extent_clear_unlock_delalloc Now that we have the lock_extent tightly coupled with extent_clear_unlock_delalloc we can add a cached state to extent_clear_unlock_delalloc and benefit from skipping the extra lookup when we're doing cow. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	8325f41a56	btrfs: push extent lock down in submit_one_async_extent We don't need to include the time we spend in the allocator under our extent lock protection, move it after the allocator and make sure we lock the extent in the error case to ensure we're not clearing these bits without the extent lock held. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	d456c25dbb	btrfs: push lock_extent down in cow_file_range() Now that we've got the extent lock pushed into cow_file_range() we can push it further down into the allocation loop. This allows us to only hold the extent lock during the dropping of the extent map range and inserting the ordered extent. This makes the error case a little trickier as we'll now have to lock the range before clearing any of the other extent bits for the range, but this is the error path so is less performance critical. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	cd241a8f55	btrfs: move can_cow_file_range_inline() outside of the extent lock These checks aren't reliant on the extent lock. Move this up into cow_file_range_inline(), and then update encoded writes to call this check before calling __cow_file_range_inline(). This will allow us to skip the extent lock if we're not able to inline the given extent. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	0ab540995a	btrfs: push lock_extent into cow_file_range_inline Now that we've pushed the lock_extent() into cow_file_range() we can push the extent locking into cow_file_range_inline() and move the lock_extent in cow_file_range() to after we call cow_file_range_inline(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	a0766d8f35	btrfs: push extent lock into cow_file_range Now that cow_file_range is the only function that is called with the range locked, push this call into cow_file_range so we can further narrow the scope. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:10 +02:00
Josef Bacik	00009d7bcb	btrfs: push extent lock into run_delalloc_cow This is used by zoned but also as the fallback for uncompressed extents when we fail to compress the ranges. Push the extent lock into run_dealloc_cow(), and adjust the compression case to take the extent lock after calling run_delalloc_cow(). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0e128d4e41	btrfs: remove unlock_extent from run_delalloc_compressed Since we immediately unlock the extent range when we enter run_delalloc_compressed() simply move the lock_extent() down to cover cow_file_range() and then remove the unlock_extent() from run_delalloc_compressed. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	aa56b0aa91	btrfs: push extent lock down in run_delalloc_nocow run_delalloc_nocow is a little special because we use the file extents to see if we can nocow a range. We don't actually need the protection of the extent lock to look at the file extents at this point however. We are currently holding the page lock for this range, so we are protected from anybody who would simultaneously be modifying the file extent items for this range. * mmap() - we're holding the page lock. * buffered writes - we're holding the page lock. * direct writes - we're holding the page lock and direct IO has to flush page cache before it's able to continue. * fallocate() - all callers flush the range and wait on ordered extents while holding the inode lock and the mmap lock, so we are again saved by the page lock. We want to use the extent lock to protect 1) The mapping tree for the given range. 2) The ordered extents for the given range. 3) The io_tree for the given range. Push the extent lock down to cover these operations. In the fallback_to_cow() case we simply lock before doing anything and rely on the cow_file_range() helper to handle it's range properly. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0ed30c17f6	btrfs: adjust while loop condition in run_delalloc_nocow We have the following pattern while (1) { if (cur_offset > end) break; } Which is just while (cur_offset <= end) { ... } so adjust the code to be more clear. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	7c9acd440f	btrfs: push extent lock into run_delalloc_nocow run_delalloc_nocow is a bit special as it walks through the file extents for the inode and determines what it can nocow and what it can't. This is the more complicated area for extent locking, so start with this function. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	c0707c9e1e	btrfs: push the extent lock into btrfs_run_delalloc_range We want to limit the scope of the extent lock to be around operations that can change in flight. Currently we hold the extent lock through the entire writepage operation, which isn't really necessary. We want to protect to make sure nobody has updated DELALLOC. In find_lock_delalloc_range we must lock the range in order to validate the contents of our io_tree. However once we've done that we're safe to unlock the range and continue, as we have the page lock already held for the range. We are protected from all operations at this point. * mmap() - we're holding the page lock, thus are protected. * buffered writes - again, we're protected because we take the page lock for the first and last page in our range for buffered writes so we won't create new delalloc ranges in this area. * direct IO - we invalidate pagecache before attempting to write a new area, which requires the page lock, so again are protected once we're holding the page lock on this range. Additionally this behavior actually already exists for compressed, we unlock the range as soon as we start to process the async extents, and re-lock it during compression. So this is completely safe, and makes the locking more consistent. Make this simple by just pushing the extent lock into btrfs_run_delalloc_range. From there followup patches will push the lock further down into its users. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	7034674b8a	btrfs: lock extent when doing inline extent in compression We currently don't lock the extent when we're doing a cow_file_range_inline() for a compressed extent. This isn't a problem necessarily, but it's inconsistent with the rest of our usage of cow_file_range_inline(). This also leads to some extra weird logic around whether the extent is locked or not. Fix this to lock the extent before calling cow_file_range_inline() in compression to make it consistent with the rest of the inline users. In future patches this will be pushed down into the cow_file_range_inline() helper, so we're fine with the quick and dirty locking here. This patch exists to make the behavior change obvious. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0586d0a89e	btrfs: move extent bit and page cleanup into cow_file_range_inline We duplicate the extent cleanup for cow_file_range_inline() in the cow and compressed case. The encoded case doesn't need to do cleanup the same way, so rename cow_file_range_inline to __cow_file_range_inline and then make cow_file_range_inline handle the extent cleanup appropriately, and update the callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	0332967b4d	btrfs: unlock all the pages with successful inline extent creation Since `4750af3bbe` ("btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()") we have been unlocking the locked page manually instead of via extent_clear_unlock_delalloc() because of subpage blocksize support. However we actually disable inline extent creation for subpage blocksize support, so this behavior isn't necessary. Remove this code and comment, if at some point the subpage blocksize code grows support for inline extents this can be re-evaluated. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	6eecfa2240	btrfs: push all inline logic into cow_file_range Currently we have a lot of duplicated checks of if (start == 0 && fs_info->sectorsize == PAGE_SIZE) cow_file_range_inline(); Instead of duplicating this check everywhere, consolidate all of the inline extent logic into a helper which documents all of the checks and then use that helper inside of cow_file_range_inline(). With this we can clean up all of the calls to either unconditionally call cow_file_range_inline(), or at least reduce the checks we're doing before we call cow_file_range_inline(); Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Josef Bacik	aa5ccf2917	btrfs: handle errors in btrfs_reloc_clone_csums properly In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:09 +02:00
Qu Wenruo	e98bf64f7a	btrfs: add extra sanity checks for create_io_em() The function create_io_em() is called before we submit an IO, to update the in-memory extent map for the involved range. This patch changes the following aspects: - Does not allow BTRFS_ORDERED_NOCOW type For real NOCOW (excluding NOCOW writes into preallocated ranges) writes, we never call create_io_em(), as we does not need to update the extent map at all. So remove the sanity check allowing BTRFS_ORDERED_NOCOW type. - Add extra sanity checks * PREALLOC - @block_len == len For uncompressed writes. * REGULAR - @block_len == @orig_block_len == @ram_bytes == @len We're creating a new uncompressed extent, and referring all of it. - @orig_start == @start We haven no offset inside the extent. * COMPRESSED - valid @compress_type - @len <= @ram_bytes This is to co-operate with encoded writes, which can cause a new file extent referring only part of a uncompressed extent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Qu Wenruo	4bdc558bf9	btrfs: simplify the inline extent map creation With the tree-checker ensuring all inline file extents starts at file offset 0 and has a length no larger than sectorsize, we can simplify the calculation to assigned those fixes values directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Qu Wenruo	319d91ee72	btrfs: add extra comments on extent_map members The extent_map structure is very critical to btrfs, as it is involved for both read and write paths. Unfortunately the structure is not properly explained, making it pretty hard to understand nor to do further improvement. This patch adds extra comments explaining the major members based on my code reading. Hopefully we can find more members to cleanup in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Naohiro Aota	30704a0d56	btrfs: drop unused argument of calcu_metadata_size() calcu_metadata_size() has a "reserve" argument, but the only caller always set it to "1". The other usage (reserve = 0) is dropped by a commit `0647bf564f` ("Btrfs: improve forever loop when doing balance relocation"), which is more than 10 years ago. Drop the argument and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	33a44f3760	btrfs: simplify return variables in btrfs_drop_subtree() There's another return variable wret that is only passed to ret on error, we can simply use ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	1618aa3c2e	btrfs: simplify return variables in lookup_extent_data_ref() First, drop err instead reuse ret, choose to return the error instead of goto fail and then return the same error. Do not initialize the ret until where it has to be initialized. Slight logic change in handling the btrfs_search_slot() and btrfs_next_leaf() return value. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	6e812a9c65	btrfs: rename return variables in btrfs_qgroup_rescan_worker() Rename ret to ret2 compile and then err to ret. Also, new ret2 is found to be localized within the 'if (trans)' statement, so move its declaration there. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	5e8fb9b84b	btrfs: drop variable err in quick_update_accounting() In quick_update_accounting() err is used as 2nd return value, which could be achieved just with ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	acde0e8609	btrfs: reuse ret instead of err in relocate_tree_blocks() Coding style fixes the function relocate_tree_blocks(). After the fix, ret is the return value variable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	2daca1e419	btrfs: rename err and ret to ret in build_backref_tree() Code style fix in the function build_backref_tree(). Drop the ret initialization 0, as we don't need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	1e8a42375f	btrfs: rename werr and err to ret in __btrfs_wait_marked_extents() Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations in the function for better function space alignment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:08 +02:00
Anand Jain	ce87531120	btrfs: rename werr and err to ret in btrfs_write_marked_extents() Rename the function's local variable werr and err to ret. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Anand Jain	9a7b68d32a	btrfs: report filemap_fdata<write\|wait>_range() error In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error if when filemap_fdata<write\|wait>_range() fails. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
David Sterba	fef998d1a0	btrfs: use btrfs_is_testing() everywhere There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a wrapper for that that's a compile-time constant when self-tests are not built in. As this is only for development we can save some bytes and conditions on release configs by using the helper in the remaining cases. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	905a95f3dd	btrfs: initialize delayed inodes xarray without GFP_ATOMIC There's no need to initialize the delayed inodes xarray with a GFP_ATOMIC flag because that actually does nothing on the xarray operations. That was needed for radix trees, but for xarrays the allocation flags are passed as the last argument to xa_store() (which we are using correctly). So initialize the delayed inodes xarray with a simple xa_init(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	de6f14e83e	btrfs: make try_release_extent_mapping() return a bool Currently try_release_extent_mapping() as an int return type, but we use it as a boolean. Its only caller, the release folio callback, also returns a boolean which corresponds to try_release_extent_mapping()'s return value. So change its return value type to bool as well as its helper try_release_extent_state(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	2e504418e4	btrfs: be better releasing extent maps at try_release_extent_mapping() At try_release_extent_mapping(), called during the release folio callback (btrfs_release_folio() callchain), we don't release any extent maps in the range if the GFP flags don't allow blocking. This behaviour is exaggerated because: 1) Both searching for extent maps and removing them are not blocking operations. The only thing that it is the cond_resched() call at the end of the loop that searches for and removes extent maps; 2) We currently only operate on a single page, so for the case where block size matches the page size, we can only have one extent map, and for the case where the block size is smaller than the page size, we can have at most 16 extent maps. So it's very unlikely the cond_resched() call will ever block even in the block size smaller than page size scenario. So instead of not removing any extent maps at all in case the GFP glags don't allow blocking, keep removing extent maps while we don't need to reschedule. This makes it safe for the subpage case and for a future where we can process folios with a size larger than a page. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	433a3e01dd	btrfs: remove i_size restriction at try_release_extent_mapping() Currently we don't attempt to release extent maps if the inode has an i_size that is not greater than 16M. This condition was added way back in 2008 by commit `70dec8079d` ("Btrfs: extent_io and extent_state optimizations"), without any explanation about it. A quick chat with Chris on slack revealed that the goal was probably to release the extent maps for small files only when closing the inode. This however can be harmful in case we have tons of such files being kept open for very long periods of time, since we will consume more and more pages for extent maps. So remove the condition. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	85d288309a	btrfs: use btrfs_get_fs_generation() at try_release_extent_mapping() Nowadays we have the btrfs_get_fs_generation() to get the current generation of the filesystem, so there's no need anymore to lock the transaction spinlock to read it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	078b981aaa	btrfs: rename some variables at try_release_extent_mapping() Rename the following variables: 1) "btrfs_inode" to "inode", because it's shorter to type and clear, and we don't have a VFS inode here as well, so there's no confusion; 2) "tree" to "io_tree", to be clear which tree we are dealing with, since we use 2 different trees in the function; 3) "map" to "extent_tree" since "map" gives the idea we are dealing with an extent map for example, but we are dealing with the inode's extent tree (the tree which stores extent maps). These also make the next patches simpler. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	0d89a15e1a	btrfs: add tracepoints for extent map shrinker events Add some tracepoints for the extent map shrinker to help debug and analyse main events. These have proved useful during development of the shrinker. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	65bb9fb00b	btrfs: update comment for btrfs_set_inode_full_sync() about locking Nowadays we have a lock used to synchronize mmap writes with reflink and fsync operations (struct btrfs_inode::i_mmap_lock), so update the comment for btrfs_set_inode_full_sync() to mention that it can also be called while holding that mmap lock. Besides being a valid alternative to the inode's VFS lock, we already have the extent map shrinker using that mmap lock instead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:07 +02:00
Filipe Manana	956a17d9d0	btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit `70dec8079d` ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	f1d97e7691	btrfs: add a global per cpu counter to track number of used extent maps Add a per cpu counter that tracks the total number of extent maps that are in extent trees of inodes that belong to fs trees. This is going to be used in an upcoming change that adds a shrinker for extent maps. Only extent maps for fs trees are considered, because for special trees such as the data relocation tree we don't want to evict their extent maps which are critical for the relocation to work, and since those are limited, it's not a concern to have them in memory during the relocation of a block group. Another case are extent maps for free space cache inodes, which must always remain in memory, but those are limited (there's only one per free space cache inode, which means one per block group). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	5fa8a6baff	btrfs: pass the extent map tree's inode to try_merge_map() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to try_merge_map(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change try_merge_map() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	e778724a5e	btrfs: pass the extent map tree's inode to setup_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to setup_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change setup_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	6a3a9113ae	btrfs: pass the extent map tree's inode to replace_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to replace_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change replace_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	c2fbd812d7	btrfs: pass the extent map tree's inode to remove_extent_mapping() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to remove_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change remove_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	002f3a2ce8	btrfs: pass the extent map tree's inode to clear_em_logging() Extent maps are always associated to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to clear_em_logging(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change clear_em_logging() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Filipe Manana	6c566def95	btrfs: pass the extent map tree's inode to add_extent_mapping() Extent maps are always added to an inode's extent map tree, so there's no need to pass the extent map tree explicitly to add_extent_mapping(). In order to facilitate an upcoming change that adds a shrinker for extent maps, change add_extent_mapping() to receive the inode instead of its extent map tree. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Josef Bacik	e094f48040	btrfs: change root->root_key.objectid to btrfs_root_id() A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 \| - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Josef Bacik	53e2415868	btrfs: set start on clone before calling copy_extent_buffer_full Our subpage testing started hanging on generic/560 and I bisected it down to `1cab1375ba` ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations"). This is subtle because we use eb->start to figure out where in the folio we're copying to when we're subpage, as our ->start may refer to an area inside of the folio. For example, assume a 16K page size machine with a 4K node size, and assume that we already have a cloned extent buffer when we cloned the previous search. copy_extent_buffer_full() will do the following when copying the extent buffer path->nodes[0] (src) into cloned (dest): src->start = 8k; // this is the new leaf we're cloning cloned->start = 4k; // this is left over from the previous clone src_addr = folio_address(src->folios[0]); dest_addr = folio_address(dest->folios[0]); memcpy(dest_addr + get_eb_offset_in_folio(dst, 0), src_addr + get_eb_offset_in_folio(src, 0), src->len); Now get_eb_offset_in_folio() is where the problems occur, because for sub-pagesize blocksize we can have multiple eb's per folio, the code for this is as follows size_t get_eb_offset_in_folio(eb, offset) { return (eb->start + offset & (folio_size(eb->folio[0]) - 1)); } So in the above example we are copying into offset 4K inside the folio. However once we update cloned->start to 8K to match the src the math for get_eb_offset_in_folio() changes, and any subsequent reads (i.e. btrfs_item_key_to_cpu()) will start reading from the offset 8K instead of 4K where we copied to, giving us garbage. Fix this by setting start before we co copy_extent_buffer_full() to make sure that we're copying into the same offset inside of the folio that we will read from later. All other sites of copy_extent_buffer_full() are correct because we either set ->start beforehand or we simply don't change it in the case of the tree-log usage. With this fix we now pass generic/560 on our subpage tests. Fixes: `1cab1375ba` ("btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations") Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:06 +02:00
Josef Bacik	99f2be1522	btrfs: replace btrfs_delayed__ref with btrfs__ref Now that these two structs are the same, move the btrfs_data_ref and btrfs_tree_ref up and use these in the btrfs_delayed_ref_node. Then remove the btrfs_delayed_*_ref structs. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	7f6af7c434	btrfs: remove the btrfs_delayed_ref_node container helpers Now that we don't use these helpers anywhere, remove them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	efc7d5dbf8	btrfs: stop referencing btrfs_delayed_tree_ref directly We only ever need to use this to get the level of the tree block ref, so use the btrfs_delayed_ref_owner() helper, which returns the level for the given reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	44cc2e38e6	btrfs: stop referencing btrfs_delayed_data_ref directly Now that most of our elements are inside of btrfs_delayed_ref_node directly and we have helpers for the delayed_data_ref bits, go ahead and remove all direct usage of btrfs_delayed_data_ref and use the helpers where needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	b4b5934ac1	btrfs: make the insert backref helpers take a btrfs_delayed_ref_node We don't need to pass in all the elements for the backrefs as function arguments, simply pass through the btrfs_delayed_ref_node and then extract the values we need from that. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	85bb9f544e	btrfs: drop unnecessary arguments from __btrfs_free_extent We have all the information we need in our btrfs_delayed_ref_node, which we already pass into __btrfs_free_extent. Drop the extra arguments and just extract the values from btrfs_delayed_ref_node. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	a502f112ad	btrfs: make __btrfs_inc_extent_ref take a btrfs_delayed_ref_node We're just extracting the values from btrfs_delayed_ref_node and passing them through, simply pass the btrfs_delayed_ref_node into __btrfs_inc_extent_ref and shrink the function arguments. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	5366763446	btrfs: rename btrfs_data_ref->ino to ->objectid This is how we refer to it in the rest of the extent reference related code, make it consistent. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	cf4f04325b	btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	12390e42b6	btrfs: rename ->len to ->num_bytes in btrfs_ref We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	f75464f7bb	btrfs: unify the btrfs_add_delayed_*_ref helpers into one helper Now that these helpers are identical, create a helper function that handles everything properly and strip the individual helpers down to use just the common helper. This cleans up a significant amount of duplicated code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:05 +02:00
Josef Bacik	1bff6d4f87	btrfs: simplify delayed ref tracepoints Now that all of the delayed ref information is in the delayed ref node, drastically simplify the delayed ref tracepoints by simply passing in the btrfs_delayed_ref_node and populating the tracepoints with the values from the structure itself. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0ea4703cc2	btrfs: move ref specific initialization into init_delayed_ref_common Now that the btrfs_delayed_ref_node contains a union of the data and metadata specific information we can move the initialization into init_delayed_ref_common and just use the btrfs_ref to initialize the correct fields of the reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0509cc5661	btrfs: initialize btrfs_delayed_ref_head with btrfs_ref We are calling init_delayed_ref_head with all of the elements from btrfs_ref, clean this up to simply pass in the btrfs_ref and initialize the btrfs_delayed_ref_head with the values from the btrfs_ref directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	da3c548541	btrfs: pass btrfs_ref to init_delayed_ref_common We're extracting all of these values from the btrfs_ref we passed in already, just pass the btrfs_ref through to init_delayed_ref_common and get the values directly from the struct. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	f2e69a77aa	btrfs: move ref_root into btrfs_ref We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	4d09b4e942	btrfs: do not use a function to initialize btrfs_ref btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	d3fbb00f5e	btrfs: embed data_ref and tree_ref in btrfs_delayed_ref_node We have been embedding btrfs_delayed_ref_node in the btrfs_delayed_data_ref and btrfs_delayed_tree_ref, and then we have two sets of cachep's and a variety of handling that is awkward because of this separation. Instead union these two members inside of btrfs_delayed_ref_node and make that the first class object. This allows us to go down to one cachep for our delayed ref nodes instead of two. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Josef Bacik	0eea355fc0	btrfs: add a helper to get the delayed ref node from the data/tree ref We have several different ways we refer to references throughout the code and it's not consistent and there's a bit of duplication. In order to clean this up I want to have one structure we use to define reference information, and one structure we use for the delayed reference information. Start this process by adding a helper to get from the btrfs_delayed_data_ref/btrfs_delayed_tree_ref to the btrfs_delayed_ref_node so that it'll make moving these structures around simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	26c0fae3e7	btrfs: use btrfs_find_first_inode() at btrfs_prune_dentries() Currently btrfs_prune_dentries() has open code to find the first inode in a root with a minimum inode number. Remove that code and make it use the helper btrfs_find_first_inode() for that task. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	5e485ac6f0	btrfs: export find_next_inode() as btrfs_find_first_inode() Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	ed48adf83e	btrfs: simplify add_extent_mapping() by removing pointless label The add_extent_mapping() function is short and trivial, there's no need to have a label for a quick exit in case of an error, even because there's no error handling needed, we just need to return the error. So remove that label and return directly. Also while at it remove the redundant initialization of 'ret', as that may help avoid some warnings with clang tools such as the one reported/fixed by commit `966de47ff0` ("btrfs: remove redundant initialization of variables in log_new_ancestors"). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:04 +02:00
Filipe Manana	071533da5f	btrfs: tests: error out on unexpected extent map reference count In the extent map self tests, when freeing all extent maps from a test extent map tree we are not expecting to find any extent map with a reference count different from 1 (the tree reference). If we find any, we just log a message but we don't fail the test, which makes it very easy to miss any bug/regression - no one reads the test messages unless a test fails. So change the behaviour to make a test fail if we find an extent map in the tree with a reference count different from 1. Make the failure happen only after removing all extent maps, so that we don't leak memory. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	0a308f8095	btrfs: pass an inode to btrfs_add_extent_mapping() Instead of passing fs_info and extent map tree arguments to btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps are always inserted in the extent map tree of an inode, and the fs_info can be extracted from the inode (inode->root->fs_info). The only exception is in the self tests where we allocate an extent map tree and then use it to insert/update/remove extent maps. However the tests can be changed to use a test inode and then use the inode's extent map tree. So change btrfs_add_extent_mapping() to have an inode as an argument instead of a fs_info and an extent map tree. This reduces the number of parameters and will also be needed for an upcoming change. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	236e3107fc	btrfs: open code csum_exist_in_range() The csum_exist_in_range() function is now too trivial and is only used in one place, so open code it in its single caller. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	8d2a83a97f	btrfs: make NOCOW checks for existence of checksums in a range more efficient Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller. But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers. So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't found any, and a negative value in case of an error. The following test with fio was used to measure performance: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bssplit=4k/20:8k/20:16k/20:32k/20:64k/20 direct=1 numjobs=16 fallocate=posix time_based runtime=300 [file1] size=8G ioengine=io_uring iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run on a release kernel (Debian's default kernel config). The results before this patch: WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec The results after this patch: WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	fb90e1caf0	btrfs: simplify error path for btrfs_lookup_csums_list() In the error path we have this while loop that keeps iterating over the csums of the list and then delete them from the list and free them, testing for an error (ret < 0) and list emptyness as the conditions of the while loop. Simplify this by using list_for_each_entry_safe() so there's no need to delete elements from the list and need to test the error condition on each iteration. Also rename the 'fail' label to 'out' since the label is not exclusive to a failure path, as we also end up there when the function succeeds, and it's also a more common label name. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	c0dce8b6a3	btrfs: remove use of a temporary list at btrfs_lookup_csums_list() There's no need to use a temporary list to add the checksums, we can just add them to input list and then on error delete and free any checksums that were added. So simplify and remove the temporary list. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	afcb80624f	btrfs: remove search_commit parameter from btrfs_lookup_csums_list() All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	d800a9065b	btrfs: add function comment to btrfs_lookup_csums_list() Add a function comment to btrfs_lookup_csums_list() to document it. With another upcoming change its parameter list and return value will be less obvious. So add the documentation now so that it can be updated where needed later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	0ddefc2a7c	btrfs: move btrfs_page_mkwrite() from inode.c into file.c btrfs_page_mkwrite() is a struct vm_operations_struct callback and we define that structure in file.c. Currently the function is in inode.c and has to be exported to be used in file.c, which makes no sense because it's not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into file.c. While at it do a few minor style changes: 1) Capitalize the first word of every comment and end each sentence with punctuation; 2) Avoid splitting some statements into two lines when everything fits in 85 characters or less. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	590e2c4a1e	btrfs: remove no longer used btrfs_clone_chunk_map() There are no more users of btrfs_clone_chunk_map(), the last one (and only one ever) was removed in commit `1ec17ef591` ("btrfs: zoned: fix use-after-free in do_zone_finish()"). So remove btrfs_clone_chunk_map(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	606a1c5de1	btrfs: remove list_empty() check at warn_about_uncommitted_trans() At warn_about_uncommitted_trans(), there's no need to check if the list is empty and return, because list_for_each_entry_safe() is safe to call for an empty list, it simply does nothing. So remove the check. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:03 +02:00
Filipe Manana	47f6944877	btrfs: remove pointless return value assignment at btrfs_finish_one_ordered() At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret' variable because if it has a non-zero value (error), we have already jumped to the 'out' label. So remove that redundant assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Filipe Manana	2e438442ba	btrfs: remove not needed mod_start and mod_len from struct extent_map The mod_start and mod_len fields of struct extent_map were introduced by commit `4e2f84e63d` ("Btrfs: improve fsync by filtering extents that we want") in order to avoid too low performance when fsyncing a file that keeps getting extent maps merge, because it resulted in each fsync logging again csum ranges that were already merged before. We don't need this anymore as extent maps in the list of modified extents are never merged with other extent maps and once we log an extent map we remove it from the list of modified extent maps, so it's never logged twice. So remove the mod_start and mod_len fields from struct extent_map and use instead the start and len fields when logging checksums in the fast fsync path. This also makes EXTENT_FLAG_FILLING unused so remove it as well. Running the reproducer from the commit mentioned before, with a larger number of extents and against a null block device, so that IO is fast and we can better see any impact from searching checksums items and logging them, gave the following results from dd: Before this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s After this change: 409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s So no changes in throughput. The test was done in a release kernel (non-debug, Debian's default kernel config) and its steps are the following: $ mkfs.btrfs -f /dev/nullb0 $ mount /dev/sdb /mnt $ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync $ umount /mnt This also reduces the size of struct extent_map from 128 bytes down to 112 bytes, so now we can have 36 extents maps per 4K page instead of 32. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Boris Burkov	5f2fb819f6	btrfs: free PERTRANS at the end of cleanup_transaction() Some of the operations after the free might convert more PERTRANS metadata. Do the freeing as late as possible to eliminate a source of leaked PERTRANS metadata. This helps with the pass rate of generic/269 and generic/475. Reviewed-by: Qu Wenruo <qwu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	400b172b8c	btrfs: compression: migrate compression/decompression paths to folios For both compression and decompression paths, we always require a "struct page **pages" and "unsigned long nr_pages", this involves quite some part of the btrfs compression paths: - All the compression entry points - compressed_bio structure This affects both compression and decompression. - async_extent structure Unfortunately with all those involved parts, there is no good way to split the conversion into smaller patches while still passing compiling. So do this in one big conversion in one go. Please note this is direct page->folio conversion, no change on the page sized folio requirement yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	11e03f2f4b	btrfs: introduce btrfs_alloc_folio_array() The new helper will do the same thing as btrfs_alloc_page_array(), but with folios. One extra difference is, there is no extra helper for bulk allocation, thus it may not be as efficient as the page version. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	ae0d22a7fc	btrfs: migrate insert_inline_extent() to folio interfaces Since insert_inline_extent() now only accepts a single page, it's much easier to convert it to use folio interfaces. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	eb1fa9ab47	btrfs: make insert_inline_extent() accept one page directly Since our inline extent cannot accept anything larger than a sector, there is really no need to pass all the compressed pages to insert_inline_extent(). And just in case, expand the ASSERT()s to make sure we only try inline with compressed size no larger than sectorsize. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	98fe01af7e	btrfs: compression: convert page allocation to folio interfaces Currently we have two wrappers to allocate and free a page for compression usage: - btrfs_alloc_compr_page() - btrfs_free_compr_page() The allocator would try to grab a page from the pool, and only allocate a new page if the pool is empty. The reclaimer would check if the pool is full, and if not full it would put the page into the pool. This patch converts both helpers to use folio interfaces, and allowing further conversion of compression path to folios. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Qu Wenruo	6de3595473	btrfs: compression: add error handling for missed page cache For all the supported compression algorithms, the compression path would always need to grab the page cache, then do the compression. Normally we would get a page reference without any problem, since the write path should have already locked the pages in the write range. For the sake of error handling, we should handle the page cache miss case. Adds a common wrapper, btrfs_compress_find_get_page(), which calls find_get_page(), and do the error handling along with an error message. Callers inside compression path would only need to call btrfs_compress_find_get_page(), and error out if it returned any error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Filipe Manana	5d6f0e9890	btrfs: stop locking the source extent range during reflink Nowadays before starting a reflink operation we do this: 1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore); 2) Take the mmap lock of the inodes (struct btrfs_inode::i_mmap_lock); 3) Flush all delalloc in the source and target ranges; 4) Wait for all ordered extents in the source and target ranges to complete; 5) Lock the source and destination ranges in the inodes' io trees. In step 5 we lock the source range because: 1) We needed to serialize against mmap writes, but that is not needed anymore because nowadays we do that through the inode's i_mmap_lock (step 2). This happens since commit `8c99516a8c` ("btrfs: exclude mmaps while doing remap"); 2) To serialize against a concurrent relocation and avoid generating a delayed ref for an extent that was just dropped by relocation, see commit `d8b5524242` ("Btrfs: fix race between reflink/dedupe and relocation"). Locking the source range however blocks any concurrent reads for that range and makes test case generic/733 fail. So instead of locking the source range during reflinks, make relocation read lock the inode's i_mmap_lock, so that it serializes with a concurrent reflink while still able to run concurrently with mmap writes and allow concurrent reads too. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Dan Carpenter	4a43d735a6	btrfs: qgroup: delete unnecessary check in btrfs_qgroup_check_inherit() This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and unnecessary. The problem with the check is that static checkers flag it as a potential mixup of between units of bytes vs number of elements. Fortunately, the check can safely be deleted because the next check is correct and applies an even stricter limit: if (size != struct_size(inherit, qgroups, inherit->num_qgroups)) return -EINVAL; The "inherit" struct ends in a variable array of __u64 and "inherit->num_qgroups" is the number of elements in the array. At the start of the function we check that: if (size < sizeof(*inherit) \|\| size > PAGE_SIZE) return -EINVAL; Thus, since we verify that the whole struct fits within one page, that means that the number of elements in the inherit->qgroups[] array must be less than PAGE_SIZE. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:02 +02:00
Goldwyn Rodrigues	01b69bf990	btrfs: convert put_file_data() to folios Use folio instead of page in put_file_data(). Add a warning in case higher order folio is found, this will be implemented in the future. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Goldwyn Rodrigues	a16c2c48f4	btrfs: convert relocate_one_page() to folios and rename Convert page references to folios and call the respective folio functions. Since find_or_create_page() takes a mask argument, call __filemap_get_folio() instead of filemap_grab_folio(). The patch assumes folio size is PAGE_SIZE, add a warning in case it's a higher order that will be implemented in the future. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Goldwyn Rodrigues	8d6e5f9a0a	btrfs: page to folio conversion: prealloc_file_extent_cluster() Convert usage of page to folio in prealloc_file_extent_cluster() Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	70f1e5b6db	btrfs: rename err to ret in btrfs_direct_write() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	aefee7f1d8	btrfs: rename err to ret in prepare_pages() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	35cb2e90f4	btrfs: rename err to ret in btrfs_dirty_pages() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	04e4e189dd	btrfs: rename err to ret in create_reloc_inode() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	fdee5e557f	btrfs: rename err to ret in __btrfs_end_transaction() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	d5b634ae1f	btrfs: rename err to ret in convert_extent_bit() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	cbb6b5d208	btrfs: rename err to ret in __set_extent_bit() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	93bc66f4b6	btrfs: rename err to ret in btrfs_ioctl_snap_destroy() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:01 +02:00
Anand Jain	5e45b044b7	btrfs: rename err to ret in btrfs_cont_expand() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Anand Jain	c3a1cc8ff4	btrfs: rename err to ret in btrfs_rmdir() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Anand Jain	c87b979d9f	btrfs: rename err to ret in btrfs_initxattrs() Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Tavian Barnes	f32f20e2bd	btrfs: warn if EXTENT_BUFFER_UPTODATE is set while reading We recently tracked down a race condition that triggered a read for an extent buffer with EXTENT_BUFFER_UPTODATE already set. While this read was in progress, other concurrent readers would see the UPTODATE bit and return early as if the read was already complete, making accesses to the extent buffer conflict with the read operation that was overwriting it. Add a WARN_ON() to end_bbio_meta_read() for this situation to make similar races easier to spot in the future. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Tavian Barnes	1e2d183709	btrfs: add helper to clear EXTENT_BUFFER_READING We are clearing the bit and waking up any waiters in two different places. Factor that code out into a static helper function. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	c79f57eafc	btrfs: avoid pointless wake ups of drew lock readers When unlocking a write lock on a drew lock, at btrfs_drew_write_unlock(), it's pointless to wake up tasks waiting to acquire a read lock if we didn't decrement the 'writers' counter down to 0, since a read lock can only be acquired when the counter reaches a value of 0. Doing so is harmless from a functional point of view, but it's not efficient due to unnecessarily waking up tasks just for them to sleep again on the waitqueue. So change this to wake up readers only if we decremented the 'writers' counter to 0. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	c66f2afc71	btrfs: remove pointless writepages callback wrapper There's no point in having a static writepages callback in inode.c that does nothing besides calling extent_writepages from extent_io.c. So just remove the callback at inode.c and rename extent_writepages() to btrfs_writepages(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	7938d38b94	btrfs: remove pointless readahead callback wrapper There's no point in having a static readahead callback in inode.c that does nothing besides calling extent_readahead() from extent_io.c. So just remove the callback at inode.c and rename extent_readahead() to btrfs_readahead(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	2066bbfccf	btrfs: locking: rename __btrfs_tree_lock() and __btrfs_tree_read_lock() The __btrfs_tree_lock() and __btrfs_tree_read_lock() are using a naming with a double underscore prefix, which is specially not proper for exported functions. Remove the double underscore prefix from their name and add the "_nested" suffix. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	f40ca9cb58	btrfs: locking: inline btrfs_tree_lock() and btrfs_tree_read_lock() The functions btrfs_tree_lock() and btrfs_tree_read_lock() are very trivial so that can be made inline and avoid call overhead, as they are very often called inside critical sections (when searching a btree for example, attempting to lock a child node/leaf while holding a lock on the parent). So make them static inline, which even reduces the size of the btrfs module a little bit. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1718786 156276 16920 1891982 1cde8e fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1718650 156260 16920 1891830 1cddf6 fs/btrfs/btrfs.ko Running fs_mark also showed a tiny improvement with this script: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 FILES=100000 THREADS=$(nproc --all) echo "performance" \| \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT Before this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 180894.0 10705410 16 2400000 0 228211.4 10765738 23 3600000 0 215969.6 11011072 30 4800000 0 199077.1 11145587 46 6000000 0 176624.1 11658470 After this change: FSUse% Count Size Files/sec App Overhead 10 1200000 0 185312.3 10708377 16 2400000 0 229320.4 10858013 23 3600000 0 217958.7 11006167 30 4800000 0 205122.9 11112899 46 6000000 0 178039.1 11438852 Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Filipe Manana	05aa024382	btrfs: remove pointless BUG_ON() when creating snapshot When creating a snapshot we first check with btrfs_lookup_dir_item() if there is a name collision in the parent directory and then return an error if there's a collision. Then later on when trying to insert a dir item for the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW: static noinline int create_pending_snapshot(...) { (...) /* check if there is a file/dir which has the same name. / dir_item = btrfs_lookup_dir_item(...); (...) ret = btrfs_insert_dir_item(...); / We have check then name at the beginning, so it is impossible. */ BUG_ON(ret == -EEXIST \|\| ret == -EOVERFLOW); if (ret) { btrfs_abort_transaction(trans, ret); goto fail; } (...) } It's impossible to get the -EEXIST because we previously checked for a potential collision with btrfs_lookup_dir_item() and we know that after that no one could have added a colliding name because at this point the transaction is in its critical section, state TRANS_STATE_COMMIT_DOING, so no one can join this transaction to add a colliding name and neither can anyone start a new transaction to do that. As for the -EOVERFLOW, that can't happen as long as we have the extended references feature enabled, which is a mkfs default for many years now. In either case, the BUG_ON() is excessive as we can properly deal with any error and can abort the transaction and jump to the 'fail' label, in which case we'll also get the useful stack trace (just like a BUG_ON()) from the abort if the error is either -EEXIST or -EOVERFLOW. So remove the BUG_ON(). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-07 21:31:00 +02:00
Kent Overstreet	6e297a73bc	bcachefs: Add missing sched_annotate_sleep() in bch2_journal_flush_seq_async() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-07 11:02:37 -04:00
Kent Overstreet	54541c1f78	bcachefs: Fix race in bch2_write_super() bch2_write_super() was looping over online devices multiple times - dropping and retaking io_ref each time. This meant it could race with device removal; it could increment the sequence number on a device but fail to write it - and then if the device was re-added, it would get confused the next time around thinking a superblock write was silently dropped. Fix this by taking io_ref once, and stashing pointers to online devices in a darray. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-07 11:02:36 -04:00
Wolfram Sang	c1c53c26e3	gfs2: make timeout values more explicit 'timeout' is a vague name for the return value of wait_event_*_timeout because it actually returns the time left. Because the variable is never used later, just drop the return value. Since variable 'timeout' is then only used to carry a fixed timeout value, drop this in favor of a fixed function argument as in the other call to wait_event_timeout() above. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-05-07 12:42:48 +02:00
Linus Torvalds	dccb07f291	for-6.9-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmY5LaQACgkQxWXV+ddt WDs7aQ/8DbxhYTNqHmEv6860w/o7Sb856foIqlZ81v1r55XYIFGhTbvQntjQgvcI Kf8+Du6ijpYeAO2Iuj/7EQP/6yA+f5NCogpW8Nsr24riUCvzjhXr49KQbVg1SsdX i8iec0UCQlxq7RncbpGiIwgxPRMJhUEG/wRHWnGR3jOJFXSvsLJpywZbn+Yw1d7w kcUbHEzZPZqrWPAIcifpv/7qVCd1sPN8P3mMevcWtc1diEhQlHVVF7JnCcHxrwBP dIsNSWyt0YmgIt231GW6GKDwuQHyv870yHK9gumvpePsfcZnDBgeMuMvv0TykhgJ BHV2gwhIK11bNala1pw1F7CX4oiiHEeI/09/nh7xopcjnULMRFItGus2dkqDagSa ex4g48J412crWayZ5uFqAVYeO9MNufvLvCutUj1sD/teh2ymMq82gHzQO0FTu5GL NjWLoJXXyU18BgbXTmbm5rSMycDf1BG9Hv+MdxwEFrasF2q6Lhp+EIljUxN7+n49 i9GrLWptd8sBx/GtZXhsZlWP+vPSuHqdjZe61LD4B3IgBeGDJg6tJmHv8rEFO4Ws 9nkvaDVF03pHWxWOocDIzbrkpVwOLBaDHGwjH9Cn/lgIHL+zjXVpMaKz4/klpOr8 4/ehUajrOK6Wmyoi3fKYxZACnWK5HhFHYcB8zc1R8+zt+Pj/mbk= =2no9 -----END PGP SIGNATURE----- Merge tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two more fixes, both have some visible effects on user space: - add check if quotas are enabled when passing qgroup inheritance info, this affects snapper that could fail to create a snapshot - do check for leaf/node flag WRITTEN earlier so that nodes are completely validated before access, this used to be done by integrity checker but it's been removed and left an unhandled case" * tag 'for-6.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make sure that WRITTEN is set on all metadata blocks btrfs: qgroup: do not check qgroup inherit if qgroup is disabled	2024-05-06 13:43:13 -07:00
Trond Myklebust	939cb14d51	NFS/knfsd: Remove the invalid NFS error 'NFSERR_OPNOTSUPP' NFSERR_OPNOTSUPP is not described by any RFC, and should not be used. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 12:47:24 -04:00
Trond Myklebust	e221c45da3	knfsd: LOOKUP can return an illegal error value The 'NFS error' NFSERR_OPNOTSUPP is not described by any of the official NFS related RFCs, but appears to have snuck into some older .x files for NFSv2. Either way, it is not in RFC1094, RFC1813 or any of the NFSv4 RFCs, so should not be returned by the knfsd server, and particularly not by the "LOOKUP" operation. Instead, let's return NFSERR_STALE, which is more appropriate if the filesystem encodes the filehandle as FILEID_INVALID. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 12:47:24 -04:00
Kent Overstreet	71dac2482a	bcachefs: BCH_SB_LAYOUT_SIZE_BITS_MAX Define a constant for the max superblock size, to avoid a too-large shift. Reported-by: syzbot+a8b0fb419355c91dda7f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	88ab10186c	bcachefs: Add missing skcipher_request_set_callback() call Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	8060bf1d83	bcachefs: Fix snapshot_t() usage in bch2_fs_quota_read_inode() bch2_fs_quota_read_inode() wasn't entirely updated to the bch2_snapshot_tree() helper, which takes rcu lock. Reported-by: syzbot+a3a9a61224ed3b7f0010@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	0ec5b3b7cc	bcachefs: Fix shift-by-64 in bformat_needs_redo() Ancient versions of bcachefs produced packed formats that could represent keys that our in memory format cannot represent; bformat_needs_redo() has some tricky shifts to check for this sort of overflow. Reported-by: syzbot+594427aebfefeebe91c6@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	2bb9600d5d	bcachefs: Guard against unknown k.k->type in __bkey_invalid() For forwards compatibility we have to allow unknown key types, and only run the checks that make sense against them. Fix a missing guard on k.k->type being known. Reported-by: syzbot+ae4dc916da3ce51f284f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	f39055220f	bcachefs: Add missing validation for superblock section clean We were forgetting to check for jset entries that overrun the end of the section - both in validate and to_text(); to_text() needs to be safe for types that fail to validate. Reported-by: syzbot+c48865e11e7e893ec4ab@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	6b8cbfc3db	bcachefs: Fix assert in bch2_alloc_v4_invalid() Reported-by: syzbot+10827fa6b176e1acf1d0@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Reed Riley	9a0ec04511	bcachefs: fix overflow in fiemap filefrag (and potentially other utilities that call fiemap) sometimes pass ULONG_MAX as the length. fiemap_prep clamps excessively large lengths - but the calculation of end can overflow if it occurs before calling fiemap_prep. When this happens, filefrag assumes it has read to the end and exits. Signed-off-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	db42549d40	bcachefs: Add a better limit for maximum number of buckets The bucket_gens array is a single array allocation (one byte per bucket), and kernel allocations are still limited to INT_MAX. Check this limit to avoid failing the bucket_gens array allocation. Reported-by: syzbot+b29f436493184ea42e2b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	18b4abcead	bcachefs: Fix lifetime issue in device iterator helpers bch2_get_next_dev() and bch2_get_next_online_dev() iterate over devices, dropping and taking refs as they go; we can't access the previous device (for ca->dev_idx) after we've dropped our ref to it, unless we take rcu_read_lock() first. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	3a2d025927	bcachefs: Fix bch2_dev_lookup() refcounting bch2_dev_lookup() is supposed to take a ref on the device it returns, but for_each_member_device() takes refs as it iterates, for_each_member_device_rcu() does not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	1267df40ac	bcachefs: Initialize bch_write_op->failed in inline data path Normally this is initialized in __bch2_write(), which is executed in a loop, but the inline data path skips this. Reported-by: syzbot+fd3ccb331eb21f05d13b@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	feb077c177	bcachefs: Fix refcount put in sb_field_resize error path Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	4a8521b6bb	bcachefs: Inodes need extra padding for varint_decode_fast() Reported-by: syzbot+66b9b74f6520068596a9@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	b30b70ad8b	bcachefs: Fix early error path in bch2_fs_btree_key_cache_exit() Reported-by: syzbot+a35cdb62ec34d44fb062@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	a2ddaf965f	bcachefs: bucket_pos_to_bp_noerror() We don't want the assert when we're checking if the backpointer is valid. Reported-by: syzbot+bf7215c0525098e7747a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	7ffec9ccdc	bcachefs: don't free error pointers Reported-by: syzbot+3333603f569fc2ef258c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:58:17 -04:00
Kent Overstreet	72e71bf029	bcachefs: Fix a scheduler splat in __bch2_next_write_buffer_flush_journal_buf() We're using mutex_lock() inside a wait_event() conditional - prepare_to_wait() has already flipped task state, so potentially blocking ops need annotation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-05-06 10:14:13 -04:00
Mike Marshall	53e4efa470	orangefs: fix out-of-bounds fsid access Arnd Bergmann sent a patch to fsdevel, he says: "orangefs_statfs() copies two consecutive fields of the superblock into the statfs structure, which triggers a warning from the string fortification helpers" Jan Kara suggested an alternate way to do the patch to make it more readable. I ran both ideas through xfstests and both seem fine. This patch is based on Jan Kara's suggestion. Signed-off-by: Mike Marshall <hubcap@omnibond.com>	2024-05-06 10:10:36 -04:00
Stephen Smalley	442d27ff09	nfsd: set security label during create operations When security labeling is enabled, the client can pass a file security label as part of a create operation for the new file, similar to mode and other attributes. At present, the security label is received by nfsd and passed down to nfsd_create_setattr(), but nfsd_setattr() is never called and therefore the label is never set on the new file. This bug may have been introduced on or around commit `d6a97d3f58` ("NFSD: add security label to struct nfsd_attrs"). Looking at nfsd_setattr() I am uncertain as to whether the same issue presents for file ACLs and therefore requires a similar fix for those. An alternative approach would be to introduce a new LSM hook to set the "create SID" of the current task prior to the actual file creation, which would atomically label the new inode at creation time. This would be better for SELinux and a similar approach has been used previously (see security_dentry_create_files_as) but perhaps not usable by other LSMs. Reproducer: 1. Install a Linux distro with SELinux - Fedora is easiest 2. git clone https://github.com/SELinuxProject/selinux-testsuite 3. Install the requisite dependencies per selinux-testsuite/README.md 4. Run something like the following script: MOUNT=$HOME/selinux-testsuite sudo systemctl start nfs-server sudo exportfs -o rw,no_root_squash,security_label localhost:$MOUNT sudo mkdir -p /mnt/selinux-testsuite sudo mount -t nfs -o vers=4.2 localhost:$MOUNT /mnt/selinux-testsuite pushd /mnt/selinux-testsuite/ sudo make -C policy load pushd tests/filesystem sudo runcon -t test_filesystem_t ./create_file -f trans_test_file \ -e test_filesystem_filetranscon_t -v sudo rm -f trans_test_file popd sudo make -C policy unload popd sudo umount /mnt/selinux-testsuite sudo exportfs -u localhost:$MOUNT sudo rmdir /mnt/selinux-testsuite sudo systemctl stop nfs-server Expected output: <eliding noise from commands run prior to or after the test itself> Process context: unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023 Created file: trans_test_file File context: unconfined_u:object_r:test_filesystem_filetranscon_t:s0 File context is correct Actual output: <eliding noise from commands run prior to or after the test itself> Process context: unconfined_u:unconfined_r:test_filesystem_t:s0-s0:c0.c1023 Created file: trans_test_file File context: system_u:object_r:test_file_t:s0 File context error, expected: test_filesystem_filetranscon_t got: test_file_t Signed-off-by: Stephen Smalley <stephen.smalley.work@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:24 -04:00
Chuck Lever	cc63c21682	NFSD: Add COPY status code to OFFLOAD_STATUS response Clients that send an OFFLOAD_STATUS might want to distinguish between an async COPY operation that is still running, has completed successfully, or that has failed. The intention of this patch is to make NFSD behave like this: * Copy still running: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied so far, and an empty osr_status array * Copy completed successfully: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied, and an osr_status of NFS4_OK * Copy failed: OFFLOAD_STATUS returns NFS4_OK, the number of bytes copied, and an osr_status other than NFS4_OK * Copy operation lost, canceled, or otherwise unrecognized: OFFLOAD_STATUS returns NFS4ERR_BAD_STATEID NB: Though RFC 7862 Section 11.2 lists a small set of NFS status codes that are valid for OFFLOAD_STATUS, there do not seem to be any explicit spec limits on the status codes that may be returned in the osr_status field. At this time we have no unit tests for COPY and its brethren, as pynfs does not yet implement support for NFSv4.2. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:23 -04:00
Chuck Lever	a8483b9ad9	NFSD: Record status of async copy operation in struct nfsd4_copy After a client has started an asynchronous COPY operation, a subsequent OFFLOAD_STATUS operation will need to report the status code once that COPY operation has completed. The recorded status record will be used by a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:23 -04:00
Lorenzo Bianconi	16a4711774	NFSD: add listener-{set,get} netlink command Introduce write_ports netlink command. For listener-set, userspace is expected to provide a NFS listeners list it wants enabled. All other sockets will be closed. Reviewed-by: Jeff Layton <jlayton@kernel.org> Co-developed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:22 -04:00
Lorenzo Bianconi	5a939bea25	NFSD: add write_version to netlink command Introduce write_version netlink command through a "declarative" interface. This patch introduces a change in behavior since for version-set userspace is expected to provide a NFS major/minor version list it wants to enable while all the other ones will be disabled. (procfs write_version command implements imperative interface where the admin writes +3/-3 to enable/disable a single version. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:21 -04:00
Lorenzo Bianconi	924f4fb003	NFSD: convert write_threads to netlink command Introduce write_threads netlink command similar to the one available through the procfs. Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Co-developed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:21 -04:00
Jeff Layton	9077d59847	NFSD: allow callers to pass in scope string to nfsd_svc Currently admins set this by using unshare to create a new uts namespace, and then resetting the hostname. With the new netlink interface we can just pass this in directly. Prepare nfsd_svc for this change. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:20 -04:00
Jeff Layton	0842b4c80b	NFSD: move nfsd_mutex handling into nfsd_svc callers Currently nfsd_svc holds the nfsd_mutex over the whole function. For some of the later netlink patches though, we want to do some other things to the server before starting it. Move the mutex handling into the callers. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:20 -04:00
Li kunyu	03b0036f45	lockd: host: Remove unnecessary statements＇host = NULL;＇ In 'nlm_alloc_host', the host has already been assigned a value of NULL when defined, so 'host=NULL;' Can be deleted. Signed-off-by: Li kunyu <kunyu@nfschina.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:20 -04:00
NeilBrown	0770249b90	nfsd: don't create nfsv4recoverydir in nfsdfs when not used. When CONFIG_NFSD_LEGACY_CLIENT_TRACKING is not set, the virtual file /proc/fs/nfsd/nfsv4recoverydir is created but responds EINVAL to any access. This is not useful, is somewhat surprising, and it causes ltp to complain. The only known user of this file is in nfs-utils, which handles non-existence and read-failure equally well. So there is nothing to gain from leaving the file present but inaccessible. So this patch removes the file when its content is not available - i.e. when that config option is not selected. Also remove the #ifdef which hides some of the enum values when CONFIG_NFSD_V$ not selection. simple_fill_super() quietly ignores array entries that are not present, so having slots in the array that don't get used is perfectly acceptable. So there is no value in this #ifdef. Reported-by: Petr Vorel <pvorel@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: `74fd48739d` ("nfsd: new Kconfig option for legacy client tracking") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:19 -04:00
NeilBrown	d43113fbbf	nfsd: optimise recalculate_deny_mode() for a common case recalculate_deny_mode() takes time that is linear in the number of stateids active on the file. When called from release_openowner -> free_ol_stateid_reaplist ->nfs4_free_ol_stateid -> release_all_access the number of times it is called is linear in the number of stateids. The net result is that time taken by release_openowner is quadratic in the number of stateids. When the nfsd server is shut down while there are many active stateids this can result in a soft lockup. ("CPU stuck for 302s" seen in one case). In many cases all the states have the same deny modes and there is no need to examine the entire list in recalculate_deny_mode(). In particular, recalculate_deny_mode() will only reduce the deny mode, never increase it. So if some prefix of the list causes the original deny mode to be required, there is no need to examine the remainder of the list. So we can improve recalculate_deny_mode() to usually run in constant time, so release_openowner will typically be only linear in the number of states. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:19 -04:00
Jeff Layton	9320f27fda	nfsd: add tracepoint in mark_client_expired_locked Show client info alongside the number of cl_rpc_users. If that's elevated, then we can infer that this function returned nfserr_jukebox. [ cel: For additional debugging of RPC user refcounting ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:19 -04:00
Chuck Lever	2d49901150	nfsd: new tracepoint for check_slot_seqid Replace a dprintk in check_slot_seqid with tracepoints. These new tracepoints track slot sequence numbers during operation. Suggested-by: Jeffrey Layton <jlayton@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:18 -04:00
Jeff Layton	db2acd6e8a	nfsd: drop extraneous newline from nfsd tracepoints We never want a newline in tracepoint output. Signed-off-by: Jeff Layton <jlayton@kernel.org> Tested-by: Vladimir Benes <vbenes@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:18 -04:00
Kefeng Wang	7d12cce878	fs: nfsd: use group allocation/free of per-cpu counters API Use group allocation/free of per-cpu counters api to accelerate nfsd percpu_counters init/destroy(), and also squash the nfsd_percpu_counters_init/reset/destroy() and nfsd_counters_init/destroy() into callers to simplify code. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:17 -04:00
Jeff Layton	33a1e6ea73	nfsd: trivial GET_DIR_DELEGATION support This adds basic infrastructure for handing GET_DIR_DELEGATION calls from clients, including the decoders and encoders. For now, it always just returns NFS4_OK + GDD4_UNAVAIL. Eventually clients may start sending this operation, and it's better if we can return GDD4_UNAVAIL instead of having to abort the whole compound. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:17 -04:00
Chuck Lever	38f080f3cd	NFSD: Move callback_wq into struct nfs4_client Commit `8838203667` ("nfsd: update workqueue creation") made the callback_wq single-threaded, presumably to protect modifications of cl_cb_client. See documenting comment for nfsd4_process_cb_update(). However, cl_cb_client is per-lease. There's no other reason that all callback operations need to be dispatched via a single thread. The single threading here means all client callbacks can be blocked by a problem with one client. Change the NFSv4 callback client so it serializes per-lease instead of serializing all NFSv4 callback operations on the server. Reported-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:16 -04:00
NeilBrown	56c35f43ee	nfsd: drop st_mutex before calling move_to_close_lru() move_to_close_lru() is currently called with ->st_mutex held. This can lead to a deadlock as move_to_close_lru() waits for sc_count to drop to 2, and some threads holding a reference might be waiting for the mutex. These references will never be dropped so sc_count will never reach 2. There can be no harm in dropping ->st_mutex before move_to_close_lru() because the only place that takes the mutex is nfsd4_lock_ol_stateid(), and it quickly aborts if sc_type is NFS4_CLOSED_STID, which it will be before move_to_close_lru() is called. See also https://lore.kernel.org/lkml/4dd1fe21e11344e5969bb112e954affb@jd.com/T/ where this problem was raised but not successfully resolved. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:16 -04:00
NeilBrown	eec7620800	nfsd: replace rp_mutex to avoid deadlock in move_to_close_lru() move_to_close_lru() waits for sc_count to become zero while holding rp_mutex. This can deadlock if another thread holds a reference and is waiting for rp_mutex. By the time we get to move_to_close_lru() the openowner is unhashed and cannot be found any more. So code waiting for the mutex can safely retry the lookup if move_to_close_lru() has started. So change rp_mutex to an atomic_t with three states: RP_UNLOCK - state is still hashed, not locked for reply RP_LOCKED - state is still hashed, is locked for reply RP_UNHASHED - state is not hashed, no code can get a lock. Use wait_var_event() to wait for either a lock, or for the owner to be unhashed. In the latter case, retry the lookup. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:16 -04:00
NeilBrown	b3f03739ca	nfsd: move nfsd4_cstate_assign_replay() earlier in open handling. Rather than taking the rp_mutex (via nfsd4_cstate_assign_replay) in nfsd4_cleanup_open_state() (which seems counter-intuitive), take it and assign rp_owner as soon as possible - in nfsd4_process_open1(). This will support a future change when nfsd4_cstate_assign_replay() might fail. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:15 -04:00
NeilBrown	23df17788c	nfsd: perform all find_openstateowner_str calls in the one place. Currently find_openstateowner_str look ups are done both in nfsd4_process_open1() and alloc_init_open_stateowner() - the latter possibly being a surprise based on its name. It would be easier to follow, and more conformant to common patterns, if the lookup was all in the one place. So replace alloc_init_open_stateowner() with find_or_alloc_open_stateowner() and use the latter in nfsd4_process_open1() without any calls to find_openstateowner_str(). This means all finds are find_openstateowner_str_locked() and find_openstateowner_str() is no longer needed. So discard find_openstateowner_str() and rename find_openstateowner_str_locked() to find_openstateowner_str(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-05-06 09:07:15 -04:00
Matthew Wilcox	e0ffb29bc5	mm: simplify thp_vma_allowable_order Combine the three boolean arguments into one flags argument for readability. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:53 -07:00
Matthew Wilcox (Oracle)	196ad49cd6	f2fs: convert f2fs_clear_page_cache_dirty_tag to use a folio Removes uses of page_mapping() and page_index(). Link: https://lkml.kernel.org/r/20240423225552.4113447-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:47 -07:00
Matthew Wilcox (Oracle)	262f014dd7	fscrypt: convert bh_get_inode_and_lblk_num to use a folio Patch series "Remove page_mapping()". There are only a few users left. Convert them all to either call folio_mapping() or just use folio->mapping directly. This patch (of 6): Remove uses of page->index, page_mapping() and b_page. Saves a call to compound_head(). Link: https://lkml.kernel.org/r/20240423225552.4113447-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240423225552.4113447-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:47 -07:00
David Hildenbrand	6401a2e690	fs/proc/task_mmu: convert smaps_hugetlb_range() to work on folios Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, use huge_ptep_get() + pte_page() instead of ptep_get() + vm_normal_page(), just like we do in pagemap_hugetlb_range(). No functional change intended. Link: https://lkml.kernel.org/r/20240417092313.753919-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:41 -07:00
David Hildenbrand	88e4e47c12	fs/proc/task_mmu: convert pagemap_hugetlb_range() to work on folios Patch series "fs/proc/task_mmu: convert hugetlb functions to work on folis". Let's convert two more functions, getting rid of two more page_mapcount() calls. This patch (of 2): Let's get rid of another page_mapcount() check and simply use folio_likely_mapped_shared(), which is precise for hugetlb folios. While at it, also check for PMD table sharing, like we do in smaps_hugetlb_range(). No functional change intended, except that we would now detect hugetlb folios shared via PMD table sharing correctly. Link: https://lkml.kernel.org/r/20240417092313.753919-1-david@redhat.com Link: https://lkml.kernel.org/r/20240417092313.753919-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:41 -07:00
Matthew Wilcox (Oracle)	0b116ff4dc	buffer: improve bdev_getblk documentation Add some more information about the state of the buffer_head returned. Link: https://lkml.kernel.org/r/20240416031754.4076917-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:40 -07:00
Matthew Wilcox (Oracle)	b73a936f99	buffer: add kernel-doc for bforget() and __bforget() Distinguish these functions from brelse() and __brelse(). Link: https://lkml.kernel.org/r/20240416031754.4076917-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:40 -07:00
Matthew Wilcox (Oracle)	66924fdaf8	buffer: add kernel-doc for brelse() and __brelse() Move the documentation for __brelse() to brelse(), format it as kernel-doc and update it from talking about pages to folios. Link: https://lkml.kernel.org/r/20240416031754.4076917-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:39 -07:00
Matthew Wilcox (Oracle)	324ecaee46	buffer: fix __bread and __bread_gfp kernel-doc The extra indentation confused the kernel-doc parser, so remove it. Fix some other wording while I'm here, and advise the user they need to call brelse() on this buffer. __bread_gfp() isn't used directly by filesystems, but the other wrappers for it don't have documentation, so document it accordingly. Link: https://lkml.kernel.org/r/20240416031754.4076917-5-willy@infradead.org Co-developed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:39 -07:00
Matthew Wilcox (Oracle)	b1888d1432	buffer: add kernel-doc for try_to_free_buffers() The documentation for this function has become separated from it over time; move it to the right place and turn it into kernel-doc. Mild editing of the content to make it more about what the function does, and less about how it does it. Link: https://lkml.kernel.org/r/20240416031754.4076917-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:39 -07:00
Matthew Wilcox (Oracle)	3814ec8954	buffer: add kernel-doc for block_dirty_folio() Turn the excellent documentation for this function into kernel-doc. Replace 'page' with 'folio' and make a few other minor updates. Link: https://lkml.kernel.org/r/20240416031754.4076917-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:53:39 -07:00
Ryan Roberts	2c7ad9a590	fs/proc/task_mmu: fix uffd-wp confusion in pagemap_scan_pmd_entry() pagemap_scan_pmd_entry() checks if uffd-wp is set on each pte to avoid unnecessary if set. However it was previously checking with `pte_uffd_wp(ptep_get(pte))` without first confirming that the pte was present. It is only valid to call pte_uffd_wp() for present ptes. For swap ptes, pte_swp_uffd_wp() must be called because the uffd-wp bit may be kept in a different position, depending on the arch. This was leading to test failures in the pagemap_ioctl mm selftest, when bringing up uffd-wp support on arm64 due to incorrectly interpretting the uffd-wp status of migration entries. Let's fix this by using the correct check based on pte_present(). While we are at it, let's pass the pte to make_uffd_wp_pte() to avoid the pointless extra ptep_get() which can't be optimized out due to READ_ONCE() on many arches. Link: https://lkml.kernel.org/r/20240429114104.182890-1-ryan.roberts@arm.com Fixes: `12f6b01a0b` ("fs/proc/task_mmu: add fast paths to get/clear PAGE_IS_WRITTEN flag") Closes: https://lore.kernel.org/linux-arm-kernel/ZiuyGXt0XWwRgFh9@x1n/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:07 -07:00
Ryan Roberts	c70dce4982	fs/proc/task_mmu: fix loss of young/dirty bits during pagemap scan make_uffd_wp_pte() was previously doing: pte = ptep_get(ptep); ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); But if another thread accessed or dirtied the pte between the first 2 calls, this could lead to loss of that information. Since ptep_modify_prot_start() gets and clears atomically, the following is the correct pattern and prevents any possible race. Any access after the first call would see an invalid pte and cause a fault: pte = ptep_modify_prot_start(ptep); pte = pte_mkuffd_wp(pte); ptep_modify_prot_commit(ptep, pte); Link: https://lkml.kernel.org/r/20240429114017.182570-1-ryan.roberts@arm.com Fixes: `52526ca7fd` ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:06 -07:00
Peter Xu	c88033efe9	mm/userfaultfd: reset ptes when close() for wr-protected ones Userfaultfd unregister includes a step to remove wr-protect bits from all the relevant pgtable entries, but that only covered an explicit UFFDIO_UNREGISTER ioctl, not a close() on the userfaultfd itself. Cover that too. This fixes a WARN trace. The only user visible side effect is the user can observe leftover wr-protect bits even if the user close()ed on an userfaultfd when releasing the last reference of it. However hopefully that should be harmless, and nothing bad should happen even if so. This change is now more important after the recent page-table-check patch we merged in mm-unstable (446dd9ad37d0 ("mm/page_table_check: support userfault wr-protect entries")), as we'll do sanity check on uffd-wp bits without vma context. So it's better if we can 100% guarantee no uffd-wp bit leftovers, to make sure each report will be valid. Link: https://lore.kernel.org/all/000000000000ca4df20616a0fe16@google.com/ Fixes: `f369b07c86` ("mm/uffd: reset write protection when unregister with wp-mode") Analyzed-by: David Hildenbrand <david@redhat.com> Link: https://lkml.kernel.org/r/20240422133311.2987675-1-peterx@redhat.com Reported-by: syzbot+d8426b591c36b21c750e@syzkaller.appspotmail.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-05-05 17:28:04 -07:00
Linus Torvalds	4efaa5acf0	epoll: be better about file lifetimes epoll can call out to vfs_poll() with a file pointer that may race with the last 'fput()'. That would make f_count go down to zero, and while the ep->mtx locking means that the resulting file pointer tear-down will be blocked until the poll returns, it means that f_count is already dead, and any use of it won't actually get a reference to the file any more: it's dead regardless. Make sure we have a valid ref on the file pointer before we call down to vfs_poll() from the epoll routines. Link: https://lore.kernel.org/lkml/0000000000002d631f0615918f1e@google.com/ Reported-by: syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-05-05 14:00:48 -07:00
Linus Torvalds	e92b99ae82	tracing and tracefs fixes for v6.9 - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZjXxzxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qqzGAQCX8g7gtngGgwSsWqPW5GmecCifwFja k7cVEDhMYPnDeAEAkYi2ZBgJRkPsWPfMRClDK/DXP4woOo58asxtIxfTMgg= =mCkt -----END PGP SIGNATURE----- Merge tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and tracefs fixes from Steven Rostedt: - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization. - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file. - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement. * tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Have "events" directory get permissions from its parent eventfs: Do not treat events directory different than other directories eventfs: Do not differentiate the toplevel events directory tracefs: Still use mount point as default permissions for instances tracefs: Reset permissions on remount if permissions are options eventfs: Free all of the eventfs_inode after RCU eventfs/tracing: Add callback for release of an eventfs_inode	2024-05-05 09:53:09 -07:00
Namjae Jeon	691aae4f36	ksmbd: do not grant v2 lease if parent lease key and epoch are not set This patch fix xfstests generic/070 test with smb2 leases = yes. cifs.ko doesn't set parent lease key and epoch in create context v2 lease. ksmbd suppose that parent lease and epoch are vaild if data length is v2 lease context size and handle directory lease using this values. ksmbd should hanle it as v1 lease not v2 lease if parent lease key and epoch are not set in create context v2 lease. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-05-04 23:53:36 -05:00
Namjae Jeon	d1c189c6cb	ksmbd: use rwsem instead of rwlock for lease break lease break wait for lease break acknowledgment. rwsem is more suitable than unlock while traversing the list for parent lease break in ->m_op_list. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-05-04 23:53:36 -05:00
Namjae Jeon	97c2ec6466	ksmbd: avoid to send duplicate lease break notifications This patch fixes generic/011 when enable smb2 leases. if ksmbd sends multiple notifications for a file, cifs increments the reference count of the file but it does not decrement the count by the failure of queue_work. So even if the file is closed, cifs does not send a SMB2_CLOSE request. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-05-04 23:53:35 -05:00
Namjae Jeon	cc00bc83f2	ksmbd: off ipv6only for both ipv4/ipv6 binding ΕΛΕΝΗ reported that ksmbd binds to the IPV6 wildcard (::) by default for ipv4 and ipv6 binding. So IPV4 connections are successful only when the Linux system parameter bindv6only is set to 0 [default value]. If this parameter is set to 1, then the ipv6 wildcard only represents any IPV6 address. Samba creates different sockets for ipv4 and ipv6 by default. This patch off sk_ipv6only to support IPV4/IPV6 connections without creating two sockets. Cc: stable@vger.kernel.org Reported-by: ΕΛΕΝΗ ΤΖΑΒΕΛΛΑ <helentzavellas@yahoo.gr> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-05-04 23:53:35 -05:00
Li zeming	75cde4e37a	kernfs: mount: Remove unnecessary ‘NULL’ values from knparent knparent is assigned first, so it does not need to initialize the assignment. Signed-off-by: Li zeming <zeming@nfschina.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240415102009.9926-1-zeming@nfschina.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2024-05-04 19:02:39 +02:00
Steven Rostedt (Google)	d57cf30c4c	eventfs: Have "events" directory get permissions from its parent The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r----- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 18:55 current_tracer -rw-r----- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w------- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r----- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] -rw-r----- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r----- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:16 current_tracer -rw-r----- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w------- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r----- 1 root lkp 0 May 1 21:16 set_event Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.161887248@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `8186fff7ab` ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	22e61e15af	eventfs: Do not treat events directory different than other directories Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Link: https://lore.kernel.org/linux-trace-kernel/20240502200906.002923579@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `8186fff7ab` ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	d53891d348	eventfs: Do not differentiate the toplevel events directory The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.846448710@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `8186fff7ab` ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	6599bd5517	tracefs: Still use mount point as default permissions for instances If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r----- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r----- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.686838327@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `8186fff7ab` ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	baa23a8d43	tracefs: Reset permissions on remount if permissions are options There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r----- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root root 0 May 1 21:25 dynamic_events -r--r----- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r----- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r----- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r----- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r----- 1 root lkp 0 May 1 21:25 current_tracer -rw-r----- 1 root tracing 0 May 1 21:25 dynamic_events -r--r----- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r----- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.529542160@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `8186fff7ab` ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	ee4e037947	eventfs: Free all of the eventfs_inode after RCU The freeing of eventfs_inode via a kfree_rcu() callback. But the content of the eventfs_inode was being freed after the last kref. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop. Instead of using kfree_rcu() use call_rcu() that calls a function to do all the freeing of the eventfs_inode after a RCU grace period has expired. Link: https://lore.kernel.org/linux-trace-kernel/20240502200905.370261163@goodmis.org Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `43aa6f97c2` ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Steven Rostedt (Google)	b63db58e2f	eventfs/tracing: Add callback for release of an eventfs_inode Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240502090315.448cba46@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `5790b1fb3d` ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu <Tze-nan.Wu@mediatek.com> Tested-by: Tze-nan Wu (吳澤南) <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-05-04 04:25:37 -04:00
Matthew Wilcox (Oracle)	50fabd42cb	gfs2: Convert gfs2_aspace_writepage() to use a folio Convert the incoming struct page to a folio and use it throughout. Saves six calls to compound_head(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-05-03 21:01:02 +02:00
Matthew Wilcox (Oracle)	b844048011	gfs2: Add a migrate_folio operation for journalled files For journalled data, folio migration currently works by writing the folio back, freeing the folio and faulting the new folio back in. We can bypass that by telling the migration code to migrate the buffer_heads attached to our folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-05-03 21:01:02 +02:00
Eric Biggers	ee5814ddde	fsverity: use register_sysctl_init() to avoid kmemleak warning Since the fsverity sysctl registration runs as a builtin initcall, there is no corresponding sysctl deregistration and the resulting struct ctl_table_header is not used. This can cause a kmemleak warning just after the system boots up. (A pointer to the ctl_table_header is stored in the fsverity_sysctl_header static variable, which kmemleak should detect; however, the compiler can optimize out that variable.) Avoid the kmemleak warning by using register_sysctl_init() which is intended for use by builtin initcalls and uses kmemleak_not_leak(). Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/r/CAHj4cs8DTSvR698UE040rs_pX1k-WVe7aR6N2OoXXuhXJPDC-w@mail.gmail.com Cc: stable@vger.kernel.org Reviewed-by: Joel Granados <j.granados@samsung.com> Link: https://lore.kernel.org/r/20240501025331.594183-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2024-05-03 08:30:58 -07:00
Ritesh Harjani (IBM)	290fa9471e	ext2: Remove LEGACY_DIRECT_IO dependency commit `fb5de4358e` ("ext2: Move direct-io to use iomap"), converted ext2 direct-io to iomap which killed the call to blockdev_direct_IO(). So let's remove LEGACY_DIRECT_IO config dependency from ext2 Kconfig. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <f3303addc0b5cd7e5760beb2374b7e538a49d898.1714727887.git.ritesh.list@gmail.com>	2024-05-03 11:50:28 +02:00
Al Viro	3386183382	nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode I suspect that inode_attach_wb() use is rather unidiomatic, but that's a separate story - in any case, its use is a few times per mount and the route by which we access that inode is "the host of address_space a page belongs to". Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-03 02:36:55 -04:00
Al Viro	2d0026a490	gfs2: more obvious initializations of mapping->host what's going on is copying the ->host of bdev's address_space Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-4-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	53cd4cd3b1	fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping both for ->i_blkbits and both want the address_space in question anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-03 02:36:51 -04:00
Al Viro	22f89a4f8c	grow_dev_folio(): we only want ->bd_inode->i_mapping there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-3-viro@zeniv.linux.org.uk Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	224941e837	use ->bd_mapping instead of ->bd_inode->i_mapping Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	a09cd552e0	Merge branch 'misc.erofs' into work.bdev	2024-05-03 02:36:28 -04:00
Yu Kuai	9baf50dfff	bcachefs: remove dead function bdev_sectors() bdev_sectors() is not used hence remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-10-viro@zeniv.linux.org.uk Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:28:28 -04:00
Yu Kuai	559428a915	ext4: remove block_device_ejected() block_device_ejected() is added by commit `bdfe0cbd74` ("Revert "ext4: remove block_device_ejected"") in 2015. At that time 'bdi->wb' is destroyed synchronized from del_gendisk(), hence if ext4 is still mounted, and then mark_buffer_dirty() will reference destroyed 'wb'. However, such problem doesn't exist anymore: - commit `d03f6cdc1f` ("block: Dynamically allocate and refcount backing_dev_info") switch bdi to use refcounting; - commit `13eec2363e` ("fs: Get proper reference for s_bdi"), will grab additional reference of bdi while mounting, so that 'bdi->wb' will not be destroyed until generic_shutdown_super(). Hence remove this dead function block_device_ejected(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-7-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:28:27 -04:00
Christoph Hellwig	25576c5420	xfs: simplify iext overflow checking and upgrade Currently the calls to xfs_iext_count_may_overflow and xfs_iext_count_upgrade are always paired. Merge them into a single function to simplify the callers and the actual check and upgrade logic itself. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:20:06 +05:30
Christoph Hellwig	86de848403	xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent Accessing if_bytes without the ilock is racy. Remove the initial if_bytes == 0 check in xfs_reflink_end_cow_extent and let ext_iext_lookup_extent fail for this case after we've taken the ilock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:20:06 +05:30
Christoph Hellwig	99fb6b7ad1	xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later Defer the extent counter size upgrade until we know we're going to modify the extent mapping. This also defers dirtying the transaction and will allow us safely back out later in the function in later changes. Fixes: `4f86bb4b66` ("xfs: Conditionally upgrade existing inodes to use large extent counters") Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:20:06 +05:30
Christoph Hellwig	cc3c92e7e7	xfs: xfs_quota_unreserve_blkres can't fail Unreserving quotas can't fail due to quota limits, and we'll notice a shut down file system a bit later in all the callers anyway. Return void and remove the error checking and propagation in the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:15:03 +05:30
Christoph Hellwig	f7b9ee7845	xfs: consolidate the xfs_quota_reserve_blkres definitions xfs_trans_reserve_quota_nblks is already stubbed out if quota support is disabled, no need for an extra xfs_quota_reserve_blkres stub. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:15:03 +05:30
Christoph Hellwig	67a841f9d7	xfs: clean up buffer allocation in xlog_do_recovery_pass Merge the initial xlog_alloc_buffer calls, and pass the variable designating the length that is initialized to 1 above instead of passing the open coded 1 directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:10:17 +05:30
Christoph Hellwig	45cf976008	xfs: fix log recovery buffer allocation for the legacy h_size fixup Commit `a70f9fe52d` ("xfs: detect and handle invalid iclog size set by mkfs") added a fixup for incorrect h_size values used for the initial umount record in old xfsprogs versions. Later commit `0c771b99d6` ("xfs: clean up calculation of LR header blocks") cleaned up the log reover buffer calculation, but stoped using the fixed up h_size value to size the log recovery buffer, which can lead to an out of bounds access when the incorrect h_size does not come from the old mkfs tool, but a fuzzer. Fix this by open coding xlog_logrec_hblks and taking the fixed h_size into account for this calculation. Fixes: `0c771b99d6` ("xfs: clean up calculation of LR header blocks") Reported-by: Sam Sun <samsun1006219@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-05-03 11:10:17 +05:30
Kemeng Shi	da5704eef7	ext4: open coding repeated check in next_linear_group Open coding repeated check in next_linear_group. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-6-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:33 -04:00
Kemeng Shi	2caffb6a27	ext4: use correct criteria name instead stale integer number in comment Use correct criteria name instead stale integer number in comment Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	d1a3924e43	ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk In mb_mark_used, we will find free chunk and mark it inuse. For chunk in mid of passed range, we could simply mark whole chunk inuse. For chunk at end of range, we may need to mark a continuous bits at end of part of chunk inuse and keep rest part of chunk free. To only mark a part of chunk inuse, we firstly mark whole chunk inuse and then mark a continuous range at end of chunk free. Function mb_mark_used does several times of "mb_find_buddy; mb_clear_bit; ..." to mark a continuous range free which can be done by simply calling ext4_mb_mark_free_simple which free continuous bits in a more effective way. Just call ext4_mb_mark_free_simple in mb_mark_used to use existing and effective code to free continuous blocks in chunk at end of passed range. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	d0b88624f8	ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used Add test_mb_mark_used_cost to estimate cost of mb_mark_used Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20240424061904.987525-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	9c97c34a99	ext4: keep "prefetch_grp" and "nr" consistent Keep "prefetch_grp" and "nr" consistent to avoid to call ext4_mb_prefetch_fini with non-prefetched groups. When we step into next criteria, "prefetch_grp" is set to prefetch start of new criteria while "nr" is number of the prefetched group in previous criteria. If previous criteria and next criteria are both inexpensive (< CR_GOAL_LEN_SLOW) and prefetch_ios reachs sbi->s_mb_prefetch_limit in previous criteria, "prefetch_grp" and "nr" will be inconsistent and may introduce unexpected cost to do ext4_mb_init_group for non-prefetched groups. Reset "nr" to 0 when we reset "prefetch_grp" to goal group to keep them consistent. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240424061904.987525-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:12:32 -04:00
Kemeng Shi	a11adf7be9	ext4: implement filesystem specific alloc_inode in unit test We expect inode with ext4_info_info type as following: mbt_kunit_init mbt_mb_init ext4_mb_init ext4_mb_init_backend sbi->s_buddy_cache = new_inode(sb); EXT4_I(sbi->s_buddy_cache)->i_disksize = 0; Implement alloc_inode ionde with ext4_inode_info type to avoid out-of-bounds write. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322165518.8147-1-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:04:35 -04:00
Jan Kara	0a46ef2347	ext4: do not create EA inode under buffer lock ext4_xattr_set_entry() creates new EA inodes while holding buffer lock on the external xattr block. This is problematic as it nests all the allocation locking (which acquires locks on other buffers) under the buffer lock. This can even deadlock when the filesystem is corrupted and e.g. quota file is setup to contain xattr block as data block. Move the allocation of EA inode out of ext4_xattr_set_entry() into the callers. Reported-by: syzbot+a43d4f48b8397d0e41a9@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:02:24 -04:00
Jan Kara	4f3e6db3c3	Revert "ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()" This reverts commit `7f48212678`. We will need the special cleanup handling once we move allocation of EA inode outside of the buffer lock in the following patch. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240321162657.27420-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-03 00:02:08 -04:00
Justin Stitt	744a56389f	ext4: replace deprecated strncpy with alternatives strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. in file.c: s_last_mounted is marked as __nonstring meaning it does not need to be NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the string source to the byte array destination -- while also ensuring to pad with zeroes. in ioctl.c: We can drop the memset and size argument in favor of using the new 2-argument version of strscpy_pad() -- which was introduced with Commit `e6584c3964` ("string: Allow 2-argument strscpy()"). This guarantees NUL-termination and NUL-padding on the destination buffer -- which seems to be a requirement judging from this comment: \| static int ext4_ioctl_getlabel(struct ext4_sb_info sbi, char __user user_label) \| { \| char label[EXT4_LABEL_MAX + 1]; \| \| /* \| * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because \| * FSLABEL_MAX must include terminating null byte, while s_volume_name \| * does not have to. \| */ in super.c: s_first_error_func is marked as __nonstring meaning we can take the same approach as in file.c; just use strtomem_pad() Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:55:10 -04:00
Baokun Li	e19089dff5	ext4: clean up s_mb_rb_lock to fix build warnings with C=1 Running sparse (make C=1) on mballoc.c we get the following warning: fs/ext4/mballoc.c:3194:13: warning: context imbalance in 'ext4_mb_seq_structs_summary_start' - wrong count at exit This is because __acquires(&EXT4_SB(sb)->s_mb_rb_lock) was called in ext4_mb_seq_structs_summary_start(), but s_mb_rb_lock was removed in commit `83e80a6e35` ("ext4: use buckets for cr 1 block scan instead of rbtree"), so remove the __acquires to silence the warning. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-10-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	261341a932	ext4: set the type of max_zeroout to unsigned int to avoid overflow The max_zeroout is of type int and the s_extent_max_zeroout_kb is of type uint, and the s_extent_max_zeroout_kb can be freely modified via the sysfs interface. When the block size is 1024, max_zeroout may overflow, so declare it as unsigned int to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	9a9f3a9842	ext4: set type of ac_groups_linear_remaining to __u32 to avoid overflow Now ac_groups_linear_remaining is of type __u16 and s_mb_max_linear_groups is of type unsigned int, so an overflow occurs when setting a value above 65535 through the mb_max_linear_groups sysfs interface. Therefore, the type of ac_groups_linear_remaining is set to __u32 to avoid overflow. Fixes: `196e402adf` ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:31 -04:00
Baokun Li	63bfe84105	ext4: add positive int attr pointer to avoid sysfs variables overflow The following variables controlled by the sysfs interface are of type int and are normally used in the range [0, INT_MAX], but are declared as attr_pointer_ui, and thus may be set to values that exceed INT_MAX and result in overflows to get negative values. err_ratelimit_burst msg_ratelimit_burst warning_ratelimit_burst err_ratelimit_interval_ms msg_ratelimit_interval_ms warning_ratelimit_interval_ms Therefore, we add attr_pointer_pi (aka positive int attr pointer) with a value range of 0-INT_MAX to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	b7b2a5799b	ext4: add new attr pointer attr_mb_order The s_mb_best_avail_max_trim_order is of type unsigned int, and has a range of values well beyond the normal use of the mb_order. Although the mballoc code is careful enough that large numbers don't matter there, but this can mislead the sysadmin into thinking that it's normal to set such values. Hence add a new attr_id attr_mb_order with values in the range [0, 64] to avoid storing garbage values and make us more resilient to surprises in the future. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	13df4d44a3	ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists() We can trigger a slab-out-of-bounds with the following commands: mkfs.ext4 -F /dev/$disk 10G mount /dev/$disk /tmp/test echo 2147483647 > /sys/fs/ext4/$disk/mb_group_prealloc echo test > /tmp/test/file && sync ================================================================== BUG: KASAN: slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] Read of size 8 at addr ffff888121b9d0f0 by task kworker/u2:0/11 CPU: 0 PID: 11 Comm: kworker/u2:0 Tainted: GL 6.7.0-next-20240118 #521 Call Trace: dump_stack_lvl+0x2c/0x50 kasan_report+0xb6/0xf0 ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4] ext4_mb_regular_allocator+0x19e9/0x2370 [ext4] ext4_mb_new_blocks+0x88a/0x1370 [ext4] ext4_ext_map_blocks+0x14f7/0x2390 [ext4] ext4_map_blocks+0x569/0xea0 [ext4] ext4_do_writepages+0x10f6/0x1bc0 [ext4] [...] ================================================================== The flow of issue triggering is as follows: // Set s_mb_group_prealloc to 2147483647 via sysfs ext4_mb_new_blocks ext4_mb_normalize_request ext4_mb_normalize_group_request ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc ext4_mb_regular_allocator ext4_mb_choose_next_group ext4_mb_choose_next_group_best_avail mb_avg_fragment_size_order order = fls(len) - 2 = 29 ext4_mb_find_good_group_avg_frag_lists frag_list = &sbi->s_mb_avg_fragment_size[order] if (list_empty(frag_list)) // Trigger SOOB! At 4k block size, the length of the s_mb_avg_fragment_size list is 14, but an oversized s_mb_group_prealloc is set, causing slab-out-of-bounds to be triggered by an attempt to access an element at index 29. Add a new attr_id attr_clusters_in_group with values in the range [0, sbi->s_clusters_per_group] and declare mb_group_prealloc as that type to fix the issue. In addition avoid returning an order from mb_avg_fragment_size_order() greater than MB_NUM_ORDERS(sb) and reduce some useless loops. Fixes: `7e170922f0` ("ext4: Add allocation criteria 1.5 (CR1_5)") CC: stable@vger.kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20240319113325.3110393-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	57341fe317	ext4: refactor out ext4_generic_attr_show() Refactor out the function ext4_generic_attr_show() to handle the reading of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	f536808adc	ext4: refactor out ext4_generic_attr_store() Refactor out the function ext4_generic_attr_store() to handle the setting of values of various common types, with no functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Baokun Li	9e8e819f8f	ext4: avoid overflow when setting values via sysfs When setting values of type unsigned int through sysfs, we use kstrtoul() to parse it and then truncate part of it as the final set value, when the set value is greater than UINT_MAX, the set value will not match what we see because of the truncation. As follows: $ echo 4294967296 > /sys/fs/ext4/sda/mb_max_linear_groups $ cat /sys/fs/ext4/sda/mb_max_linear_groups 0 So we use kstrtouint() to parse the attr_pointer_ui type to avoid the inconsistency described above. In addition, a judgment is added to avoid setting s_resv_clusters less than 0. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240319113325.3110393-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 23:48:30 -04:00
Max Kellermann	c77194965d	Revert "ext4: apply umask if ACL support is disabled" This reverts commit `484fd6c1de`. The commit caused a regression because now the umask was applied to symlinks and the fix is unnecessary because the umask/O_TMPFILE bug has been fixed somewhere else already. Fixes: https://lore.kernel.org/lkml/28DSITL9912E1.2LSZUVTGTO52Q@mforney.org/ Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Tested-by: Michael Forney <mforney@mforney.org> Link: https://lore.kernel.org/r/20240315142956.2420360-1-max.kellermann@ionos.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 18:25:39 -04:00
Al Viro	ead083aeee	set_blocksize(): switch to passing struct file * Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:39:44 -04:00
Al Viro	b85c42981a	btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens btrfs_get_bdev_and_sb() has two callers - btrfs_open_one_device(), which asks for open to be exclusive and btrfs_get_dev_args_from_path(), which doesn't. Currently it does set_blocksize() in all cases. I'm rather dubious about the need to do set_blocksize() anywhere in btrfs, to be honest - there's some access to page cache of underlying block devices in there, but it's nowhere near the hot paths, AFAICT. In any case, btrfs_get_dev_args_from_path() only needs to read the on-disk superblock and copy several fields out of it; all callers are only interested in devices that are already opened and brought into per-filesystem set, so setting the block size is redundant for those and actively harmful if we are given a pathname of unrelated device. So we only need btrfs_get_bdev_and_sb() to call set_blocksize() when it's asked to open exclusive. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-05-02 17:39:44 -04:00
Josef Bacik	e03418abde	btrfs: make sure that WRITTEN is set on all metadata blocks We previously would call btrfs_check_leaf() if we had the check integrity code enabled, which meant that we could only run the extended leaf checks if we had WRITTEN set on the header flags. This leaves a gap in our checking, because we could end up with corruption on disk where WRITTEN isn't set on the leaf, and then the extended leaf checks don't get run which we rely on to validate all of the item pointers to make sure we don't access memory outside of the extent buffer. However, since `732fab95ab` ("btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only ever call it on blocks that are being written out, and thus have WRITTEN set, or that are being read in, which should have WRITTEN set. Add checks to make sure we have WRITTEN set appropriately, and then make sure __btrfs_check_leaf() always does the item checking. This will protect us from file systems that have been corrupted and no longer have WRITTEN set on some of the blocks. This was hit on a crafted image tweaking the WRITTEN bit and reported by KASAN as out-of-bound access in the eb accessors. The example is a dir item at the end of an eb. [2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2 [2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI [2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f] [2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1 [2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0 [2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206 [2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0 [2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748 [2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9 [2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a [2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8 [2.621] FS: 00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000 [2.621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0 [2.621] Call Trace: [2.621] <TASK> [2.621] ? show_regs+0x74/0x80 [2.621] ? die_addr+0x46/0xc0 [2.621] ? exc_general_protection+0x161/0x2a0 [2.621] ? asm_exc_general_protection+0x26/0x30 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? btrfs_get_16+0x34b/0x6d0 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? __pfx_btrfs_get_16+0x10/0x10 [2.621] ? __pfx_mutex_unlock+0x10/0x10 [2.621] btrfs_match_dir_item_name+0x101/0x1a0 [2.621] btrfs_lookup_dir_item+0x1f3/0x280 [2.621] ? __pfx_btrfs_lookup_dir_item+0x10/0x10 [2.621] btrfs_get_tree+0xd25/0x1910 Reported-by: lei lu <llfamsec@gmail.com> CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from report ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-02 22:11:13 +02:00
Qu Wenruo	b5357cb268	btrfs: qgroup: do not check qgroup inherit if qgroup is disabled [BUG] After kernel commit `86211eea8a` ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter"), user space tool snapper will fail to create snapshot using its timeline feature. [CAUSE] It turns out that, if using timeline snapper would unconditionally pass btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0) for snapshot creation. In that case, since qgroup is disabled there would be no qgroup 1/0, and btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole snapshot creation. [FIX] Just skip the check if qgroup is not enabled. This is to keep the older behavior for user space tools, as if the kernel behavior changed for user space, it is a regression of kernel. Thankfully snapper is also fixing the behavior by detecting if qgroup is running in the first place, so the effect should not be that huge. Link: https://github.com/openSUSE/snapper/issues/894 Fixes: `86211eea8a` ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-05-02 21:30:14 +02:00
Kent Overstreet	fb092d4072	ext4: add support for FS_IOC_GETFSSYSFSPATH The new sysfs path ioctl lets us get the /sys/fs/ path for a given filesystem in a fs agnostic way, potentially nudging us towards standarizing some of our reporting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Link: https://lore.kernel.org/r/20240315035308.3563511-4-kent.overstreet@linux.dev Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 15:21:23 -04:00
Jakub Kicinski	e958da0ddb	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c `66e13b615a` ("bpf: verifier: prevent userspace memory access") `d503a04f8b` ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/20240429114939.210328b0@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-05-02 12:06:25 -07:00
Jan Kara	35a1f12f0c	ext4: avoid excessive credit estimate in ext4_tmpfile() A user with minimum journal size (1024 blocks these days) complained about the following error triggered by generic/697 test in ext4_tmpfile(): run fstests generic/697 at 2024-02-28 05:34:46 JBD2: vfstest wants too many credits credits:260 rsv_credits:0 max:256 EXT4-fs error (device loop0) in __ext4_new_inode:1083: error 28 Indeed the credit estimate in ext4_tmpfile() is huge. EXT4_MAXQUOTAS_INIT_BLOCKS() is 219, then 10 credits from ext4_tmpfile() itself and then ext4_xattr_credits_for_new_inode() adds more credits needed for security attributes and ACLs. Now the EXT4_MAXQUOTAS_INIT_BLOCKS() is in fact unnecessary because we've already initialized quotas with dquot_init() shortly before and so EXT4_MAXQUOTAS_TRANS_BLOCKS() is enough (which boils down to 3 credits). Fixes: `af51a2ac36` ("ext4: ->tmpfile() support") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Luis Henriques <lhenriques@suse.de> Tested-by: Disha Goel <disgoel@linux.ibm.com> Link: https://lore.kernel.org/r/20240307115320.28949-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 14:49:16 -04:00
Thorsten Blum	ea7d09ad7c	ext4: remove unneeded if checks before kfree kfree already checks if its argument is NULL. This fixes two Coccinelle/coccicheck warnings reported by ifnullfree.cocci. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20240317153638.2136-2-thorsten.blum@toblux.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 14:44:51 -04:00
Miklos Szeredi	096802748e	ovl: remove upper umask handling from ovl_create_upper() This is already done by vfs_prepare_mode() when creating the upper object by vfs_create(), vfs_mkdir() and vfs_mknod(). No regressions have been observed in xfstests run with posix acls turned off for the upper filesystem. Fixes: `1639a49ccd` ("fs: move S_ISGID stripping into the vfs_*() helpers") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2024-05-02 20:35:57 +02:00
Miklos Szeredi	9a87907de3	ovl: implement tmpfile Combine inode creation with opening a file. There are six separate objects that are being set up: the backing inode, dentry and file, and the overlay inode, dentry and file. Cleanup in case of an error is a bit of a challenge and is difficult to test, so careful review is needed. All tmpfile testcases except generic/509 now run/pass, and no regressions are observed with full xfstests. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com>	2024-05-02 20:35:57 +02:00
Linus Torvalds	f03359bca0	for-6.9-rc6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYzivoACgkQxWXV+ddt WDu4TxAAgK+W1RSvrc2xe6MfHFMi2x2pL2qM0IEcYbmjNZJDQlmGYNj3jILho62/ /mHyA5skMr9hN58FFUJveiBj3qOds/lZD0640sGGpysFJKzA4/Wdg5xJvpsQtyDM jr6BcgZOQ+j7Pqe7zsm/sc0n5yG4P+cydnlCFMNvpRfZjg1kYIV9F92qEPAHtLCx BoDJyHhCEqFWWyH2nALu3syTHyvGECUCBEHLFgyGcG/IXT6Oq/BpsDZPm1j72NCt 9f58OY7/2R9QJYfCjYidFGnr3EYdI5CnCOtR2sQLcRUOISOOQSni52r5tonPdpm2 7QRPyuXTiVxpM909phGJt5wwyssK/JQgxUjUo3s0U04+qXb3cRoJny3vAcGcnuyk W7lYh08QRQa3dzZ/Q+GFxqPPovdZalTHXYMAYP7QGwLuv+fZkqh39oz6LQfw7F7c JxEjuSCSd8lJpFyIDkirZF9lELurjgt0Zn3RNe25BLiBpeqFvTQdAYGo5wML3Ug0 kHSmZVFC2En8Ad2AahpkGToVKGgUumo4RAZDiRGIUaHEoS7XfBbnPOAtC7Z1RKTS 9N++XVtJ1/uYQiLM5afiZRtUTkA/jqjSNH/v3YYTS18SczKEOWlHnpJeQSWK0rD1 rzbKZ+2MhBL5CGQnwkhUi0u07QorvMkQhWCHpf9au9rtUggg+nU= =zEs6 -----END PGP SIGNATURE----- Merge tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - set correct ram_bytes when splitting ordered extent. This can be inconsistent on-disk but harmless as it's not used for calculations and it's only advisory for compression - fix lockdep splat when taking cleaner mutex in qgroups disable ioctl - fix missing mutex unlock on error path when looking up sys chunk for relocation * tag 'for-6.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: set correct ram_bytes when splitting ordered extent btrfs: take the cleaner_mutex earlier in qgroup disable btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks()	2024-05-02 10:49:12 -07:00
Matthew Wilcox (Oracle)	75377ae754	gfs2: Simplify gfs2_read_super Use submit_bio_wait() instead of hand-rolling our own synchronous wait. Also allocate the BIO on the stack since we're not deep in the call stack at this point. There's no need to kmap the page, since it isn't allocated from HIGHMEM. Turn the GFP_NOFS allocation into GFP_KERNEL; if the page allocator enters reclaim, we cannot be called as the filesystem has not yet been initialised and so has no pages to reclaim. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-05-02 19:24:08 +02:00
Christoph Hellwig	a0c7cce824	ext4: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method Since commit `a2ad63daa8` ("VFS: add FMODE_CAN_ODIRECT file flag") file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> [RH: Rebased to upstream] Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/e5797bb597219a49043e53e4e90aa494b97dc328.1709215665.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:53:32 -04:00
Ritesh Harjani (IBM)	53c17fe55a	ext4: Remove PAGE_MASK dependency on mpage_submit_folio This patch simply removes the PAGE_MASK dependency since mpage_submit_folio() is already converted to work with folio. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/d6eadb090334ea49ceef4e643b371fabfcea328f.1709182251.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:50:44 -04:00
Ritesh Harjani (IBM)	c2a09f3d78	ext4: Fixes len calculation in mpage_journal_page_buffers Truncate operation can race with writeback, in which inode->i_size can get truncated and therefore size - folio_pos() can be negative. This fixes the len calculation. However this path doesn't get easily triggered even with data journaling. Cc: stable@kernel.org # v6.5 Fixes: `80be8c5cc9` ("Fixes: ext4: Make mpage_journal_page_buffers use folio") Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/cff4953b5c9306aba71e944ab176a5d396b9a1b7.1709182250.git.ritesh.list@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2024-05-02 10:50:44 -04:00
Darrick J. Wong	1a3f1afb25	xfs: widen flags argument to the xfs_iflags_* helpers xfs_inode.i_flags is an unsigned long, so make these helpers take that as the flags argument instead of unsigned short. This is needed for the next patch. While we're at it, remove the iflags variable from xfs_iget_cache_miss because we no longer need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>	2024-05-02 07:48:37 -07:00
Darrick J. Wong	3791a05329	xfs: minor cleanups of xfs_attr3_rmt_blocks Clean up the type signature of this function since we don't have negative attr lengths or block counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-05-02 07:48:37 -07:00
Darrick J. Wong	204a26aa1d	xfs: create a helper to compute the blockcount of a max sized remote value Create a helper function to compute the number of fsblocks needed to store a maximally-sized extended attribute value. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-05-02 07:48:36 -07:00
Darrick J. Wong	a5714b67ca	xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function Turn this into a properly typechecked function, and actually use the correct blocksize for extended attributes. The function cannot be static inline because xfsprogs userspace uses it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-05-02 07:48:36 -07:00
Darrick J. Wong	a86f8671d0	xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c In the next few patches we're going to refactor the attr remote code so that we can support headerless remote xattr values for storing merkle tree blocks. For now, let's change the code to use unsigned int to describe quantities of bytes and blocks that cannot be negative. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-05-02 07:48:35 -07:00
Christophe JAILLET	e035af9f6e	seq_file: Simplify __seq_puts() Change the implementation of the out-of-line __seq_puts() to simply be a seq_write() call instead of duplicating the overflow/memcpy logic. Suggested-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/7cebc1412d8d1338a7e52cc9291d00f5368c14e4.1713781332.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-02 16:28:20 +02:00
Christophe JAILLET	45751097ae	seq_file: Optimize seq_puts() Most of seq_puts() usages are done with a string literal. In such cases, the length of the string car be computed at compile time in order to save a strlen() call at run-time. seq_putc() or seq_write() can then be used instead. This saves a few cycles. To have an estimation of how often this optimization triggers: $ git grep seq_puts.\" \| wc -l 3436 $ git grep seq_puts.\".\" \| wc -l 84 Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/a8589bffe4830dafcb9111e22acf06603fea7132.1713781332.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christian Brauner <brauner@kernel.org> The output for seq_putc() generation has also be checked and works.	2024-05-02 16:28:15 +02:00
Tyler Hicks (Microsoft)	0a960ba498	proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation The following commits loosened the permissions of /proc/<PID>/fdinfo/ directory, as well as the files within it, from 0500 to 0555 while also introducing a PTRACE_MODE_READ check between the current task and <PID>'s task: - commit `7bc3fa0172` ("procfs: allow reading fdinfo with PTRACE_MODE_READ") - commit `1927e498ae` ("procfs: prevent unprivileged processes accessing fdinfo dir") Before those changes, inode based system calls like inotify_add_watch(2) would fail when the current task didn't have sufficient read permissions: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0500, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] This matches the documented behavior in the inotify_add_watch(2) man page: ERRORS EACCES Read access to the given file is not permitted. After those changes, inotify_add_watch(2) started succeeding despite the current task not having PTRACE_MODE_READ privileges on the target task: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = 1757 openat(AT_FDCWD, "/proc/1/task/1/fdinfo", O_RDONLY\|O_NONBLOCK\|O_CLOEXEC\|O_DIRECTORY) = -1 EACCES (Permission denied) [...] This change in behavior broke .NET prior to v7. See the github link below for the v7 commit that inadvertently/quietly (?) fixed .NET after the kernel changes mentioned above. Return to the old behavior by moving the PTRACE_MODE_READ check out of the file .open operation and into the inode .permission operation: [...] lstat("/proc/1/task/1/fdinfo", {st_mode=S_IFDIR\|0555, st_size=0, ...}) = 0 inotify_add_watch(64, "/proc/1/task/1/fdinfo", IN_MODIFY\|IN_ATTRIB\|IN_MOVED_FROM\|IN_MOVED_TO\|IN_CREATE\|IN_DELETE\| IN_ONLYDIR\|IN_DONT_FOLLOW\|IN_EXCL_UNLINK) = -1 EACCES (Permission denied) [...] Reported-by: Kevin Parsons (Microsoft) <parsonskev@gmail.com> Link: `89e5469ac5` Link: https://stackoverflow.com/questions/75379065/start-self-contained-net6-build-exe-as-service-on-raspbian-system-unauthorizeda Fixes: `7bc3fa0172` ("procfs: allow reading fdinfo with PTRACE_MODE_READ") Cc: stable@vger.kernel.org Cc: Christian Brauner <brauner@kernel.org> Cc: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Hardik Garg <hargar@linux.microsoft.com> Cc: Allen Pais <apais@linux.microsoft.com> Signed-off-by: Tyler Hicks (Microsoft) <code@tyhicks.com> Link: https://lore.kernel.org/r/20240501005646.745089-1-code@tyhicks.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-02 11:42:04 +02:00
David Howells	7c1ac89480	cifs: Enable large folio support Now that cifs is using netfslib for its VM interaction, it only sees I/O in terms of iov_iter iterators and does not see pages or folios. This makes large multipage folios transparent to cifs and so we can turn on multipage folios on regular files. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:22 +01:00
David Howells	b593634424	cifs: Remove some code that's no longer used, part 3 Remove some code that was #if'd out with the netfslib conversion. This is split into parts for file.c as the diff generator otherwise produces a hard to read diff for part of it where a big chunk is cut out. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:22 +01:00
David Howells	2f99c0bce6	cifs: Remove some code that's no longer used, part 2 Remove some code that was #if'd out with the netfslib conversion. This is split into parts for file.c as the diff generator otherwise produces a hard to read diff for part of it where a big chunk is cut out. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:22 +01:00
David Howells	742b3443e2	cifs: Remove some code that's no longer used, part 1 Remove some code that was #if'd out with the netfslib conversion. This is split into parts for file.c as the diff generator otherwise produces a hard to read diff for part of it where a big chunk is cut out. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:21 +01:00
David Howells	3ee1a1fc39	cifs: Cut over to using netfslib Make the cifs filesystem use netfslib to handle reading and writing on behalf of cifs. The changes include: (1) Various read_iter/write_iter type functions are turned into wrappers around netfslib API functions or are pointed directly at those functions: cifs_file_direct{,_nobrl}_ops switch to use netfs_unbuffered_read_iter and netfs_unbuffered_write_iter. Large pieces of code that will be removed are #if'd out and will be removed in subsequent patches. [?] Why does cifs mark the page dirty in the destination buffer of a DIO read? Should that happen automatically? Does netfs need to do that? Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:21 +01:00
David Howells	69c3c023af	cifs: Implement netfslib hooks Provide implementation of the netfslib hooks that will be used by netfslib to ask cifs to set up and perform operations. Of particular note are () cifs_clamp_length() - This is used to negotiate the size of the next subrequest in a read request, taking into account the credit available and the rsize. The credits are attached to the subrequest. () cifs_req_issue_read() - This is used to issue a subrequest that has been set up and clamped. () cifs_prepare_write() - This prepares to fill a subrequest by picking a channel, reopening the file and requesting credits so that we can set the maximum size of the subrequest and also sets the maximum number of segments if we're doing RDMA. () cifs_issue_write() - This releases any unneeded credits and issues an asynchronous data write for the contiguous slice of file covered by the subrequest. This should possibly be folded in to all ->async_writev() ops and that called directly. () cifs_begin_writeback() - This gets the cached writable handle through which we do writeback (this does not affect writethrough, unbuffered or direct writes). At this point, cifs is not wired up to actually use* netfslib; that will be done in a subsequent patch. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:21 +01:00
David Howells	c20c0d7325	cifs: Make add_credits_and_wake_if() clear deducted credits Make add_credits_and_wake_if() clear the amount of credits in the cifs_credits struct after it has returned them to the overall counter. This allows add_credits_and_wake_if() to be called multiple times during the error handling and cleanup without accidentally returning the credits again and again. Note that the wake_up() in add_credits_and_wake_if() may also be superfluous as ->add_credits() also does a wake on the request_q. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:08:21 +01:00
David Howells	edea94a697	cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs Add mempools for the allocation of cifs_io_request and cifs_io_subrequest structs for netfslib to use so that it can guarantee eventual allocation in writeback. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:20 +01:00
David Howells	3758c485f6	cifs: Set zero_point in the copy_file_range() and remap_file_range() Set zero_point in the copy_file_range() and remap_file_range() implementations so that we don't skip reading data modified on a server-side copy. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:20 +01:00
David Howells	1a5b4edd97	cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c so that they are colocated with similar functions rather than being split with cifsfs.c. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:20 +01:00
David Howells	dc5939de82	cifs: Replace the writedata replay bool with a netfs sreq flag Replace the 'replay' bool in cifs_writedata (now cifs_io_subrequest) with a flag in the netfs_io_subrequest flags. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:19 +01:00
David Howells	56257334e8	cifs: Make wait_mtu_credits take size_t args Make the wait_mtu_credits functions use size_t for the size and num arguments rather than unsigned int as netfslib uses size_t/ssize_t for arguments and return values to allow for extra capacity. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: linux-cachefs@redhat.com cc: netfs@lists.linux.dev cc: linux-mm@kvack.org	2024-05-01 18:08:19 +01:00
David Howells	ab58fbdeeb	cifs: Use more fields from netfs_io_subrequest Use more fields from netfs_io_subrequest instead of those incorporated into cifs_io_subrequest from cifs_readdata and cifs_writedata. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:19 +01:00
David Howells	a975a2f22c	cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest Replace the cifs_writedata struct with the same wrapper around netfs_io_subrequest that was used to replace cifs_readdata. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:18 +01:00
David Howells	753b67eb63	cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest Netfslib has a facility whereby the allocation for netfs_io_subrequest can be increased to so that filesystem-specific data can be tagged on the end. Prepare to use this by making a struct, cifs_io_subrequest, that wraps netfs_io_subrequest, and absorb struct cifs_readdata into it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:08:18 +01:00
David Howells	0f7c0f3f51	cifs: Use alternative invalidation to using launder_folio Use writepages-based flushing invalidation instead of invalidate_inode_pages2() and ->launder_folio(). This will allow ->launder_folio() to be removed eventually. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:08:18 +01:00
David Howells	1ecb146f7c	netfs, afs: Use writeback retry to deal with alternate keys Use a hook in the new writeback code's retry algorithm to rotate the keys once all the outstanding subreqs have failed rather than doing it separately on each subreq. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:38 +01:00
David Howells	d41ca44c20	netfs: Miscellaneous tidy ups Do a couple of miscellaneous tidy ups: (1) Add a qualifier into a file banner comment. (2) Put the writeback folio traces back into alphabetical order. (3) Remove some unused folio traces. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:38 +01:00
David Howells	c245868524	netfs: Remove the old writeback code Remove the old writeback code. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:38 +01:00
David Howells	2df86547b2	netfs: Cut over to using new writeback code Cut over to using the new writeback code. The old code is #ifdef'd out or otherwise removed from compilation to avoid conflicts and will be removed in a future patch. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:37 +01:00
David Howells	64e64e6c18	netfs, cachefiles: Implement helpers for new write code Implement the helpers for the new write code in cachefiles. There's now an optional ->prepare_write() that allows the filesystem to set the parameters for the next write, such as maximum size and maximum segment count, and an ->issue_write() that is called to initiate an (asynchronous) write operation. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:37 +01:00
David Howells	5fb70e7275	netfs, 9p: Implement helpers for new write code Implement the helpers for the new write code in 9p. There's now an optional ->prepare_write() that allows the filesystem to set the parameters for the next write, such as maximum size and maximum segment count, and an ->issue_write() that is called to initiate an (asynchronous) write operation. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: v9fs@lists.linux.dev cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:37 +01:00
David Howells	ed22e1dbf8	netfs, afs: Implement helpers for new write code Implement the helpers for the new write code in afs. There's now an optional ->prepare_write() that allows the filesystem to set the parameters for the next write, such as maximum size and maximum segment count, and an ->issue_write() that is called to initiate an (asynchronous) write operation. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:36 +01:00
David Howells	4824e5917f	netfs: Add some write-side stats and clean up some stat names Add some write-side stats to count buffered writes, buffered writethrough, and writepages calls. Whilst we're at it, clean up the naming on some of the existing stats counters and organise the output into two sets. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:36 +01:00
David Howells	288ace2f57	netfs: New writeback implementation The current netfslib writeback implementation creates writeback requests of contiguous folio data and then separately tiles subrequests over the space twice, once for the server and once for the cache. This creates a few issues: (1) Every time there's a discontiguity or a change between writing to only one destination or writing to both, it must create a new request. This makes it harder to do vectored writes. (2) The folios don't have the writeback mark removed until the end of the request - and a request could be hundreds of megabytes. (3) In future, I want to support a larger cache granularity, which will require aggregation of some folios that contain unmodified data (which only need to go to the cache) and some which contain modifications (which need to be uploaded and stored to the cache) - but, currently, these are treated as discontiguous. There's also a move to get everyone to use writeback_iter() to extract writable folios from the pagecache. That said, currently writeback_iter() has some issues that make it less than ideal: (1) there's no way to cancel the iteration, even if you find a "temporary" error that means the current folio and all subsequent folios are going to fail; (2) there's no way to filter the folios being written back - something that will impact Ceph with it's ordered snap system; (3) and if you get a folio you can't immediately deal with (say you need to flush the preceding writes), you are left with a folio hanging in the locked state for the duration, when really we should unlock it and relock it later. In this new implementation, I use writeback_iter() to pump folios, progressively creating two parallel, but separate streams and cleaning up the finished folios as the subrequests complete. Either or both streams can contain gaps, and the subrequests in each stream can be of variable size, don't need to align with each other and don't need to align with the folios. Indeed, subrequests can cross folio boundaries, may cover several folios or a folio may be spanned by multiple folios, e.g.: +---+---+-----+-----+---+----------+ Folios: \| \| \| \| \| \| \| +---+---+-----+-----+---+----------+ +------+------+ +----+----+ Upload: \| \| \|.....\| \| \| +------+------+ +----+----+ +------+------+------+------+------+ Cache: \| \| \| \| \| \| +------+------+------+------+------+ The progressive subrequest construction permits the algorithm to be preparing both the next upload to the server and the next write to the cache whilst the previous ones are already in progress. Throttling can be applied to control the rate of production of subrequests - and, in any case, we probably want to write them to the server in ascending order, particularly if the file will be extended. Content crypto can also be prepared at the same time as the subrequests and run asynchronously, with the prepped requests being stalled until the crypto catches up with them. This might also be useful for transport crypto, but that happens at a lower layer, so probably would be harder to pull off. The algorithm is split into three parts: (1) The issuer. This walks through the data, packaging it up, encrypting it and creating subrequests. The part of this that generates subrequests only deals with file positions and spans and so is usable for DIO/unbuffered writes as well as buffered writes. (2) The collector. This asynchronously collects completed subrequests, unlocks folios, frees crypto buffers and performs any retries. This runs in a work queue so that the issuer can return to the caller for writeback (so that the VM can have its kswapd thread back) or async writes. (3) The retryer. This pauses the issuer, waits for all outstanding subrequests to complete and then goes through the failed subrequests to reissue them. This may involve reprepping them (with cifs, the credits must be renegotiated, and a subrequest may need splitting), and doing RMW for content crypto if there's a conflicting change on the server. [!] Note that some of the functions are prefixed with "new_" to avoid clashes with existing functions. These will be renamed in a later patch that cuts over to the new algorithm. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:36 +01:00
David Howells	7ba167c4c7	netfs: Switch to using unsigned long long rather than loff_t Switch to using unsigned long long rather than loff_t in netfslib to avoid problems with the sign flipping in the maths when we're dealing with the byte at position 0x7fffffffffffffff. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:35 +01:00
David Howells	d9f85a04fb	netfs: Use mempools for allocating requests and subrequests Use mempools for allocating requests and subrequests in an effort to make sure that allocation always succeeds so that when performing writeback we can always make progress. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-05-01 18:07:35 +01:00
David Howells	b4ff7b178b	netfs: Remove ->launder_folio() support Remove support for ->launder_folio() from netfslib and expect filesystems to use filemap_invalidate_inode() instead. netfs_launder_folio() can then be got rid of. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Steve French <sfrench@samba.org> cc: Matthew Wilcox <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: devel@lists.orangefs.org	2024-05-01 18:07:34 +01:00
David Howells	d73065e60d	afs: Use alternative invalidation to using launder_folio Use writepages-based flushing invalidation instead of invalidate_inode_pages2() and ->launder_folio(). This will allow ->launder_folio() to be removed eventually. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Jeff Layton <jlayton@kernel.org> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:34 +01:00
David Howells	40fb4828d5	9p: Use alternative invalidation to using launder_folio Use writepages-based flushing invalidation instead of invalidate_inode_pages2() and ->launder_folio(). This will allow ->launder_folio() to be removed eventually. Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Jeff Layton <jlayton@kernel.org> cc: v9fs@lists.linux.dev cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-05-01 18:07:34 +01:00
David Howells	74e797d79c	mm: Provide a means of invalidation without using launder_folio Implement a replacement for launder_folio. The key feature of invalidate_inode_pages2() is that it locks each folio individually, unmaps it to prevent mmap'd accesses interfering and calls the ->launder_folio() address_space op to flush it. This has problems: firstly, each folio is written individually as one or more small writes; secondly, adjacent folios cannot be added so easily into the laundry; thirdly, it's yet another op to implement. Instead, use the invalidate lock to cause anyone wanting to add a folio to the inode to wait, then unmap all the folios if we have mmaps, then, conditionally, use ->writepages() to flush any dirty data back and then discard all pages. The invalidate lock prevents ->read_iter(), ->write_iter() and faulting through mmap all from adding pages for the duration. This is then used from netfslib to handle the flusing in unbuffered and direct writes. Signed-off-by: David Howells <dhowells@redhat.com> cc: Matthew Wilcox <willy@infradead.org> cc: Miklos Szeredi <miklos@szeredi.hu> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Christoph Hellwig <hch@lst.de> cc: Andrew Morton <akpm@linux-foundation.org> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: devel@lists.orangefs.org	2024-05-01 18:07:06 +01:00
Qu Wenruo	63a6ce5a1a	btrfs: set correct ram_bytes when splitting ordered extent [BUG] When running generic/287, the following file extent items can be generated: item 16 key (258 EXTENT_DATA 2682880) itemoff 15305 itemsize 53 generation 9 type 1 (regular) extent data disk byte 1378414592 nr 462848 extent data offset 0 nr 462848 ram 2097152 extent compression 0 (none) Note that file extent item is not a compressed one, but its ram_bytes is way larger than its disk_num_bytes. According to btrfs on-disk scheme, ram_bytes should match disk_num_bytes if it's not a compressed one. [CAUSE] Since commit `b73a6fd1b1` ("btrfs: split partial dio bios before submit"), for partial dio writes, we would split the ordered extent. However the function btrfs_split_ordered_extent() doesn't update the ram_bytes even it has already shrunk the disk_num_bytes. Originally the function btrfs_split_ordered_extent() is only introduced for zoned devices in commit `d22002fd37` ("btrfs: zoned: split ordered extent when bio is sent"), but later commit `b73a6fd1b1` ("btrfs: split partial dio bios before submit") makes non-zoned btrfs affected. Thankfully for un-compressed file extent, we do not really utilize the ram_bytes member, thus it won't cause any real problem. [FIX] Also update btrfs_ordered_extent::ram_bytes inside btrfs_split_ordered_extent(). Fixes: `d22002fd37` ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-04-30 12:03:44 +02:00
Christoph Hellwig	21255afdd7	xfs: do not allocate the entire delalloc extent in xfs_bmapi_write While trying to convert the entire delalloc extent is a good decision for regular writeback as it leads to larger contigous on-disk extents, but for other callers of xfs_bmapi_write is is rather questionable as it forced them to loop creating new transactions just in case there is no large enough contiguous extent to cover the whole delalloc reservation. Change xfs_bmapi_write to only allocate the passed in range instead, whіle the writeback path through xfs_bmapi_convert_delalloc and xfs_bmapi_allocate still always converts the full extents. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	d69bee6a35	xfs: fix xfs_bmap_add_extent_delay_real for partial conversions xfs_bmap_add_extent_delay_real takes parts or all of a delalloc extent and converts them to a real extent. It is written to deal with any potential overlap of the to be converted range with the delalloc extent, but it turns out that currently only converting the entire extents, or a part starting at the beginning is actually exercised, as the only caller always tries to convert the entire delalloc extent, and either succeeds or at least progresses partially from the start. If it only converts a tiny part of a delalloc extent, the indirect block calculation for the new delalloc extent (da_new) might be equivalent to that of the existing delalloc extent (da_old). If this extent conversion now requires allocating an indirect block that gets accounted into da_new, leading to the assert that da_new must be smaller or equal to da_new unless we split the extent to trigger. Except for the assert that case is actually handled by just trying to allocate more space, as that already handled for the split case (which currently can't be reached at all), so just reusing it should be fine. Except that without dipping into the reserved block pool that would make it a bit too easy to trigger a fs shutdown due to ENOSPC. So in addition to adjusting the assert, also dip into the reserved block pool. Note that I could only reproduce the assert with a change to only convert the actually asked range instead of the full delalloc extent from xfs_bmapi_write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	a8bb258f70	xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate Both callers of xfs_bmapi_allocate already initialize bma->prev, don't redo that in xfs_bmapi_allocate. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	2a9b99d45b	xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate xfs_bmapi_allocate currently overwrites offset and len when converting delayed allocations, and duplicates the length cap done for non-delalloc allocations. Move all that logic into the callers to avoid duplication and to make the calling conventions more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	9d06960341	xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write XFS_FILBLKS_MIN uses min_t and thus does the comparison using the correct xfs_filblks_t type. Use it in xfs_bmapi_write and slightly adjust the comment document th potential pitfall to take account of this Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	04c609e6e5	xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate xfs_bmapi_convert_delalloc has a xfs_valid_startblock check on the block allocated by xfs_bmapi_allocate. Lift it into xfs_bmapi_allocate as we should assert the same for xfs_bmapi_write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	b11ed354c9	xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate tmp_logflags is initialized to 0 and then ORed into bma->logflags, which isn't actually doing anything. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:19 +05:30
Christoph Hellwig	6773da870a	xfs: fix error returns from xfs_bmapi_write xfs_bmapi_write can return 0 without actually returning a mapping in mval in two different cases: 1) when there is absolutely no space available to do an allocation 2) when converting delalloc space, and the allocation is so small that it only covers parts of the delalloc extent before the range requested by the caller Callers at best can handle one of these cases, but in many cases can't cope with either one. Switch xfs_bmapi_write to always return a mapping or return an error code instead. For case 1) above ENOSPC is the obvious choice which is very much what the callers expect anyway. For case 2) there is no really good error code, so pick a funky one from the SysV streams portfolio. This fixes the reproducer here: https://lore.kernel.org/linux-xfs/CAEJPjCvT3Uag-pMTYuigEjWZHn1sGMZ0GCjVVCv29tNHK76Cgg@mail.gmail.com0/ which uses reserved blocks to create file systems that are gravely out of space and thus cause at least xfs_file_alloc_space to hang and trigger the lack of ENOSPC handling in xfs_dquot_disk_alloc. Note that this patch does not actually make any caller but xfs_alloc_file_space deal intelligently with case 2) above. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: 刘通 <lyutoon@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-30 09:45:18 +05:30
Linus Torvalds	a91bae8794	nfsd-6.9 fixes: - Avoid freeing unallocated memory (v6.7 regression) -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYv/dwACgkQM2qzM29m f5dcFQ//R5MRfOwLFqvNO2wwh750ijbR9tlUn9ce+PVQRpM1m1AhcDD9h98ZxNjn 8rwxMPTJVd0E8GV5BG5B7vaYYQySuU6C9LAbDDFL1mWOSf2VFIY8U6YntqasQ77K HGXdFhHKlCLCwIGTvQgJ4Dnu6Ds2mjm83wH9IZ0SXdwHZsr5MeSiOmtpjDZn/Efm isx+OBjf4uYEwo02qzMzdX5mmj2eXDwincjK9dEUXYn6yyzIDEwo/wUDtzsaxuui TKm1Ni0uul8idukNOH+Clrp+sSDN+bR57mLhBZuToGxBbdMLaHBHOABYSX5BIDKv YCHYaIFrya9dtcZdTk5sIEj94hSwl63PP37RZTS/o/djMNWYNkiUmay+UtRpDkk2 xbzx+C2hgb5OeGiI4ZDBrcIDi5bf5fIAT6yE8hfL853phloH15jlup9dnZQlB4B3 PsdQEdJZY3uoOQSJ31/okDRFSxR98xClZxZX+N7zdXWcYtP1LXUk+vNTt44riw8C i3MBBWF2kxsGQ/r5J0phFRKunWCRjUQD36co4m7hJUYHqdJq2jNnSvk6O6F4R+yv IRQOs+qpX2TdMkNJ2jGaV8d4pt0f5bMLPeh3vl3Smf1xzfX+X0E4trtXTsw0nIrD m52ZV93Zag3Q4SkQ3ir5r/lfqX8sybJbVqZba85abtiqSSDIsFY= =GqdB -----END PGP SIGNATURE----- Merge tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Avoid freeing unallocated memory (v6.7 regression) * tag 'nfsd-6.9-6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix nfsd4_encode_fattr4() crasher	2024-04-29 14:22:24 -07:00
Linus Torvalds	9e4bc4bcae	NFS client bugfixes for Linux 6.9 Bugfixes: - Fix an Oops in xs_tcp_tls_setup_socket - Fix an Oops due to missing error handling in nfs_net_init() -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmYv7PEACgkQZwvnipYK APKcgA//XE7iII6qQU9IC6jiv44qO7NtB6Zy3PQFCWO2ssMSqXbc4lO2eCmR/1nA 3Mlf1RwPxp+M+iOqFgANZV7voQ/r6djEyM1ycr+J2G/mfoxmKMVmnvg3lcyAfYNj 3fm6n0t8ZCkb3URoO4K0ejw007QfN2zpCL2psKucBdahOX7OYHT45o02liKN8ge5 0h7XKUjDStKsId4y5UVNB+QUeaQaWzKKMCzTzX4CxfHXZpIjbDjkdJ9WxAue2+Th 5yvNtPKkTi9EYwHjFAgN7ZKAC7Gu+jZXs9ewdqdyaSfYlioGk7ALz2ZZSptKb5zr +nwuR+SC4Yzg5uBdjOLXS7/6Z/CVyp2bmgoFAzrP8cC0zfB7wJMNwUTZbUx3d823 Q7xYwecj1F9PE5CjHJlYpFZiKkiMHB242EFmBrcUR+yoRPHRgEg32UpI1/YWYnVO pG+Hto0O8JnlGzkqslKy/qN7OMFgNTli+nrFnJT7TxDk27GpOY3gv271164TgeUt MOk7iY6QjDqk6Zpzbg5AMcq6UhB+QUe76XAAFtDvjXLLBbsRckmbNTBbkFv6/O3p a3bOd7oeugNIaRJJaR/lQ/EVAoBSeNUUj5G1ivoXDWsNXE+ZKNNdYyrCociVgI6S j31k3XtbGVMk8M/7Lu5slPOGL4xqDtJAddF0eE7buQhRZNX7pPA= =OtQc -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs Pull NFS client fixes from Trond Myklebust: - Fix an Oops in xs_tcp_tls_setup_socket - Fix an Oops due to missing error handling in nfs_net_init() * tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Handle error of rpc_proc_register() in nfs_net_init(). SUNRPC: add a missing rpc_stat for TCP TLS	2024-04-29 12:07:37 -07:00
Linus Torvalds	0a2e230514	bcachefs fixes for 6.9-rc7 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYvyT4ACgkQE6szbY3K bnYbOA/8Cot9WqlyXy+OTm6kVx4S2atMruZ/cj+3jQR+dZbMv375YDBuMiqtg/vW Cd2zjnRwoDNGxueqpjK+A66eEHm6tcVFQ7zp+Nb5FeEHAMQcxi/3PpqxkM7zjX9q YQ/dUiWae/oCuM/u50/Hw3xn9VkmXtahmfTjFHe6xqfbrZf0HGyNOGBgMInI8Wa6 tCABISS2FAe/+8D5UzL0sxLbqdvfjeyFHbOfZiGhdTPCDsUDKTwZHsOnV9Z37y3W 6Ne1sduiEPRJl6U6faTuhgXqKg6oBBcjjtWvXPAMYxLn4/zt79sY5mwTaCAAI733 pq/WBwD0EESFCIUeZAA5vEQB2hvIOy9fdaO9tIxWrnXlbEfoZdbQNzL0Kr9//jpf qXz52nLp17rdJ5bsFqR+zITSckziud5p9Yqsp+zJxLda/VjteXG3efMG2giK9GJW yoVVgQOY/P3T9t1bgJtitu21kjCCv7vIxTjFjmIKIspUk+hpKk2W0Xv4ZSVymXP6 AgxFGgGulMm67+V+YTaSo9AspobDYwrW5B645e5wNNpBG/igikwUgKh339lah6zt J27IUTTRIAof53uVT2O1AOMu6iWIw3DfMd3kvvLpuoG1032HcDEUfQForUD+6AwS 3b3BuMeXNrZ60Oj8/tdobxN2vnQTF8+kOdma8/IJ3KYXtT3ccTA= =oGU4 -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-04-29' of https://evilpiepirate.org/git/bcachefs Pull bcachefs fixes from Kent Overstreet: "Tiny set of fixes this time" * tag 'bcachefs-2024-04-29' of https://evilpiepirate.org/git/bcachefs: bcachefs: fix integer conversion bug bcachefs: btree node scan now fills in sectors_written bcachefs: Remove accidental debug assert	2024-04-29 11:04:40 -07:00
Chao Yu	48d180e2bf	f2fs: zone: fix to don't trigger OPU on pinfile for direct IO Otherwise, it breaks pinfile's sematics. Cc: Daeho Jeong <daeho43@gmail.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-29 17:41:17 +00:00
Chao Yu	20faaf30e5	f2fs: fix to do sanity check on i_xattr_nid in sanity_check_inode() syzbot reports a kernel bug as below: F2FS-fs (loop0): Mounted with checkpoint version = 48b305e4 ================================================================== BUG: KASAN: slab-out-of-bounds in f2fs_test_bit fs/f2fs/f2fs.h:2933 [inline] BUG: KASAN: slab-out-of-bounds in current_nat_addr fs/f2fs/node.h:213 [inline] BUG: KASAN: slab-out-of-bounds in f2fs_get_node_info+0xece/0x1200 fs/f2fs/node.c:600 Read of size 1 at addr ffff88807a58c76c by task syz-executor280/5076 CPU: 1 PID: 5076 Comm: syz-executor280 Not tainted 6.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 f2fs_test_bit fs/f2fs/f2fs.h:2933 [inline] current_nat_addr fs/f2fs/node.h:213 [inline] f2fs_get_node_info+0xece/0x1200 fs/f2fs/node.c:600 f2fs_xattr_fiemap fs/f2fs/data.c:1848 [inline] f2fs_fiemap+0x55d/0x1ee0 fs/f2fs/data.c:1925 ioctl_fiemap fs/ioctl.c:220 [inline] do_vfs_ioctl+0x1c07/0x2e50 fs/ioctl.c:838 __do_sys_ioctl fs/ioctl.c:902 [inline] __se_sys_ioctl+0x81/0x170 fs/ioctl.c:890 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is we missed to do sanity check on i_xattr_nid during f2fs_iget(), so that in fiemap() path, current_nat_addr() will access nat_bitmap w/ offset from invalid i_xattr_nid, result in triggering kasan bug report, fix it. Reported-and-tested-by: syzbot+3694e283cf5c40df6d14@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/00000000000094036c0616e72a1d@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-29 17:41:17 +00:00
Chao Yu	a320b2f08b	f2fs: fix to avoid allocating WARM_DATA segment for direct IO If active_log is not 6, we never use WARM_DATA segment, let's avoid allocating WARM_DATA segment for direct IO. Signed-off-by: Yunlei He <heyunlei@oppo.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-29 17:41:17 +00:00
Yifan Zhao	ecd69be71a	f2fs: remove redundant parameter in is_next_segment_free() is_next_segment_free() takes a redundant `type` parameter. Remove it. Signed-off-by: Yifan Zhao <zhaoyifan@sjtu.edu.cn> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-29 17:41:17 +00:00
Linus Torvalds	b947cc5bf6	Changes since last update: - Better error message when prepare_ondemand_read failed; - Fix unmount of bdev-based mode if CONFIG_EROFS_FS_ONDEMAND is on. -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmYvafURHHhpYW5nQGtl cm5lbC5vcmcACgkQUXZn5Zlu5qrgAg/+PK6WKc2t5SiLpG+dSLgQLhWfebJXhf3+ IWr4hT04JcFJgmxBJAcgh7uoJFrxjTxGDX0vtiLhAAOqRCXb9rDlpqbu+PXfURa1 YyiBAEuqnwqKCm3BixVixV8JCSLHu+UzJ72mZGw+wtSdhInXS/ddBAc8WNVbfjyU cUJb30+N5vjYVRYtDQG5qP0dbpstm1cTjNJIhjStcSpk9LQkQ9degYctjJj1mk8B ilplRIyy0bEBVbupUHPtmrHzpDZYe4QXFTtQaEGst1uDw/Cxw8I8zpwO4wavdxVz DZwDF6bcKGsaCFIV4UuJvowU7syEDkBHk5JvJoyNEhRuZQYOIvMdYRxl6AxAp1Jk vV5sn/LbWaqF7uR2Lw4NJBapulcDjEMfHuP2FyzzRqMmPO/4BiGrLnK+IRu3cSTH 0/bh6oRdplx8Q7faylMnSTXxYigXfu6vjPXuIZ95Wisk9L+URzhmdwkhnnu8MRPn DSQfD2VvRrRr52uFiR+LppspPrb8ru6EqORW9ShpugM0V9XPiGyYnNFlCMtoCpgf fYx5p2F23fXNq461dB6UwZfIM9jvMvrNsb3h3+pvpN/y5UbsW6kMUD38MwZxlSPT z18P87HKHXGNc6+pQK7zI89I7g+2W5TCZVg0/w5DGTXkOlDQPi9X9woaOYcJIPYJ LcDG0uM18kI= =v0WH -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.9-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Three fixes related to EROFS fscache mode. The most important two patches fix calling kill_block_super() in bdev-based mode instead of kill_anon_super(). The remaining patch is an informative one. Summary: - Better error message when prepare_ondemand_read failed - Fix unmount of bdev-based mode if CONFIG_EROFS_FS_ONDEMAND is on" * tag 'erofs-for-6.9-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: reliably distinguish block based and fscache mode erofs: get rid of erofs_fs_context erofs: modify the error message when prepare_ondemand_read failed	2024-04-29 08:52:08 -07:00
David Howells	120b878158	netfs: Use subreq_counter to allocate subreq debug_index values Use the subreq_counter in netfs_io_request to allocate subrequest debug_index values in read ops as well as write ops. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-04-29 15:01:44 +01:00
David Howells	93bf1cc009	netfs: Make netfs_io_request::subreq_counter an atomic_t Make the netfs_io_request::subreq_counter, used to generate values for netfs_io_subrequest::debug_index, into an atomic_t so that it can be called from the retry thread at the same time as the app thread issuing writes. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org	2024-04-29 15:01:43 +01:00
David Howells	ae678317b9	netfs: Remove deprecated use of PG_private_2 as a second writeback flag Remove the deprecated use of PG_private_2 in netfslib. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-04-29 15:01:43 +01:00
David Howells	2e9d7e4b98	mm: Remove the PG_fscache alias for PG_private_2 Remove the PG_fscache alias for PG_private_2 and use the latter directly. Use of this flag for marking pages undergoing writing to the cache should be considered deprecated and the folios should be marked dirty instead and the write done in ->writepages(). Note that PG_private_2 itself should be considered deprecated and up for future removal by the MM folks too. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: Bharath SM <bharathsm@microsoft.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-04-29 15:01:42 +01:00
David Howells	2ff1e97587	netfs: Replace PG_fscache by setting folio->private and marking dirty When dirty data is being written to the cache, setting/waiting on/clearing the fscache flag is always done in tandem with setting/waiting on/clearing the writeback flag. The netfslib buffered write routines wait on and set both flags and the write request cleanup clears both flags, so the fscache flag is almost superfluous. The reason it isn't superfluous is because the fscache flag is also used to indicate that data just read from the server is being written to the cache. The flag is used to prevent a race involving overlapping direct-I/O writes to the cache. Change this to indicate that a page is in need of being copied to the cache by placing a magic value in folio->private and marking the folios dirty. Then when the writeback code sees a folio marked in this way, it only writes it to the cache and not to the server. If a folio that has this magic value set is modified, the value is just replaced and the folio will then be uplodaded too. With this, PG_fscache is no longer required by the netfslib core, 9p and afs. Ceph and nfs, however, still need to use the old PG_fscache-based tracking. To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the flags in the netfs_inode struct for those filesystems. This reenables the use of PG_fscache in that inode. 9p and afs use the netfslib write helpers so get switched over; cifs, for the moment, does page-by-page manual access to the cache, so doesn't use PG_fscache and is unaffected. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: Bharath SM <bharathsm@microsoft.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-04-29 15:01:42 +01:00
David Howells	5f24162f87	netfs: Update i_blocks when write committed to pagecache Update i_blocks when i_size is updated when we finish making a write to the pagecache to reflect the amount of space we think will be consumed. This maintains cifs commit `dbfdff402d` ("smb3: update allocation size more accurately on write completion") which would otherwise be removed by the cifs part of the netfs writeback rewrite. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: Rohith Surabattula <rohiths.msft@gmail.com> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org	2024-04-29 15:01:42 +01:00
Zhang Yi	5ce5674187	xfs: convert delayed extents to unwritten when zeroing post eof blocks Current clone operation could be non-atomic if the destination of a file is beyond EOF, user could get a file with corrupted (zeroed) data on crash. The problem is about preallocations. If you write some data into a file: [A...B) and XFS decides to preallocate some post-eof blocks, then it can create a delayed allocation reservation: [A.........D) The writeback path tries to convert delayed extents to real ones by allocating blocks. If there aren't enough contiguous free space, we can end up with two extents, the first real and the second still delalloc: [A....C)[C.D) After that, both the in-memory and the on-disk file sizes are still B. If we clone into the range [E...F) from another file: [A....C)[C.D) [E...F) then xfs_reflink_zero_posteof() calls iomap_zero_range() to zero out the range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc extent, its pagecache will be zeroed and both the in-memory and on-disk size will be updated to D after flushing but before cloning. This is wrong, because the user can see the size change and read the zeroes while the clone operation is ongoing. We need to keep the in-memory and on-disk size before the clone operation starts, so instead of writing zeroes through the page cache for delayed ranges beyond EOF, we convert these ranges to unwritten and invalidate any cached data over that range beyond EOF. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-29 17:23:11 +05:30
Zhang Yi	2e08371a83	xfs: make xfs_bmapi_convert_delalloc() to allocate the target offset Since xfs_bmapi_convert_delalloc() only attempts to allocate the entire delalloc extent and require multiple invocations to allocate the target offset. So xfs_convert_blocks() add a loop to do this job and we call it in the write back path, but xfs_convert_blocks() isn't a common helper. Let's do it in xfs_bmapi_convert_delalloc() and drop xfs_convert_blocks(), preparing for the post EOF delalloc blocks converting in the buffered write begin path. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-29 17:23:11 +05:30
Zhang Yi	fc8d0ba0ff	xfs: make the seq argument to xfs_bmapi_convert_delalloc() optional Allow callers to pass a NULLL seq argument if they don't care about the fork sequence number. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-29 17:23:11 +05:30
Zhang Yi	bb712842a8	xfs: match lock mode in xfs_buffered_write_iomap_begin() Commit `1aa91d9c99` ("xfs: Add async buffered write support") replace xfs_ilock(XFS_ILOCK_EXCL) with xfs_ilock_for_iomap() when locking the writing inode, and a new variable lockmode is used to indicate the lock mode. Although the lockmode should always be XFS_ILOCK_EXCL, it's still better to use this variable instead of useing XFS_ILOCK_EXCL directly when unlocking the inode. Fixes: `1aa91d9c99` ("xfs: Add async buffered write support") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-29 17:23:11 +05:30
Matthew Wilcox (Oracle)	f3851fed07	gfs2: Convert gfs2_page_mkwrite() to use a folio Convert the incoming page to a folio and use it throughout, saving several calls to compound_head(). Also use 'pos' for file position rather than the ambiguous 'offset' and convert 'length' to type size_t in case we get some truly ridiculous sized folios in the future. This function should now be large-folio safe, but I may have missed something. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-29 13:04:40 +02:00
Andreas Gruenbacher	fcd63086bc	gfs2: gfs2_freeze_unlock cleanup Function gfs2_freeze_unlock() is always called with &sdp->sd_freeze_gh as its argument, so clean up the code by passing in sdp instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-29 12:35:15 +02:00
Kent Overstreet	c258c08add	bcachefs: fix integer conversion bug Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-04-28 21:34:29 -04:00
Kent Overstreet	f7c3dc2646	bcachefs: btree node scan now fills in sectors_written Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-04-28 21:34:29 -04:00
Kent Overstreet	ae92765373	bcachefs: Remove accidental debug assert Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-04-28 21:34:29 -04:00
Namjae Jeon	bc642d7bfd	ksmbd: fix uninitialized symbol 'share' in smb2_tree_connect() Fix uninitialized symbol 'share' in smb2_tree_connect(). Fixes: `e9d8c2f95a` ("ksmbd: add continuous availability share parameter") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-04-28 19:36:03 -05:00
Christian Brauner	7af2ae1b15	erofs: reliably distinguish block based and fscache mode When erofs_kill_sb() is called in block dev based mode, s_bdev may not have been initialised yet, and if CONFIG_EROFS_FS_ONDEMAND is enabled, it will be mistaken for fscache mode, and then attempt to free an anon_dev that has never been allocated, triggering the following warning: ============================================ ida_free called for id=0 which is not allocated. WARNING: CPU: 14 PID: 926 at lib/idr.c:525 ida_free+0x134/0x140 Modules linked in: CPU: 14 PID: 926 Comm: mount Not tainted 6.9.0-rc3-dirty #630 RIP: 0010:ida_free+0x134/0x140 Call Trace: <TASK> erofs_kill_sb+0x81/0x90 deactivate_locked_super+0x35/0x80 get_tree_bdev+0x136/0x1e0 vfs_get_tree+0x2c/0xf0 do_new_mount+0x190/0x2f0 [...] ============================================ Now when erofs_kill_sb() is called, erofs_sb_info must have been initialised, so use sbi->fsid to distinguish between the two modes. Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240419123611.947084-3-libaokun1@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-04-28 20:36:52 +08:00
Baokun Li	07abe43a28	erofs: get rid of erofs_fs_context Instead of allocating the erofs_sb_info in fill_super() allocate it during erofs_init_fs_context() and ensure that erofs can always have the info available during erofs_kill_sb(). After this erofs_fs_context is no longer needed, replace ctx with sbi, no functional changes. Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240419123611.947084-2-libaokun1@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-04-28 20:36:51 +08:00
Hongbo Li	17597b1e18	erofs: modify the error message when prepare_ondemand_read failed When prepare_ondemand_read failed, wrong error message is printed. The prepare_read is also implemented in cachefiles, so we amend it. Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20240424084247.759432-1-lihongbo22@huawei.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-04-28 20:36:44 +08:00
Linus Torvalds	d43df69f38	three smb3 client fixes, all also for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYtKK8ACgkQiiy9cAdy T1EvKAwAtvALEtcuvh3rjMevy4lA12HMAKoAMrHbALKlC3Ile6lL8x6B7Os04S04 qs8aawmSONbL8W8be2gFOfKlaaAVU827HYllHaBi54xYUCj3ZaeDN0LuSHZEa37f UR8SbJr+BVIigQd93Z+Sq/WglvFJig+AY4sIvvThWvlNXDQXRpCTsEi4nIMvQ68+ 1jEAk5TlOS/CnNLGQI13m+Z3wbGYr9tj/fyvNh9C6pyP0nDP6COw9xDwLqYWHOEP yBmEoDVThMbjj4thimrbF7n6K8TNf4dCXdhyo74ggotia5CvHeKtQ82DCNRG1sCZ aMbPL53wM3BM+1Kuvsxx72bk4xOHQINKLM+AXPPgG/amjcs5OnLNEeiUfPWUZ4Rn fMQkDblz2vwGzpRXAziQVyXutlJKqq2jXq4L52H5KCRChfU8KLy7rv27/CaOnDbr Azjpk66RH4hjfaQnVWXFxqqY3h0d2M3ZUqQWcJRLDZvgnBwUhutytTYUf0G++lnP 7oSjF08O =2FY2 -----END PGP SIGNATURE----- Merge tag '6.9-rc5-cifs-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: "Three smb3 client fixes, all also for stable: - two small locking fixes spotted by Coverity - FILE_ALL_INFO and network_open_info packing fix" * tag '6.9-rc5-cifs-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: fix lock ordering potential deadlock in cifs_sync_mid_result smb3: missing lock when picking channel smb: client: Fix struct_group() usage in __packed structs	2024-04-27 11:35:40 -07:00
Linus Torvalds	e6ebf01172	11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZiwPZwAKCRDdBJ7gKXxA jmcQAPkB6UT/rBUMvFZb1dom9R6SDYl5ZBr20Vj1HvfakCLxmQEAqEd0N7QoWvKS hKNCMDujiEKqDUWeUaJen4cqXFFE2Qg= =1wP7 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over" * tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() selftests: mm: protection_keys: save/restore nr_hugepages value from launch script stackdepot: respect __GFP_NOLOCKDEP allocation flag hugetlb: check for anon_vma prior to folio allocation mm: zswap: fix shrinker NULL crash with cgroup_disable=memory mm: turn folio_test_hugetlb into a PageType mm: support page_mapcount() on page_has_type() pages mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros mm/hugetlb: fix missing hugetlb_lock for resv uncharge selftests: mm: fix unused and uninitialized variable warning selftests/harness: remove use of LINE_MAX	2024-04-26 13:48:03 -07:00
Linus Torvalds	52034cae02	vfs-6.9-rc6.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZiulnAAKCRCRxhvAZXjc ogO+AP9z3+WAvgGmJkWOjT1aOrcQWVe+ZEdEUdK26ufkHhM5vAD/RXmdUBVHcYWk 3oE1hG8bONOASUc6dUIATPHBDjvqFg8= =LtmL -----END PGP SIGNATURE----- Merge tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window and the attempt to handle the ntfs removal regression that was reported a little while ago: - After the removal of the legacy ntfs driver we received reports about regressions for some people that do mount "ntfs" explicitly and expect the driver to be available. Since ntfs3 is a drop-in for legacy ntfs we alias legacy ntfs to ntfs3 just like ext3 is aliased to ext4. We also enforce legacy ntfs is always mounted read-only and give it custom file operations to ensure that ioctl()'s can't be abused to perform write operations. - Fix an unbalanced module_get() in bdev_open(). - Two smaller fixes for the netfs work done earlier in this cycle. - Fix the errno returned from the new FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH ioctls. Both commands just pull information out of the superblock so there's no need to call into the actual ioctl handlers. So instead of returning ENOIOCTLCMD to indicate to fallback we just return ENOTTY directly avoiding that indirection" * tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the pre-flush when appending to a file in writethrough mode netfs: Fix writethrough-mode error handling ntfs3: add legacy ntfs file operations ntfs3: enforce read-only when used as legacy ntfs driver ntfs3: serve as alias for the legacy ntfs driver block: fix module reference leakage from bdev_open_by_dev error path fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail	2024-04-26 11:01:28 -07:00
David Howells	c97f59e276	netfs: Fix the pre-flush when appending to a file in writethrough mode In netfs_perform_write(), when the file is marked NETFS_ICTX_WRITETHROUGH or O_SYNC or RWF_SYNC was specified, write-through caching is performed on a buffered file. When setting up for write-through, we flush any conflicting writes in the region and wait for the write to complete, failing if there's a write error to return. The issue arises if we're writing at or above the EOF position because we skip the flush and - more importantly - the wait. This becomes a problem if there's a partial folio at the end of the file that is being written out and we want to make a write to it too. Both the already-running write and the write we start both want to clear the writeback mark, but whoever is second causes a warning looking something like: ------------[ cut here ]------------ R=00000012: folio 11 is not under writeback WARNING: CPU: 34 PID: 654 at fs/netfs/write_collect.c:105 ... CPU: 34 PID: 654 Comm: kworker/u386:27 Tainted: G S ... ... Workqueue: events_unbound netfs_write_collection_worker ... RIP: 0010:netfs_writeback_lookup_folio Fix this by making the flush-and-wait unconditional. It will do nothing if there are no folios in the pagecache and will return quickly if there are no folios in the region specified. Further, move the WBC attachment above the flush call as the flush is going to attach a WBC and detach it again if it is not present - and since we need one anyway we might as well share it. Fixes: `41d8e7673a` ("netfs: Implement a write-through caching option") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202404161031.468b84f-oliver.sang@intel.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2150448.1714130115@warthog.procyon.org.uk Reviewed-by: Jeffrey Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-26 14:56:18 +02:00
Dawid Osuchowski	55394d29c9	fs: Create anon_inode_getfile_fmode() Creates an anon_inode_getfile_fmode() function that works similarly to anon_inode_getfile() with the addition of being able to set the fmode member. Signed-off-by: Dawid Osuchowski <linux@osuchow.ski> Link: https://lore.kernel.org/r/20240426075854.4723-1-linux@osuchow.ski Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-26 10:33:05 +02:00
Christoph Hellwig	e58ac1770d	xfs: refactor dir format helpers Add a new enum and a xfs_dir2_format helper that returns it to allow the code to switch on the format of a directory in a single operation and switch all helpers of xfs_dir2_isblock and xfs_dir2_isleaf to it. This also removes the explicit xfs_iread_extents call in a few of the call sites given that xfs_bmap_last_offset already takes care of it underneath. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-26 11:21:46 +05:30
Christoph Hellwig	dfe5febe2b	xfs: factor out a xfs_dir_replace_args helper Add a helper to switch between the different directory formats for removing a directory entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-26 11:19:04 +05:30
Christoph Hellwig	3866e6e669	xfs: factor out a xfs_dir_removename_args helper Add a helper to switch between the different directory formats for removing a directory entry. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-26 11:19:04 +05:30
Christoph Hellwig	4d893a4051	xfs: factor out a xfs_dir_createname_args helper Add a helper to switch between the different directory formats for creating a directory entry and to handle the XFS_DA_OP_JUSTCHECK flag based on the passed in ino number field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-26 11:19:03 +05:30
Christoph Hellwig	14ee22fef4	xfs: factor out a xfs_dir_lookup_args helper Add a helper to switch between the different directory formats for lookup and to handle the -EEXIST return for a successful lookup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-26 11:19:03 +05:30
Yang Li	4a45857694	nilfs2: add kernel-doc comments to nilfs_remove_all_gcinodes() This commit adds kernel-doc style comments with complete parameter descriptions for the function nilfs_remove_all_gcinodes. Link: https://lkml.kernel.org/r/20240410075629.3441-4-konishi.ryusuke@gmail.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:08 -07:00
Yang Li	3da9b9650a	nilfs2: add kernel-doc comments to nilfs_btree_convert_and_insert() This commit adds kernel-doc style comments with complete parameter descriptions for the function nilfs_btree_convert_and_insert. Link: https://lkml.kernel.org/r/20240410075629.3441-3-konishi.ryusuke@gmail.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:08 -07:00
Yang Li	2725844080	nilfs2: add kernel-doc comments to nilfs_do_roll_forward() Patch series "nilfs2: fix missing kernel-doc comments". This commit adds kernel-doc style comments with complete parameter descriptions for the function nilfs_do_roll_forward. Link: https://lkml.kernel.org/r/20240410075629.3441-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240410075629.3441-2-konishi.ryusuke@gmail.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:08 -07:00
Su Yue	b8cb324277	ocfs2: use coarse time for new created files The default atime related mount option is '-o realtime' which means file atime should be updated if atime <= ctime or atime <= mtime. atime should be updated in the following scenario, but it is not: ========================================================== $ rm /mnt/testfile; $ echo test > /mnt/testfile $ stat -c "%X %Y %Z" /mnt/testfile 1711881646 1711881646 1711881646 $ sleep 5 $ cat /mnt/testfile > /dev/null $ stat -c "%X %Y %Z" /mnt/testfile 1711881646 1711881646 1711881646 ========================================================== And the reason the atime in the test is not updated is that ocfs2 calls ktime_get_real_ts64() in __ocfs2_mknod_locked during file creation. Then inode_set_ctime_current() is called in inode_set_ctime_current() calls ktime_get_coarse_real_ts64() to get current time. ktime_get_real_ts64() is more accurate than ktime_get_coarse_real_ts64(). In my test box, I saw ctime set by ktime_get_coarse_real_ts64() is less than ktime_get_real_ts64() even ctime is set later. The ctime of the new inode is smaller than atime. The call trace is like: ocfs2_create ocfs2_mknod __ocfs2_mknod_locked .... ktime_get_real_ts64 <------- set atime,ctime,mtime, more accurate ocfs2_populate_inode ... ocfs2_init_acl ocfs2_acl_set_mode inode_set_ctime_current current_time ktime_get_coarse_real_ts64 <-------less accurate ocfs2_file_read_iter ocfs2_inode_lock_atime ocfs2_should_update_atime atime <= ctime ? <-------- false, ctime < atime due to accuracy So here call ktime_get_coarse_real_ts64 to set inode time coarser while creating new files. It may lower the accuracy of file times. But it's not a big deal since we already use coarse time in other places like ocfs2_update_inode_atime and inode_set_ctime_current. Link: https://lkml.kernel.org/r/20240408082041.20925-5-glass.su@suse.com Fixes: `c62c38f6b9` ("ocfs2: replace CURRENT_TIME macro") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:07 -07:00
Su Yue	8c40984eeb	ocfs2: update inode fsync transaction id in ocfs2_unlink and ocfs2_link transaction id should be updated in ocfs2_unlink and ocfs2_link. Otherwise, inode link will be wrong after journal replay even fsync was called before power failure: ======================================================================= $ touch testdir/bar $ ln testdir/bar testdir/bar_link $ fsync testdir/bar $ stat -c %h $SCRATCH_MNT/testdir/bar 1 $ stat -c %h $SCRATCH_MNT/testdir/bar 1 ======================================================================= Link: https://lkml.kernel.org/r/20240408082041.20925-4-glass.su@suse.com Fixes: `ccd979bdbc` ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:07 -07:00
Su Yue	952b023f06	ocfs2: fix races between hole punching and AIO+DIO After commit "ocfs2: return real error code in ocfs2_dio_wr_get_block", fstests/generic/300 become from always failed to sometimes failed: ======================================================================== [ 473.293420 ] run fstests generic/300 [ 475.296983 ] JBD2: Ignoring recovery information on journal [ 475.302473 ] ocfs2: Mounting device (253,1) on (node local, slot 0) with ordered data mode. [ 494.290998 ] OCFS2: ERROR (device dm-1): ocfs2_change_extent_flag: Owner 5668 has an extent at cpos 78723 which can no longer be found [ 494.291609 ] On-disk corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted. [ 494.292018 ] OCFS2: File system is now read-only. [ 494.292224 ] (kworker/19:11,2628,19):ocfs2_mark_extent_written:5272 ERROR: status = -30 [ 494.292602 ] (kworker/19:11,2628,19):ocfs2_dio_end_io_write:2374 ERROR: status = -3 fio: io_u error on file /mnt/scratch/racer: Read-only file system: write offset=460849152, buflen=131072 ========================================================================= In __blockdev_direct_IO, ocfs2_dio_wr_get_block is called to add unwritten extents to a list. extents are also inserted into extent tree in ocfs2_write_begin_nolock. Then another thread call fallocate to puch a hole at one of the unwritten extent. The extent at cpos was removed by ocfs2_remove_extent(). At end io worker thread, ocfs2_search_extent_list found there is no such extent at the cpos. T1 T2 T3 inode lock ... insert extents ... inode unlock ocfs2_fallocate __ocfs2_change_file_space inode lock lock ip_alloc_sem ocfs2_remove_inode_range inode ocfs2_remove_btree_range ocfs2_remove_extent ^---remove the extent at cpos 78723 ... unlock ip_alloc_sem inode unlock ocfs2_dio_end_io ocfs2_dio_end_io_write lock ip_alloc_sem ocfs2_mark_extent_written ocfs2_change_extent_flag ocfs2_search_extent_list ^---failed to find extent ... unlock ip_alloc_sem In most filesystems, fallocate is not compatible with racing with AIO+DIO, so fix it by adding to wait for all dio before fallocate/punch_hole like ext4. Link: https://lkml.kernel.org/r/20240408082041.20925-3-glass.su@suse.com Fixes: `b25801038d` ("ocfs2: Support xfs style space reservation ioctls") Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:06 -07:00
Su Yue	d11547071a	ocfs2: return real error code in ocfs2_dio_wr_get_block Patch series "ocfs2 bugs fixes exposed by fstests", v3. The patchset is to fix some wrong behavior of ocfs2 exposed by fstests. Patch 1 makes userspace happy when some error happens when doing direct io. Before the patch, DIO always return -EIO in case of error. After the patch, it returns real error code such like -ENOSPC, EDQUOT... Patch 2 fixes an error case when doing AIO+DIO and hole punching at same file position in parallel. generic/300 Patch 3 fixes inode link count mismatch after power failure. Without the patch, inode link would be wrong even fync was called on the file. tests/generic/040,041,104,107,336 patch 4 fixes wrong atime with mount option realtime. Without the patch, atime of new created file won't be updated in right time. tests/generic/192 For stable kernels, I added fixes to patch 2,3,4. The patch 1 is not recommended to be backported since ocfs2_dio_wr_get_block calls too many functions. It's diffcult to check every git history of ocfs2 for every LTS kernel. This patch (of 4): ocfs2_dio_wr_get_block always returns -EIO in case of errors. However, some programs expect right exit codes while doing dio. For example, tools like fio treat -ENOSPC as expected code while doing stress jobs. And quota tools expect -EDQUOT when disk quota exceeds. -EIO is too strong return code in the dio path. The caller of ocfs2_dio_wr_get_block is __blockdev_direct_IO which is widely used and it handles error codes well. I have checked functions called by ocfs2_dio_wr_get_block and their return codes look good and clear. So I think it's safe to let ocfs2_dio_wr_get_block return real error code. Link: https://lkml.kernel.org/r/20240408082041.20925-1-glass.su@suse.com Link: https://lkml.kernel.org/r/20240408082041.20925-2-glass.su@suse.com Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:06 -07:00
Justin Stitt	ad5f0eb540	vmcore: replace strncpy with strscpy_pad strncpy() is in the process of being replaced as it is deprecated [1]. We should move towards safer and less ambiguous string interfaces. Looking at vmcoredd_header's definition: \| struct vmcoredd_header { \| __u32 n_namesz; /* Name size / \| __u32 n_descsz; / Content size / \| __u32 n_type; / NT_VMCOREDD / \| __u8 name[8]; / LINUX\0\0\0 / \| __u8 dump_name[VMCOREDD_MAX_NAME_BYTES]; / Device dump's name / \| }; .. we see that @name wants to be NUL-padded. We're copying data->dump_name which is defined as: \| char dump_name[VMCOREDD_MAX_NAME_BYTES]; / Unique name of the dump */ .. which shares the same size as vdd_hdr->dump_name. Let's make sure we NUL-pad this as well. Use strscpy_pad() which NUL-terminates and NUL-pads its destination buffers. Specifically, use the new 2-argument version of strscpy_pad introduced in Commit `e6584c3964` ("string: Allow 2-argument strscpy()"). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 Link: https://lkml.kernel.org/r/20240401-strncpy-fs-proc-vmcore-c-v2-1-dd0a73f42635@google.com Signed-off-by: Justin Stitt <justinstitt@google.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:06 -07:00
Phillip Lougher	040bf9a717	Squashfs: remove deprecated strncpy by not copying the string Squashfs copied the passed string (name) into a temporary buffer to ensure it was NUL-terminated. This however is completely unnecessary as the string is already NUL-terminated. So remove the deprecated strncpy() by completely removing the string copy. The background behind this unnecessary string copy is that it dates back to the days when Squashfs was an out of kernel patch. The code deliberately did not assume the string was NUL-terminated in case in future this changed (due to kernel changes). This would mean the out of tree patches would be broken but still compile OK. Link: https://lkml.kernel.org/r/20240403183352.391308-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:05 -07:00
Heming Zhao	fc07d2a211	ocfs2: fix sparse warnings 1. fs/ocfs2/localalloc.c:1224:41: warning: incorrect type in argument 1 (different base types) fs/ocfs2/localalloc.c:1224:41: expected unsigned long long val1 fs/ocfs2/localalloc.c:1224:41: got restricted __le32 [usertype] la_bm_off 2. fs/ocfs2/export.c:258:32: warning: cast to restricted __le32 fs/ocfs2/export.c:259:33: warning: cast to restricted __le32 fs/ocfs2/export.c:260:32: warning: cast to restricted __le32 fs/ocfs2/export.c:272:32: warning: cast to restricted __le32 fs/ocfs2/export.c:273:33: warning: cast to restricted __le32 fs/ocfs2/export.c:274:32: warning: cast to restricted __le32 3. fs/ocfs2/inode.c:1623:13: warning: context imbalance in 'ocfs2_inode_cache_lock' - wrong count at exit fs/ocfs2/inode.c:1630:13: warning: context imbalance in 'ocfs2_inode_cache_unlock' - unexpected unlock 4. fs/ocfs2/refcounttree.c:633:27: warning: incorrect type in assignment (different base types) fs/ocfs2/refcounttree.c:633:27: expected restricted __le32 [usertype] rf_generation fs/ocfs2/refcounttree.c:633:27: got unsigned int 5. fs/ocfs2/dlm/dlmdomain.c:1316:20: warning: context imbalance in 'dlm_query_nodeinfo_handler' - different lock contexts for basic block Link: https://lkml.kernel.org/r/20240328125203.20892-5-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:04 -07:00
Heming Zhao	525350221b	ocfs2: speed up chain-list searching Add short-circuit code to speed up searching Link: https://lkml.kernel.org/r/20240328125203.20892-4-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:04 -07:00
Heming Zhao	f51dac026f	ocfs2: adjust enabling place for la window Patch series "improve write IO performance when fragmentation is high", v6. This patch (of 4): After introducing gd->bg_contig_free_bits, the code path 'ocfs2_cluster_group_search() => ocfs2_local_alloc_seen_free_bits()' becomes death when all the gd->bg_contig_free_bits are set to the correct value. This patch relocates ocfs2_local_alloc_seen_free_bits() to a more appropriate location. (The new place being ocfs2_block_group_set_bits().) In ocfs2_local_alloc_seen_free_bits(), the scope of the spin-lock has been adjusted to reduce meaningless lock races. e.g: when userspace creates & deletes 1 cluster_size files in parallel, acquiring the spin-lock in ocfs2_local_alloc_seen_free_bits() is totally pointless and impedes IO performance. Link: https://lkml.kernel.org/r/20240328125203.20892-3-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:03 -07:00
Heming Zhao	4eb7b93e03	ocfs2: improve write IO performance when fragmentation is high The group_search function ocfs2_cluster_group_search() should bypass groups with insufficient space to avoid unnecessary searches. This patch is particularly useful when ocfs2 is handling huge number small files, and volume fragmentation is very high. In this case, ocfs2 is busy with looking up available la window from //global_bitmap. This patch introduces a new member in the Group Description (gd) struct called 'bg_contig_free_bits', representing the max contigous free bits in this gd. When ocfs2 allocates a new la window from //global_bitmap, 'bg_contig_free_bits' helps expedite the search process. Let's image below path. 1. la state (->local_alloc_state) is set THROTTLED or DISABLED. 2. when user delete a large file and trigger ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state unconditionally. 3. a write IOs thread run and trigger the worst performance path ``` ocfs2_reserve_clusters_with_limit ocfs2_reserve_local_alloc_bits ocfs2_local_alloc_slide_window //[1] + ocfs2_local_alloc_reserve_for_window //[2] + ocfs2_local_alloc_new_window //[3] ocfs2_recalc_la_window ``` [1]: will be called when la window bits used up. [2]: under la state is ENABLED, and this func only check global_bitmap free bits, it will succeed in general. [3]: will use the default la window size to search clusters then fail. ocfs2_recalc_la_window attempts other la window sizes. the timing complexity is O(n^4), resulting in a significant time cost for scanning global bitmap. This leads to a dramatic slowdown in write I/Os (e.g., user space 'dd'). i.e. an ocfs2 partition size: 1.45TB, cluster size: 4KB, la window default size: 106MB. The partition is fragmentation by creating & deleting huge mount of small files. before this patch, the timing of [3] should be (the number got from real world): - la window size change order (size: MB): 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8 only 0.8MB succeed, 0.8MB also triggers la window to disable. ocfs2_local_alloc_new_window retries 8 times, first 7 times totally runs in worst case. - group chain number: 242 ocfs2_claim_suballoc_bits calls for-loop 242 times - each chain has 49 block group ocfs2_search_chain calls while-loop 49 times - each bg has 32256 blocks ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits. for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use (32256/64) (this is not worst value) for timing calucation. the loop times: 724249(32256/64) = 41835024 (~42 million times) In the worst case, user space writes 1MB data will trigger 42M scanning times. under this patch, the timing is '7242*49 = 83006', reduced by three orders of magnitude. Link: https://lkml.kernel.org/r/20240328125203.20892-2-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:03 -07:00
Douglas Anderson	6b839b3b76	regset: use kvzalloc() for regset_get_alloc() While browsing through ChromeOS crash reports, I found one with an allocation failure that looked like this: chrome: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL\|__GFP_COMP\|__GFP_ZERO), nodemask=(null),cpuset=urgent,mems_allowed=0 CPU: 7 PID: 3295 Comm: chrome Not tainted 5.15.133-20574-g8044615ac35c #1 (HASH:1162 1) Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Call trace: ... warn_alloc+0x104/0x174 __alloc_pages+0x5f0/0x6e4 kmalloc_order+0x44/0x98 kmalloc_order_trace+0x34/0x124 __kmalloc+0x228/0x36c __regset_get+0x68/0xcc regset_get_alloc+0x1c/0x28 elf_core_dump+0x3d8/0xd8c do_coredump+0xeb8/0x1378 get_signal+0x14c/0x804 ... An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not a surprise that this allocation failed on a system that's been running for a while. More digging showed that it was fairly easy to see the order 7 allocation by just sending a SIGQUIT to chrome (or other processes) to generate a core dump. The actual amount being allocated was 279,584 bytes and it was for "core_note_type" NT_ARM_SVE. There was quite a bit of discussion [1] on the mailing lists in response to my v1 patch attempting to switch to vmalloc. The overall conclusion was that we could likely reduce the 279,584 byte allocation by quite a bit and Mark Brown has sent a patch to that effect [2]. However even with the 279,584 byte allocation gone there are still 65,552 byte allocations. These are just barely more than the 65,536 bytes and thus would require an order 5 allocation. An order 5 allocation is still something to avoid unless necessary and nothing needs the memory here to be contiguous. Change the allocation to kvzalloc() which should still be efficient for small allocations but doesn't force the memory subsystem to work hard (and maybe fail) at getting a large contiguous chunk. [1] https://lore.kernel.org/r/20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid [2] https://lore.kernel.org/r/20240203-arm64-sve-ptrace-regset-size-v1-1-2c3ba1386b9e@kernel.org Link: https://lkml.kernel.org/r/20240205092626.v2.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:03 -07:00
Yang Li	4d9784c00a	fs: add kernel-doc comments to fat_parse_long() Add kernel-doc style comments with complete parameter descriptions for fat_parse_long(). Link: https://lkml.kernel.org/r/20240322073724.102332-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:02 -07:00
Su Yue	c9abe09986	ocfs2: update inode ctime in ocfs2_fileattr_set inode ctime should be updated if ocfs2_fileattr_set is called. Link: https://lkml.kernel.org/r/20240318115609.3194-1-l@damenly.org Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:01 -07:00
Joseph Qi	30dd3478c3	ocfs2: correctly use ocfs2_find_next_zero_bit() If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so check the return value with -1 is meaningless. Correct this usage and cleanup the code. Link: https://lkml.kernel.org/r/20240314021713.240796-1-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 21:07:01 -07:00
Matthew Wilcox (Oracle)	039d26d10d	proc: convert smaps_pmd_entry to use a folio Replace two calls to compound_head() with one. Link: https://lkml.kernel.org/r/20240403171456.1445117-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:36 -07:00
Matthew Wilcox (Oracle)	27bb0a70e5	proc: pass a folio to smaps_page_accumulate() Both callers already have a folio; pass it in instead of doing the conversion each time. Link: https://lkml.kernel.org/r/20240403171456.1445117-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:36 -07:00
Matthew Wilcox (Oracle)	cfc96da432	proc: convert smaps_page_accumulate to use a folio Replaces three calls to compound_head() with one. Shrinks the function from 2614 bytes to 1112 bytes in an allmodconfig build. Link: https://lkml.kernel.org/r/20240403171456.1445117-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:35 -07:00
Matthew Wilcox (Oracle)	f1dc623fa0	proc: convert gather_stats to use a folio Patch series "Use folio APIs in procfs". We're down to very few users of the PageFoo macros, with proc being a major user. After this patchset and another patchset I have for khugepaged, we can get rid of PageActive, PageReadahead and PageSwapBacked. This patchset has the usual advantages in its own right of removing hidden calls to compound_head(). We have the page table lock, so the mapcount & refcount are stable and there can't be any races with folios suddenly becoming tail pages. This patch (of 4): Replaces six calls to compound_head() with one. Shrinks the function from 5054 bytes to 1756 bytes in an allmodconfig build. Link: https://lkml.kernel.org/r/20240403171456.1445117-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240403171456.1445117-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:35 -07:00
Matthew Wilcox (Oracle)	6c977f36dc	proc: convert smaps_account() to use a folio Replace seven calls to compound_head() with one. Link: https://lkml.kernel.org/r/20240402201252.917342-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:34 -07:00
Matthew Wilcox (Oracle)	03aa577f3b	proc: convert clear_refs_pte_range to use a folio Patch series "Remove page_idle and page_young wrappers". There are only a couple of places left using the page wrappers for idle & young tracking. Convert the two users in proc and then we can remove the wrappers. That enables the further simplification of autogenerating the definitions when CONFIG_PAGE_IDLE_FLAG is disabled. This patch (of 4): Replaces four calls to compound_head() with two. Link: https://lkml.kernel.org/r/20240402201252.917342-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240402201252.917342-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:34 -07:00
Jinjiang Tu	3a9e567ca4	mm/ksm: fix ksm exec support for prctl Patch series "mm/ksm: fix ksm exec support for prctl", v4. commit `3c6f33b727` ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). However, it doesn't create the mm_slot, so ksmd will not try to scan this task. The first patch fixes the issue. The second patch refactors to prepare for the third patch. The third patch extends the selftests of ksm to verfity the deduplication really happens after fork/exec inherits ths KSM setting. This patch (of 3): commit `3c6f33b727` ("mm/ksm: support fork/exec for prctl") inherits MMF_VM_MERGE_ANY flag when a task calls execve(). Howerver, it doesn't create the mm_slot, so ksmd will not try to scan this task. To fix it, allocate and add the mm_slot to ksm_mm_head in __bprm_mm_init() when the mm has MMF_VM_MERGE_ANY flag. Link: https://lkml.kernel.org/r/20240328111010.1502191-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20240328111010.1502191-2-tujinjiang@huawei.com Fixes: `3c6f33b727` ("mm/ksm: support fork/exec for prctl") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:29 -07:00
Rick Edgecombe	b80fa3cbb7	treewide: use initializer for struct vm_unmapped_area_info Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new ones are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area(), if the convention is to zero initialize the struct and any new field addition missed a call site that initializes each field manually. So it is useful to do things similar across the kernel. The consensus (see links) was that in general the best way to accomplish taking into account both code cleanliness and minimizing the chance of introducing bugs, was to do C99 static initialization. As in: struct vm_unmapped_area_info info = {}; With this method of initialization, the whole struct will be zero initialized, and any statements setting fields to zero will be unneeded. The change should not leave cleanup at the call sides. While iterating though the possible solutions a few archs kindly acked other variations that still zero initialized the struct. These sites have been modified in previous changes using the pattern acked by the respective arch. So to be reduce the chance of bugs via uninitialized fields, perform a tree wide change using the consensus for the best general way to do this change. Use C99 static initializing to zero the struct and remove and statements that simply set members to zero. Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/ Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:27 -07:00
Rick Edgecombe	529ce23a76	mm: switch mm->get_unmapped_area() to a flag The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:25 -07:00
Rick Edgecombe	5def1e0f47	proc: refactor pde_get_unmapped_area as prep Patch series "Cover a guard gap corner case", v4. In working on x86’s shadow stack feature, I came across some limitations around the kernel’s handling of guard gaps. AFAICT these limitations are not too important for the traditional stack usage of guard gaps, but have bigger impact on shadow stack’s usage. And now in addition to x86, we have two other architectures implementing shadow stack like features that plan to use guard gaps. I wanted to see about addressing them, but I have not worked on mmap() placement related code before, so would greatly appreciate if people could take a look and point me in the right direction. The nature of the limitations of concern is as follows. In order to ensure guard gaps between mappings, mmap() would need to consider two things: 1. That the new mapping isn’t placed in an any existing mapping’s guard gap. 2. That the new mapping isn’t placed such that any existing mappings are not in its guard gaps Currently mmap never considers (2), and (1) is not considered in some situations. When not passing an address hint, or passing one without MAP_FIXED_NOREPLACE, (1) is enforced. With MAP_FIXED_NOREPLACE, (1) is not enforced. With MAP_FIXED, (1) is not considered, but this seems to be expected since MAP_FIXED can already clobber existing mappings. For MAP_FIXED_NOREPLACE I would have guessed it should respect the guard gaps of existing mappings, but it is probably a little ambiguous. In this series I just tried to add enforcement of (2) for the normal (no address hint) case and only for the newer shadow stack memory (not stacks). The reason is that with the no-address-hint situation, landing next to a guard gap could come up naturally and so be more influencable by attackers such that two shadow stacks could be adjacent without a guard gap. Where as the address-hint scenarios would require more control - being able to call mmap() with specific arguments. As for why not just fix the other corner cases anyway, I thought it might have some greater possibility of affecting existing apps. This patch (of 14): Future changes will perform a treewide change to remove the indirect branch that is involved in calling mm->get_unmapped_area(). After doing this, the function will no longer be able to be handled as a function pointer. To make the treewide change diff cleaner and easier to review, refactor pde_get_unmapped_area() such that mm->get_unmapped_area() is called without being stored in a local function pointer. With this in refactoring, follow on changes will be able to simply replace the call site with a future function that calls it directly. Link: https://lkml.kernel.org/r/20240326021656.202649-1-rick.p.edgecombe@intel.com Link: https://lkml.kernel.org/r/20240326021656.202649-2-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:25 -07:00
ZhangPeng	afd584398b	userfaultfd: early return in dup_userfaultfd() When vma->vm_userfaultfd_ctx.ctx is NULL, vma->vm_flags should have cleared __VM_UFFD_FLAGS. Therefore, there is no need to down_write or clear the flag, which will affect fork performance. Fix this by returning early if octx is NULL in dup_userfaultfd(). By applying this patch we can get a 1.3% performance improvement for lmbench fork_prot. Results are as follows: base early return Process fork+exit: 419.1106 413.4804 Link: https://lkml.kernel.org/r/20240327090835.3232629-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:25 -07:00
Matthew Wilcox (Oracle)	c93012d849	dax: use huge_zero_folio Convert from huge_zero_page to huge_zero_folio. Link: https://lkml.kernel.org/r/20240326202833.523759-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:20 -07:00
Matthew Wilcox (Oracle)	5beaee54a3	mm: add is_huge_zero_folio() This is the folio equivalent of is_huge_zero_page(). It doesn't add any efficiency, but it does prevent the caller from passing a tail page and getting confused when the predicate returns false. Link: https://lkml.kernel.org/r/20240326202833.523759-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:18 -07:00
Matthew Wilcox (Oracle)	dee3d0bef2	proc: rewrite stable_page_flags() Reduce the usage of PageFlag tests and reduce the number of compound_head() calls. For multi-page folios, we'll now show all pages as having the flags that apply to them, e.g. if it's dirty, all pages will have the dirty flag set instead of just the head page. The mapped flag is still per page, as is the hwpoison flag. [willy@infradead.org: fix up some bits vs masks] Link: https://lkml.kernel.org/r/20240403173112.1450721-1-willy@infradead.org [willy@infradead.org: fix warnings] Link: https://lkml.kernel.org/r/ZhBPtCYfSuFuUMEz@casper.infradead.org Link: https://lkml.kernel.org/r/20240326171045.410737-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Svetly Todorov <svetly.todorov@memverge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:56:15 -07:00
Suren Baghdasaryan	2c321f3f70	mm: change inlined allocation helpers to account at the call site Main goal of memory allocation profiling patchset is to provide accounting that is cheap enough to run in production. To achieve that we inject counters using codetags at the allocation call sites to account every time allocation is made. This injection allows us to perform accounting efficiently because injected counters are immediately available as opposed to the alternative methods, such as using _RET_IP_, which would require counter lookup and appropriate locking that makes accounting much more expensive. This method requires all allocation functions to inject separate counters at their call sites so that their callers can be individually accounted. Counter injection is implemented by allocation hooks which should wrap all allocation functions. Inlined functions which perform allocations but do not use allocation hooks are directly charged for the allocations they perform. In most cases these functions are just specialized allocation wrappers used from multiple places to allocate objects of a specific type. It would be more useful to do the accounting at their call sites instead. Instrument these helpers to do accounting at the call site. Simple inlined allocation wrappers are converted directly into macros. More complex allocators or allocators with documentation are converted into _noprof versions and allocation hooks are added. This allows memory allocation profiling mechanism to charge allocations to the callers of these functions. Link: https://lkml.kernel.org/r/20240415020731.1152108-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Jan Kara <jack@suse.cz> [jbd2] Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:59 -07:00
Johannes Weiner	91cdcd8d62	mm: zswap: optimize zswap pool size tracking Profiling the munmap() of a zswapped memory region shows 60% of the total cycles currently going into updating the zswap_pool_total_size. There are three consumers of this counter: - store, to enforce the globally configured pool limit - meminfo & debugfs, to report the size to the user - shrink, to determine the batch size for each cycle Instead of aggregating everytime an entry enters or exits the zswap pool, aggregate the value from the zpools on-demand: - Stores aggregate the counter anyway upon success. Aggregating to check the limit instead is the same amount of work. - Meminfo & debugfs might benefit somewhat from a pre-aggregated counter, but aren't exactly hotpaths. - Shrinking can aggregate once for every cycle instead of doing it for every freed entry. As the shrinker might work on tens or hundreds of objects per scan cycle, this is a large reduction in aggregations. The paths that benefit dramatically are swapin, swapoff, and unmaps. There could be millions of pages being processed until somebody asks for the pool size again. This eliminates the pool size updates from those paths entirely. Top profile entries for a 24G range munmap(), before: 38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size 12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size 9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size 2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap 2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free 2.86% zswap-unmap [kernel.kallsyms] [k] xas_store and after: 7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free 7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap 6.74% zswap-unmap [kernel.kallsyms] [k] xas_store It was also briefly considered to move to a single atomic in zswap that is updated by the backends, since zswap only cares about the sum of all pools anyway. However, zram directly needs per-pool information out of zsmalloc. To keep the backend from having to update two atomics every time, I opted for the lazy aggregation instead for now. Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-25 20:55:47 -07:00
Linus Torvalds	dda89e2fbc	fs/9p: fixes for 6.9-rc6 Contains a single mitigation to help deal with an apparent race condition between client and server having to deal with inode number collisions. Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYqu2QACgkQiP/V+0pf /5h6FhAAjYxXb3zSeQouR7Nr+n4Hc1pIG2hwfwDg0ruZfXnKDCnDvCtMmKJZVQWc PtR51+wlKLsrCcVArD4zUI9dezJCAzHrG2W+MMn0tka4rQYIGA8TAVXWC0dpVs+e 0UxG/CFjwcHcwRIyCcgmzSetO+rR2kVaK9Nmsd+DkCixRFsJdmG1xZqMEsUg339b rAnA82fncR5cHvoaTNhFK3TIzIZ78v/xOTORjCSsXYgLBC1Sq7gwPxt11Ms9qZBK 2ttkU6PB/AIL2gXm7VfAQ82HZY5AWRlwH1EFHcgge1vylXoFJuqadhIaj+l6QC4N fgQGA6q+288vrjr5z5WXlBUCqHO2MXxPhJxEyViif+TIyJ/eCW+G777J4wPKHMiZ LHx3/4XpbzCMgAZs29Y5l7e53xE13OfnrIngC18iX3AP/fUPKi5fYkiDX59id++k PPjKlZJI7zW29hXXgBoqQtGG/h2H3d5y4B6dummDv4teGjnY5jPhA+KaL8Z9MUkV NzSozmsL+zFVK9El+FRI+4REltrg/UwdznknMHe7MJeqigdCMWUSpn7TWh9rxWft 0yLzf3QXZuYG06o3YP6RAwBTdeRay5THU8X09nusPnmoJf9oXYB/lxlqa1k/4msa kBrZzAHGOmVTpJQYTrspgWp57NyFsAbQqkzYCg6zrBj6P0mdCfs= =IrQ9 -----END PGP SIGNATURE----- Merge tag '9p-for-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p fix from Eric Van Hensbergen: "This contains a single mitigation to help deal with an apparent race condition between client and server having to deal with inode number collisions" * tag '9p-for-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: mitigate inode collisions	2024-04-25 15:31:56 -07:00
Chuck Lever	18180a4550	NFSD: Fix nfsd4_encode_fattr4() crasher Ensure that args.acl is initialized early. It is used in an unconditional call to kfree() on the way out of nfsd4_encode_fattr4(). Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: `83ab8678ad` ("NFSD: Add struct nfsd4_fattr_args") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-04-25 18:06:41 -04:00
Jakub Kicinski	2bd87951de	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c `89884459a0` ("wifi: mac80211: fix idle calculation with multi-link") `87f5500285` ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c `1971d13ffa` ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") `4090fa373f` ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c `4dcd0e83ea` ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") `e2dc7bfd67` ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-04-25 12:41:37 -07:00
Steve French	8861fd5180	smb3: fix lock ordering potential deadlock in cifs_sync_mid_result Coverity spotted that the cifs_sync_mid_result function could deadlock "Thread deadlock (ORDER_REVERSAL) lock_order: Calling spin_lock acquires lock TCP_Server_Info.srv_lock while holding lock TCP_Server_Info.mid_lock" Addresses-Coverity: 1590401 ("Thread deadlock (ORDER_REVERSAL)") Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-04-25 12:49:50 -05:00
Steve French	8094a60024	smb3: missing lock when picking channel Coverity spotted a place where we should have been holding the channel lock when accessing the ses channel index. Addresses-Coverity: 1582039 ("Data race condition (MISSING_LOCK)") Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-04-25 12:47:23 -05:00
Linus Torvalds	e33c4963bf	nfsd-6.9 fixes: - Revert some backchannel fixes that went into v6.9-rc -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmYqZecACgkQM2qzM29m f5fOPhAAlxsB8ThAAyg+vOMeR6lxutEjPzez+U4M3N1s8dj9eMmcx8s6DPxsvfTS ExhcqxauCEwh9q+yZQ5ismWiE43Gvf39tvKGXzH30oM4v8xYo9nzxi6w9mEGyqR+ bkpbvbpuCfSeVY4WXD2rFI8DMh1G8PM9llue2QVLWPny2y5K2gAsbinpV/QQtdgp LEwhnGcX2ZmSh9yd7aGiOOO18CHU9rMygcb4XDzmwWggO8eCrElWbp4x8yuZJFAD OzuIx00WYny00rlMpZUyZ67hhg0+KcGQVldhr+A8f0K1baFCfW4QDq2T3RGTonoe DWdDGW7zwBbBw/rg4NUhdIZ9CArml9bB8ohPsaGr99e/95mtpSnXoEDJvKYgsFVE QembV7zPZqzEjtrR6g/tdf8/g1yRRwUMLze+RLjzz70TohLKvj7+qHCHirNMCMVE +HXvEdPKCzdZ24YST2wDh6sQ571ka6TX8fL6gYU8O+w+fiitLkEVXGE7KDJvynkm bsoJxTtbTuFleaW79d51VpIuiW1uyqKJrD1RnwDlMpp+HDtUnoVXU5SfGnmMp4KG Tz9Hx7hgFthXpajDpyzgLMnYVJ8IvXnoQaOp97mqFe/jSktXfslDdsj/R5ZAjZHO SfViZ8hjyoxrna7bPh250tSCb/S3qljhsIOc/3tAl8nPWGLoV28= =O2Ox -----END PGP SIGNATURE----- Merge tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Revert some backchannel fixes that went into v6.9-rc * tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: Revert "NFSD: Convert the callback workqueue to use delayed_work" Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"	2024-04-25 09:31:06 -07:00
Wu Bo	3763f9effc	f2fs: use helper to print zone condition To make code clean, use blk_zone_cond_str() to print debug information. Signed-off-by: Wu Bo <bo.wu@vivo.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-25 15:33:44 +00:00
Jaegeuk Kim	b864ddb57e	f2fs: fix false alarm on invalid block address f2fs_ra_meta_pages can try to read ahead on invalid block address which is not the corruption case. Cc: <stable@kernel.org> # v6.9+ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218770 Fixes: `31f85ccc84` ("f2fs: unify the error handling of f2fs_is_valid_blkaddr") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-25 15:33:09 +00:00
Jaegeuk Kim	2174035a7f	f2fs: clear writeback when compression failed Let's stop issuing compressed writes and clear their writeback flags. Reviewed-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2024-04-25 15:33:04 +00:00
Josef Bacik	0f2b8098d7	btrfs: take the cleaner_mutex earlier in qgroup disable One of my CI runs popped the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.9.0-rc4+ #1 Not tainted ------------------------------------------------------ btrfs/471533 is trying to acquire lock: ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0 but task is already holding lock: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->subvol_sem){++++}-{3:3}: down_read+0x42/0x170 btrfs_rename+0x607/0xb00 btrfs_rename2+0x2e/0x70 vfs_rename+0xaf8/0xfc0 do_renameat2+0x586/0x600 __x64_sys_rename+0x43/0x50 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}: down_write+0x3f/0xc0 btrfs_inode_lock+0x40/0x70 prealloc_file_extent_cluster+0x1b0/0x370 relocate_file_extent_cluster+0xb2/0x720 relocate_data_extent+0x107/0x160 relocate_block_group+0x442/0x550 btrfs_relocate_block_group+0x2cb/0x4b0 btrfs_relocate_chunk+0x50/0x1b0 btrfs_balance+0x92f/0x13d0 btrfs_ioctl+0x1abf/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}: __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 __mutex_lock+0xbe/0xc00 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->subvol_sem); lock(&sb->s_type->i_mutex_key#16); lock(&fs_info->subvol_sem); lock(&fs_info->cleaner_mutex); * DEADLOCK * 2 locks held by btrfs/471533: #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600 #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 stack backtrace: CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1 Call Trace: <TASK> dump_stack_lvl+0x77/0xb0 check_noncircular+0x148/0x160 ? lock_acquire+0xcb/0x2e0 __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? lock_is_held_type+0x9a/0x110 __mutex_lock+0xbe/0xc00 ? btrfs_quota_disable+0x54/0x4c0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? btrfs_quota_disable+0x54/0x4c0 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 ? srso_return_thunk+0x5/0x5f ? __do_sys_statfs+0x61/0x70 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 ? srso_return_thunk+0x5/0x5f ? reacquire_held_locks+0xd1/0x1f0 ? do_user_addr_fault+0x307/0x8a0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? find_held_lock+0x2b/0x80 ? srso_return_thunk+0x5/0x5f ? lock_release+0xca/0x2a0 ? srso_return_thunk+0x5/0x5f ? do_user_addr_fault+0x35c/0x8a0 ? srso_return_thunk+0x5/0x5f ? trace_hardirqs_off+0x4b/0xc0 ? srso_return_thunk+0x5/0x5f ? lockdep_hardirqs_on_prepare+0xde/0x190 ? srso_return_thunk+0x5/0x5f This happens because when we call rename we already have the inode mutex held, and then we acquire the subvol_sem if we are a subvolume. This makes the dependency inode lock -> subvol sem When we're running data relocation we will preallocate space for the data relocation inode, and we always run the relocation under the ->cleaner_mutex. This now creates the dependency of cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem Qgroup delete is doing this in the opposite order, it is acquiring the subvol_sem and then it is acquiring the cleaner_mutex, which results in this lockdep splat. This deadlock can't happen in reality, because we won't ever rename the data reloc inode, nor is the data reloc inode a subvolume. However this is fairly easy to fix, simply take the cleaner mutex in the case where we are disabling qgroups before we take the subvol_sem. This resolves the lockdep splat. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-04-25 16:23:09 +02:00
Dominique Martinet	9af503d912	btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks() The previous patch that replaced BUG_ON by error handling forgot to unlock the mutex in the error path. Link: https://lore.kernel.org/all/Zh%2fHpAGFqa7YAFuM@duo.ucw.cz Reported-by: Pavel Machek <pavel@denx.de> Fixes: `7411055db5` ("btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()") CC: stable@vger.kernel.org Reviewed-by: Pavel Machek <pavel@denx.de> Signed-off-by: Dominique Martinet <dominique.martinet@atmark-techno.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-04-25 16:23:01 +02:00
Yuezhang Mo	f19257997d	exfat: zero the reserved fields of file and stream extension dentries From exFAT specification, the reserved fields should initialize to zero and should not use for any purpose. If create a new dentry set in the UNUSED dentries, all fields had been zeroed when allocating cluster to parent directory. But if create a new dentry set in the DELETED dentries, the reserved fields in file and stream extension dentries may be non-zero. Because only the valid bit of the type field of the dentry is cleared in exfat_remove_entries(), if the type of dentry is different from the original(For example, a dentry that was originally a file name dentry, then set to deleted dentry, and then set as a file dentry), the reserved fields is non-zero. So this commit initializes the dentry to 0 before createing file dentry and stream extension dentry. Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com> Reviewed-by: Andy Wu <Andy.Wu@sony.com> Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com> Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>	2024-04-25 21:59:59 +09:00
Zhang Yi	e1f453d433	iomap: do some small logical cleanup in buffered write Since iomap_write_end() can never return a partial write length, the comparison between written, copied and bytes becomes useless, just merge them with the unwritten branch. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240320110548.2200662-10-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-25 14:23:54 +02:00
Zhang Yi	815f4b633b	iomap: make iomap_write_end() return a boolean For now, we can make sure iomap_write_end() always return 0 or copied bytes, so instead of return written bytes, convert to return a boolean to indicate the copied bytes have been written to the pagecache. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240320110548.2200662-9-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-25 14:23:54 +02:00
Zhang Yi	1a61d74932	iomap: use a new variable to handle the written bytes in iomap_write_iter() In iomap_write_iter(), the status variable used to receive the return value from iomap_write_end() is confusing, replace it with a new written variable to represent the written bytes in each cycle, no logic changes. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240320110548.2200662-8-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-25 14:23:54 +02:00
Zhang Yi	943bc0882c	iomap: don't increase i_size if it's not a write operation Increase i_size in iomap_zero_range() and iomap_unshare_iter() is not needed, the caller should handle it. Especially, when truncate partial block, we should not increase i_size beyond the new EOF here. It doesn't affect xfs and gfs2 now because they set the new file size after zero out, it doesn't matter that a transient increase in i_size, but it will affect ext4 because it set file size before truncate. So move the i_size updating logic to iomap_write_iter(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240320110548.2200662-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-25 14:23:54 +02:00
Zhang Yi	89c6c1d91a	iomap: drop the write failure handles when unsharing and zeroing Unsharing and zeroing can only happen within EOF, so there is never a need to perform posteof pagecache truncation if write begin fails, also partial write could never theoretically happened from iomap_write_end(), so remove both of them. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20240320110548.2200662-6-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-25 14:23:53 +02:00
Al Viro	958b9f85f8	erofs_buf: store address_space instead of inode ... seeing that ->i_mapping is the only thing we want from the inode. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-04-25 00:57:14 -04:00
Matthew Wilcox (Oracle)	fd1a745ce0	mm: support page_mapcount() on page_has_type() pages Return 0 for pages which can't be mapped. This matches how page_mapped() works. It is more convenient for users to not have to filter out these pages. Link: https://lkml.kernel.org/r/20240321142448.1645400-5-willy@infradead.org Fixes: `9c5ccf2db0` ("mm: remove HUGETLB_PAGE_DTOR") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-04-24 19:34:26 -07:00
Justin Stitt	f700b71927	fs: ecryptfs: replace deprecated strncpy with strscpy strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. A good alternative is strscpy() as it guarantees NUL-termination on the destination buffer. In crypto.c: We expect cipher_name to be NUL-terminated based on its use with the C-string format specifier %s and with other string apis like strlen(): \| printk(KERN_ERR "Error attempting to initialize key TFM " \| "cipher with name = [%s]; rc = [%d]\n", \| tmp_tfm->cipher_name, rc); and \| int cipher_name_len = strlen(cipher_name); In main.c: We can remove the manual NUL-byte assignments as well as the pointers to destinations (which I assume only existed to trim down on line length?) in favor of directly using the destination buffer which allows the compiler to get size information -- enabling the usage of the new 2-argument strscpy(). Note that this patch relies on the _new_ 2-argument versions of strscpy() and strscpy_pad() introduced in Commit `e6584c3964` ("string: Allow 2-argument strscpy()"). Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: <linux-hardening@vger.kernel.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240321-strncpy-fs-ecryptfs-crypto-c-v1-1-d78b74c214ac@google.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-24 16:57:38 -07:00
Justin Stitt	7dcbf17e3f	hfsplus: refactor copy_name to not use strncpy strncpy() is deprecated with NUL-terminated destination strings [1]. The copy_name() method does a lot of manual buffer manipulation to eventually arrive with its desired string. If we don't know the namespace this attr has or belongs to we want to prepend "osx." to our final string. Following this, we're copying xattr_name and doing a bizarre manual NUL-byte assignment with a memset where n=1. Really, we can use some more obvious string APIs to acomplish this, improving readability and security. Following the same control flow as before: if we don't know the namespace let's use scnprintf() to form our prefix + xattr_name pairing (while NUL-terminating too!). Otherwise, use strscpy() to return the number of bytes copied into our buffer. Additionally, for non-empty strings, include the NUL-byte in the length -- matching the behavior of the previous implementation. Note that strscpy() _can_ return -E2BIG but this is already handled by all callsites: In both hfsplus_listxattr_finder_info() and hfsplus_listxattr(), ret is already type ssize_t so we can change the return type of copy_name() to match (understanding that scnprintf()'s return type is different yet fully representable by ssize_t). Furthermore, listxattr() in fs/xattr.c is well-equipped to handle a potential -E2BIG return result from vfs_listxattr(): \| ssize_t error; ... \| error = vfs_listxattr(d, klist, size); \| if (error > 0) { \| if (size && copy_to_user(list, klist, error)) \| error = -EFAULT; \| } else if (error == -ERANGE && size >= XATTR_LIST_MAX) { \| /* The file system tried to returned a list bigger \| than XATTR_LIST_MAX bytes. Not possible. */ \| error = -E2BIG; \| } ... the error can potentially already be -E2BIG, skipping this else-if and ending up at the same state as other errors. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html Link: https://github.com/KSPP/linux/issues/90 Cc: <linux-hardening@vger.kernel.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240401-strncpy-fs-hfsplus-xattr-c-v2-1-6e089999355e@google.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-24 16:55:28 -07:00
Justin Stitt	31ca7e77fd	reiserfs: replace deprecated strncpy with scnprintf strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Our goal here is to get @namebuf populated with @name's contents but surrounded with quotes. There is some careful handling done to ensure we properly truncate @name so that we have room for a literal quote as well as a NUL-term. All this careful handling can be done with scnprintf using the dynamic string width specifier %.*s which allows us to pass in the max size for a source string. Doing this, we can put literal quotes in our format specifier and ensure @name is truncated to fit inbetween these quotes (-3 is from 2 quotes + 1 NUL-byte). All in all, we get to remove a deprecated use of strncpy and clean up this code nicely! Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: <linux-hardening@vger.kernel.org> Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240328-strncpy-fs-reiserfs-item_ops-c-v1-1-2dab6d22a996@google.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-24 16:53:15 -07:00
Max Filippov	10e29251be	binfmt_elf_fdpic: fix /proc/<pid>/auxv Althought FDPIC linux kernel provides /proc/<pid>/auxv files they are empty because there's no code that initializes mm->saved_auxv in the FDPIC ELF loader. Synchronize FDPIC ELF aux vector setup with ELF. Replace entry-by-entry aux vector copying to userspace with initialization of mm->saved_auxv first and then copying it to userspace as a whole. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Link: https://lore.kernel.org/r/20240322195418.2160164-1-jcmvbkbc@gmail.com Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-24 15:55:28 -07:00
Kees Cook	2a5eb99955	binfmt_elf: Leave a gap between .bss and brk Currently the brk starts its randomization immediately after .bss, which means there is a chance that when the random offset is 0, linear overflows from .bss can reach into the brk area. Leave at least a single page gap between .bss and brk (when it has not already been explicitly relocated into the mmap range). Reported-by: <y0un9n132@gmail.com> Closes: https://lore.kernel.org/linux-hardening/CA+2EKTVLvc8hDZc+2Yhwmus=dzOUG5E4gV7ayCbu0MPJTZzWkw@mail.gmail.com/ Link: https://lore.kernel.org/r/20240217062545.1631668-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org>	2024-04-24 12:20:35 -07:00
Andreas Gruenbacher	1e86044402	gfs2: Remove and replace gfs2_glock_queue_work There are no more callers of gfs2_glock_queue_work() left, so remove that helper. With that, we can now rename __gfs2_glock_queue_work() back to gfs2_glock_queue_work() to get rid of some unnecessary clutter. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	9947a06d29	gfs2: do_xmote fixes Function do_xmote() is called with the glock spinlock held. Commit `86934198ee` added a 'goto skip_inval' statement at the beginning of the function to further below where the glock spinlock is expected not to be held anymore. Then it added code there that requires the glock spinlock to be held. This doesn't make sense; fix this up by dropping and retaking the spinlock where needed. In addition, when ->lm_lock() returned an error, do_xmote() didn't fail the locking operation, and simply left the glock hanging; fix that as well. (This is a much older error.) Fixes: `86934198ee` ("gfs2: Clear flags when withdraw prevents xmote") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	1cd28e1586	gfs2: finish_xmote cleanup Currently, function finish_xmote() takes and releases the glock spinlock. However, all of its callers immediately take that spinlock again, so it makes more sense to take the spin lock before calling finish_xmote() already. With that, thaw_glock() is the only place that sets the GLF_HAVE_REPLY flag outside of the glock spinlock, but it also takes that spinlock immediately thereafter. Change that to set the bit when the spinlock is already held. This allows to switch from test_and_clear_bit() to test_bit() and clear_bit() in glock_work_func(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	a3730c5ec5	gfs2: Unlock fewer glocks on unmount At unmount time, we would generally like to explicitly unlock as few glocks as possible for efficiency. We are already skipping glocks that don't have a lock value block (LVB), but we can also skip glocks which are not held in DLM_LOCK_EX or DLM_LOCK_PW mode (of which gfs2 only uses DLM_LOCK_EX under the name LM_ST_EXCLUSIVE). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: David Teigland <teigland@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	d98779e687	gfs2: Fix potential glock use-after-free on unmount When a DLM lockspace is released and there ares still locks in that lockspace, DLM will unlock those locks automatically. Commit `fb6791d100` started exploiting this behavior to speed up filesystem unmount: gfs2 would simply free glocks it didn't want to unlock and then release the lockspace. This didn't take the bast callbacks for asynchronous lock contention notifications into account, which remain active until until a lock is unlocked or its lockspace is released. To prevent those callbacks from accessing deallocated objects, put the glocks that should not be unlocked on the sd_dead_glocks list, release the lockspace, and only then free those glocks. As an additional measure, ignore unexpected ast and bast callbacks if the receiving glock is dead. Fixes: `fb6791d100` ("GFS2: skip dlm_unlock calls in unmount") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Cc: David Teigland <teigland@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	59f6000579	gfs2: Remove ill-placed consistency check This consistency check was originally added by commit `9287c6452d` ("gfs2: Fix occasional glock use-after-free"). It is ill-placed in gfs2_glock_free() because if it holds there, it must equally hold in __gfs2_glock_put() already. Either way, the check doesn't seem necessary anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:48:20 +02:00
Andreas Gruenbacher	7a1ad9d812	gfs2: Fix lru_count accounting Currently, gfs2_scan_glock_lru() decrements lru_count when a glock is moved onto the dispose list. When such a glock is then stolen from the dispose list while gfs2_dispose_glock_lru() doesn't hold the lru_lock, lru_count will be decremented again, so the counter will eventually go negative. This bug has existed in one form or another since at least commit `97cc1025b1` ("GFS2: Kill two daemons with one patch"). Fix this by only decrementing lru_count when we actually remove a glock and schedule for it to be unlocked and dropped. We also don't need to remove and then re-add glocks when we can just as well move them back onto the lru_list when necessary. In addition, return the number of glocks freed as we should, not the number of glocks moved onto the dispose list. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2024-04-24 19:46:38 +02:00
Linus Torvalds	e88c4cfcb7	for-6.9-rc5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYnyRwACgkQxWXV+ddt WDtnlA/9GWPYrQFBBzPZeXZHldr7grn8t8oDVVvMhSxRslrk2XYGqRVEVfZT5+Wp wsbqdfmDXShstWkU43jxXcJAg4QxQNBBSVaKVMO5rXkM5ZHLdn78EJs5htSuy+67 n0zFqqxVr0F9LvrHqs/JJp70fr3WQtGINAkxda0JHKaMEj2nSnGjzKf6GAPomAs+ 7BWQlV4cc8tQAox2MxCFx1eXTISepa9pi0ojm0R+siZGgMkmzpjJTy9WZ3EtIWQN 4LU5FMCQMsKkqUETxsEs5Va0QkEvN3SuiNsoUIJZFSArws3cwYz+u10+8Rpc/U1o 8P3H8fGxvbYpRt6QjG1GGYy0LxhyYeyGp+fNfeBF4pl2MAn2e8qeiPpPlC+Q6/U2 /Nn8+x9/FgWQKNmu76DQ1BM4WoD18mEUQrB6OYLO/9FBttLAevaEO3vxECxosBIj wGyfXJ/r4Y2Vva+pkjreBpc7m/VwwOPGdRHkKk8mzFGqoQzSwvs/pm3ldF9dV7ud smZ0H8vvaEDigOd4oFR2vC2wpETaCL89oS9x/NzMvqlqQaGJoD6t591c3yRwmvro hJYT5lG6KR+ZeDgv+2ZZA2s5/2l7193pkS3toj7v1xSOkpjADPaTBg+2P5o27YyJ UvXbEpdcCV6xWxT3Ak/bkfS1dDGfxAtwV7c/sPGvY5KtmbvjOBE= =riVJ -----END PGP SIGNATURE----- Merge tag 'for-6.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix information leak by the buffer returned from LOGICAL_INO ioctl - fix flipped condition in scrub when tracking sectors in zoned mode - fix calculation when dropping extent range - reinstate fallback to write uncompressed data in case of fragmented space that could not store the entire compressed chunk - minor fix to message formatting style to make it conforming to the commonly used style * tag 'for-6.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range() btrfs: fix information leak in btrfs_ioctl_logical_to_ino() btrfs: fallback if compressed IO fails for ENOSPC btrfs: scrub: run relocation repair when/only needed btrfs: remove colon from messages with state	2024-04-24 09:22:51 -07:00
Christoph Hellwig	652efdeca5	xfs: don't call xfs_file_open from xfs_dir_open Directories do not support direct I/O and thus no non-blocking direct I/O either. Open code the shutdown check and call to generic_file_open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-24 14:46:51 +02:00
Christoph Hellwig	f50805713a	xfs: drop fop_flags for directories Directories have non of the capabilities, so drop the flags. Note that the current state is harmless as no one actually checks for the flags either. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-24 14:46:51 +02:00
Christoph Hellwig	19e048641b	xfs: fix overly long line in the file_operations Re-wrap the newly added fop_flags fields to not go over 80 characters. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-04-24 14:46:39 +02:00
Thomas Weißschuh	a35dd3a786	sysctl: drop now unnecessary out-of-bounds check Remove the now unneeded check for ctl_table_size; it is safe to do so as sysctl_set_perm_empty_ctl_header() does not access the ctl_table member anymore. This also makes the element of sysctl_mount_point unnecessary, so drop it at the same time. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Thomas Weißschuh	4a7b29f650	sysctl: move sysctl type to ctl_table_header Move the SYSCTL_TABLE_TYPE_{DEFAULT,PERMANENTLY_EMPTY} enums from ctl_table to ctl_table_header. Removing the mutable member is necessary to constify static instances of struct ctl_table. Move the initialization of the sysctl_mount_point type into init_header() where all the other header fields are also initialized. As a side-effect the memory usage of the sysctl core is reduced. Each ctl_table_header instance can manage multiple ctl_table instances and is only allocated when the table is actually registered. This saves 8 bytes of memory per ctl_table on 64bit, 4 due to the enum field itself and 4 due to padding. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Thomas Weißschuh	eb32d3adef	sysctl: drop sysctl_is_perm_empty_ctl_table It is used only twice and those callers are simpler with sysctl_is_perm_empty_ctl_header(). So use this sibling function. This is part of an effort to constify definition of struct ctl_table. For this effort the mutable member 'type' is moved from struct ctl_table to struct ctl_table_header. Unifying the macros sysctl_is_perm_empty_ctl_* makes this easier. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Thomas Weißschuh	520713a93d	sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) Remove the 'table' argument from set_ownership as it is never used. This change is a step towards putting "struct ctl_table" into .rodata and eventually having sysctl core only use "const struct ctl_table". The patch was created with the following coccinelle script: @@ identifier func, head, table, uid, gid; @@ void func( struct ctl_table_header head, - struct ctl_table table, kuid_t uid, kgid_t gid) { ... } No additional occurrences of 'set_ownership' were found after doing a tree-wide search. Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-04-24 09:43:54 +02:00
Jiapeng Chong	08e012a62d	xfs: Remove unused function xrep_dir_self_parent The function are defined in the dir_repair.c file, but not called elsewhere, so delete the unused function. fs/xfs/scrub/dir_repair.c:186:1: warning: unused function 'xrep_dir_self_parent'. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8867 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-04-24 12:34:44 +05:30
Gustavo A. R. Silva	9a1f1d04f6	smb: client: Fix struct_group() usage in __packed structs Use struct_group_attr() in __packed structs, instead of struct_group(). Below you can see the pahole output before/after changes: pahole -C smb2_file_network_open_info fs/smb/client/smb2ops.o struct smb2_file_network_open_info { union { struct { __le64 CreationTime; /* 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le64 AllocationSize; / 32 8 / __le64 EndOfFile; / 40 8 / __le32 Attributes; / 48 4 / }; / 0 56 / struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le64 AllocationSize; / 32 8 / __le64 EndOfFile; / 40 8 / __le32 Attributes; / 48 4 / } network_open_info; / 0 56 / }; / 0 56 / __le32 Reserved; / 56 4 / / size: 60, cachelines: 1, members: 2 / / last cacheline: 60 bytes / } __attribute__((__packed__)); pahole -C smb2_file_network_open_info fs/smb/client/smb2ops.o struct smb2_file_network_open_info { union { struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le64 AllocationSize; / 32 8 / __le64 EndOfFile; / 40 8 / __le32 Attributes; / 48 4 / } __attribute__((__packed__)); / 0 52 / struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le64 AllocationSize; / 32 8 / __le64 EndOfFile; / 40 8 / __le32 Attributes; / 48 4 / } __attribute__((__packed__)) network_open_info; / 0 52 / }; / 0 52 / __le32 Reserved; / 52 4 / / size: 56, cachelines: 1, members: 2 / / last cacheline: 56 bytes / }; pahole -C smb_com_open_rsp fs/smb/client/cifssmb.o struct smb_com_open_rsp { ... union { struct { __le64 CreationTime; / 48 8 / __le64 LastAccessTime; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / __le64 LastWriteTime; / 64 8 / __le64 ChangeTime; / 72 8 / __le32 FileAttributes; / 80 4 / }; / 48 40 / struct { __le64 CreationTime; / 48 8 / __le64 LastAccessTime; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / __le64 LastWriteTime; / 64 8 / __le64 ChangeTime; / 72 8 / __le32 FileAttributes; / 80 4 / } common_attributes; / 48 40 / }; / 48 40 / ... / size: 111, cachelines: 2, members: 14 / / last cacheline: 47 bytes / } __attribute__((__packed__)); pahole -C smb_com_open_rsp fs/smb/client/cifssmb.o struct smb_com_open_rsp { ... union { struct { __le64 CreationTime; / 48 8 / __le64 LastAccessTime; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / __le64 LastWriteTime; / 64 8 / __le64 ChangeTime; / 72 8 / __le32 FileAttributes; / 80 4 / } __attribute__((__packed__)); / 48 36 / struct { __le64 CreationTime; / 48 8 / __le64 LastAccessTime; / 56 8 / / --- cacheline 1 boundary (64 bytes) --- / __le64 LastWriteTime; / 64 8 / __le64 ChangeTime; / 72 8 / __le32 FileAttributes; / 80 4 / } __attribute__((__packed__)) common_attributes; / 48 36 / }; / 48 36 / ... / size: 107, cachelines: 2, members: 14 / / last cacheline: 43 bytes / } __attribute__((__packed__)); pahole -C FILE_ALL_INFO fs/smb/client/cifssmb.o typedef struct { union { struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le32 Attributes; / 32 4 / }; / 0 40 / struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le32 Attributes; / 32 4 / } common_attributes; / 0 40 / }; / 0 40 / ... / size: 113, cachelines: 2, members: 17 / / last cacheline: 49 bytes / } __attribute__((__packed__)) FILE_ALL_INFO; pahole -C FILE_ALL_INFO fs/smb/client/cifssmb.o typedef struct { union { struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le32 Attributes; / 32 4 / } __attribute__((__packed__)); / 0 36 / struct { __le64 CreationTime; / 0 8 / __le64 LastAccessTime; / 8 8 / __le64 LastWriteTime; / 16 8 / __le64 ChangeTime; / 24 8 / __le32 Attributes; / 32 4 / } __attribute__((__packed__)) common_attributes; / 0 36 / }; / 0 36 / ... / size: 109, cachelines: 2, members: 17 / / last cacheline: 45 bytes */ } __attribute__((__packed__)) FILE_ALL_INFO; Fixes: `0015eb6e12` ("smb: client, common: fix fortify warnings") Cc: stable@vger.kernel.org Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-04-24 01:06:13 -05:00
Chuck Lever	8ddb7142c8	Revert "NFSD: Convert the callback workqueue to use delayed_work" This commit was a pre-requisite for commit `c1ccfcf1a9` ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"), which has already been reverted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-04-23 20:12:41 -04:00
Chuck Lever	9c8ecb9308	Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down" The reverted commit attempted to enable NFSD to retransmit pending callback operations if an NFS client disconnects, but unintentionally introduces a hazardous behavior regression if the client becomes permanently unreachable while callback operations are still pending. A disconnect can occur due to network partition or if the NFS server needs to force the NFS client to retransmit (for example, if a GSS window under-run occurs). Reverting the commit will make NFSD behave the same as it did in v6.8 and before. Pending callback operations are permanently lost if the client connection is terminated before the client receives them. For some callback operations, this loss is not harmful. However, for CB_RECALL, the loss means a delegation might be revoked unnecessarily. For CB_OFFLOAD, pending COPY operations will never complete unless the NFS client subsequently sends an OFFLOAD_STATUS operation, which the Linux NFS client does not currently implement. These issues still need to be addressed somehow. Reported-by: Dai Ngo <dai.ngo@oracle.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-04-23 20:12:34 -04:00
Darrick J. Wong	5e1c7d0b29	xfs: invalidate dentries for a file before moving it to the orphanage Invalidate the cached dentries that point to the file that we're moving to lost+found before we actually move it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:19 -07:00
Darrick J. Wong	6d335233fe	xfs: exchange-range for repairs is no longer dynamic The atomic file exchange-range functionality is now a permanent filesystem feature instead of a dynamic log-incompat feature. It cannot be turned on at runtime, so we no longer need the XCHK_FSGATES flags and whatnot that supported it. Remove the flag and the enable function, and move the xfs_has_exchange_range checks to the start of the repair functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:19 -07:00
Darrick J. Wong	b44bfc0695	xfs: fix iunlock calls in xrep_adoption_trans_alloc If the transaction allocation in xrep_adoption_trans_alloc fails, we should drop only the locks that we took. In this case this is ILOCK_EXCL of both the orphanage and the file being repaired. Dropping any IOLOCK here is incorrect. Found by fuzzing u3.sfdir3.list[1].name = zeroes in xfs/1546. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:19 -07:00
Darrick J. Wong	6691753752	xfs: drop the scrub file's iolock when transaction allocation fails If the transaction allocation in the !orphanage_available case of xrep_nlinks_repair_inode fails, we need to drop the IOLOCK of the file being scrubbed before exiting. Found by fuzzing u3.sfdir3.list[1].name = zeroes in xfs/1546. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:18 -07:00
Darrick J. Wong	4ad350ac58	xfs: only iget the file once when doing vectored scrub-by-handle If a program wants us to perform a scrub on a file handle and the fd passed to ioctl() is not the file referenced in the handle, iget the file once and pass it into the scrub code. This amortizes the untrusted iget lookup over /all/ the scrubbers mentioned in the scrubv call. When running fstests in "rebuild all metadata after each test" mode, I observed a 10% reduction in runtime on account of avoiding repeated inobt lookups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:18 -07:00
Darrick J. Wong	b27ce0da60	xfs: use dontcache for grabbing inodes during scrub Back when I wrote commit `a03297a0ca`, I had thought that we'd be doing users a favor by only marking inodes dontcache at the end of a scrub operation, and only if there's only one reference to that inode. This was more or less true back when I_DONTCACHE was an XFS iflag and the only thing it did was change the outcome of xfs_fs_drop_inode to 1. Note: If there are dentries pointing to the inode when scrub finishes, the inode will have positive i_count and stay around in cache until dentry reclaim. But now we have d_mark_dontcache, which cause the inode and the dentries attached to it all to be marked I_DONTCACHE, which means that we drop the dentries ASAP, which drops the inode ASAP. This is bad if scrub found problems with the inode, because now they can be scheduled for inactivation, which can cause inodegc to trip on it and shut down the filesystem. Even if the inode isn't bad, this is still suboptimal because phases 3-7 each initiate inode scans. Dropping the inode immediately during phase 3 is silly because phase 5 will reload it and drop it immediately, etc. It's fine to mark the inodes dontcache, but if there have been accesses to the file that set up dentries, we should keep them. I validated this by setting up ftrace to capture xfs_iget_recycle* tracepoints and ran xfs/285 for 30 seconds. With current djwong-wtf I saw ~30,000 recycle events. I then dropped the d_mark_dontcache calls and set XFS_IGET_DONTCACHE, and the recycle events dropped to ~5,000 per 30 seconds. Therefore, grab the inode with XFS_IGET_DONTCACHE, which only has the effect of setting I_DONTCACHE for cache misses. Remove the d_mark_dontcache call that can happen in xchk_irele. Fixes: `a03297a0ca` ("xfs: manage inode DONTCACHE status at irele time") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:18 -07:00
Darrick J. Wong	c77b37584c	xfs: introduce vectored scrub mode Introduce a variant on XFS_SCRUB_METADATA that allows for a vectored mode. The caller specifies the principal metadata object that they want to scrub (allocation group, inode, etc.) once, followed by an array of scrub types they want called on that object. The kernel runs the scrub operations and writes the output flags and errno code to the corresponding array element. A new pseudo scrub type BARRIER is introduced to force the kernel to return to userspace if any corruptions have been found when scrubbing the previous scrub types in the array. This enables userspace to schedule, for example, the sequence: 1. data fork 2. barrier 3. directory If the data fork scrub is clean, then the kernel will perform the directory scrub. If not, the barrier in 2 will exit back to userspace. The alternative would have been an interface where userspace passes a pointer to an empty buffer, and the kernel formats that with xfs_scrub_vecs that tell userspace what it scrubbed and what the outcome was. With that the kernel would have to communicate that the buffer needed to have been at least X size, even though for our cases XFS_SCRUB_TYPE_NR + 2 would always be enough. Compared to that, this design keeps all the dependency policy and ordering logic in userspace where it already resides instead of duplicating it in the kernel. The downside of that is that it needs the barrier logic. When running fstests in "rebuild all metadata after each test" mode, I observed a 10% reduction in runtime due to fewer transitions across the system call boundary. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:18 -07:00
Darrick J. Wong	be7cf174e9	xfs: move xfs_ioc_scrub_metadata to scrub.c Move the scrub ioctl handler to scrub.c to keep the code together and to reduce unnecessary code when CONFIG_XFS_ONLINE_SCRUB=n. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:17 -07:00
Darrick J. Wong	271557de7c	xfs: reduce the rate of cond_resched calls inside scrub We really don't want to call cond_resched every single time we go through a loop in scrub -- there may be billions of records, and probing into the scheduler itself has overhead. Reduce this overhead by only calling cond_resched 10x per second; and add a counter so that we only check jiffies once every 1000 records or so. Surprisingly, this reduces scrub-only fstests runtime by about 2%. I used the bmapinflate xfs_db command to produce a billion-extent file and this stupid gadget reduced the scrub runtime by about 4%. From a stupid microbenchmark of calling these things 1 billion times, I estimate that cond_resched costs about 5.5ns per call; jiffes costs about 0.3ns per read; and fatal_signal_pending costs about 0.4ns per call. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:17 -07:00
Darrick J. Wong	3f31406aef	xfs: fix corruptions in the directory tree Repair corruptions in the directory tree itself. Cycles are broken by removing an incoming parent->child link. Multiply-owned directories are fixed by pruning the extra parent -> child links Disconnected subtrees are reconnected to the lost and found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:17 -07:00
Darrick J. Wong	37056912d5	xfs: report directory tree corruption in the health information Report directories that are the source of corruption in the directory tree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:17 -07:00
Darrick J. Wong	d54c5ac80f	xfs: invalidate dirloop scrub path data when concurrent updates happen Add a dirent update hook so that we can detect directory tree updates that affect any of the paths found by this scrubber and force it to rescan. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:16 -07:00
Darrick J. Wong	928b721a11	xfs: teach online scrub to find directory tree structure problems Create a new scrubber that detects corruptions within the directory tree structure itself. It can detect directories with multiple parents; loops within the directory tree; and directory loops not accessible from the root. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:16 -07:00
Darrick J. Wong	327ed702d8	xfs: inode repair should ensure there's an attr fork to store parent pointers The runtime parent pointer update code expects that any file being moved around the directory tree already has an attr fork. However, if we had to rebuild an inode core record, there's a chance that we zeroed forkoff as part of the inode to pass the iget verifiers. Therefore, if we performed any repairs on an inode core, ensure that the inode has a nonzero forkoff before unlocking the inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:16 -07:00
Darrick J. Wong	3f50ddbf4b	xfs: repair link count of nondirectories after rebuilding parent pointers Since the parent pointer scrubber does not exhaustively search the filesystem for missing parent pointers, it doesn't have a good way to determine that there are pointers missing from an otherwise uncorrupt xattr structure. Instead, for nondirectories it employs a heuristic of comparing the file link count to the number of parent pointers found. However, we don't want this heuristic flagging a false corruption after a repair has actually scanned the entire filesystem to rebuild the parent pointers. Therefore, reset the file link count in this one case because we actually know the correct link count. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:16 -07:00
Darrick J. Wong	7be3d20bbe	xfs: adapt the orphanage code to handle parent pointers Adapt the orphanage's adoption code to update the child file's parent pointers as part of the reparenting process. Also ensure that the child has an attr fork to receive the parent pointer update, since the runtime code assumes one exists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:15 -07:00
Darrick J. Wong	a26dc21309	xfs: actually rebuild the parent pointer xattrs Once we've assembled all the parent pointers for a file, we need to commit the new dataset atomically to that file. Parent pointer records are embedded in the xattr structure, which means that we must write a new extended attribute structure, again, atomically. Therefore, we must copy the non-parent-pointer attributes from the file being repaired into the temporary file's extended attributes and then call the atomic extent swap mechanism to exchange the blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:15 -07:00
Darrick J. Wong	6efbbdeb14	xfs: add a per-leaf block callback to xchk_xattr_walk Add a second callback function to xchk_xattr_walk so that we can do something in between attr leaf blocks. This will be used by the next patch to see if we should flush cached parent pointer updates to constrain memory usage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:15 -07:00
Darrick J. Wong	55edcd1f86	xfs: split xfs_bmap_add_attrfork into two pieces Split this function into two pieces -- one to make the actual changes to the inode core to add the attr fork, and another one to deal with getting the transaction and locking the inodes. The next couple of patches will need this to be split into two. One patch implements committing new parent pointer recordsets to damaged files. If one file has an attr fork and the other does not, we have to create the missing attr fork before the atomic swap transaction, and can use the behavior encoded in the current xfs_bmap_add_attrfork. The second patch adapts /lost+found adoptions to handle parent pointers correctly. The adoption process will add a parent pointer to a child that is being moved to /lost+found, but this requires that the attr fork already exists. We don't know if we're actually going to commit the adoption until we've already reserved a transaction and taken the ILOCKs, which means that we must have a way to bypass the start of the current xfs_bmap_add_attrfork. Therefore, create xfs_attr_add_fork as the helper that creates a transaction and takes locks; and make xfs_bmap_add_attrfork the function that updates the inode core and allocates the incore attr fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:15 -07:00
Darrick J. Wong	13db700789	xfs: remove pointless unlocked assertion Remove this assertion about the inode not having an attr fork from xfs_bmap_add_attrfork because the function handles that case just fine. Weirder still, the function actually /requires/ the caller not to hold the ILOCK, which means that its accesses are not stabilized. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:14 -07:00
Darrick J. Wong	65a1fb7a11	xfs: implement live updates for parent pointer repairs While we're scanning the filesystem for dirents that we can turn into parent pointers, we cannot hold the IOLOCK or ILOCK of the file being repaired. Therefore, we need to set up a dirent hook so that we can keep the temporary file's parent pionters up to date with the rest of the filesystem. Hence we add the ability to remove pptrs from the temporary file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:14 -07:00
Darrick J. Wong	b334f7fab5	xfs: repair directory parent pointers by scanning for dirents If parent pointers are enabled on the filesystem, we can repair the entire dataset by walking the directories of the filesystem looking for dirents that we can turn into parent pointers. Once we have a full incore dataset, we'll figure out what to do with it, but that's for a subsequent patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:14 -07:00
Darrick J. Wong	e5d7ce0364	xfs: replay unlocked parent pointer updates that accrue during xattr repair There are a few places where the extended attribute repair code drops the ILOCK to apply stashed xattrs to the temporary file. Although setxattr and removexattr are still locked out because we retain our hold on the IOLOCK, this doesn't prevent renames from updating parent pointers, because the VFS doesn't take i_rwsem on children that are being moved. Therefore, set up a dirent hook to capture parent pointer updates for this file, and replay(?) the updates. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:14 -07:00
Darrick J. Wong	8559b21a64	xfs: implement live updates for directory repairs While we're scanning the filesystem for parent pointers that we can turn into dirents, we cannot hold the IOLOCK or ILOCK of the directory being repaired. Therefore, we need to set up a dirent hook so that we can keep the temporary directory up to date with the rest of the filesystem. Hence we add the ability to remove entries from the temporary dir. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:13 -07:00
Darrick J. Wong	76fc23b695	xfs: repair directories by scanning directory parent pointers For filesystems with parent pointers, scan the entire filesystem looking for parent pointers that target the directory we're rebuilding instead of trying to salvage whatever we can from the directory data blocks. This will be more robust than salvaging, but there's more code to come. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 16:55:13 -07:00
Matthew Wilcox (Oracle)	91ba8cb1b0	isofs: Remove calls to set/clear the error flag Nobody checks the error flag on isofs folios, so stop setting and clearing it. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20240420025029.2166544-14-willy@infradead.org>	2024-04-23 23:22:11 +02:00
Matthew Wilcox (Oracle)	72a5425adc	ext2: Remove call to folio_set_error() Nobody checks this flag on ext2 folios, stop setting it. Cc: Jan Kara <jack@suse.com> Cc: linux-ext4@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20240420025029.2166544-10-willy@infradead.org>	2024-04-23 23:22:04 +02:00
Alexander Aring	7b72ab2c6a	dlm: return -ENOMEM if ls_recover_buf fails This patch fixes to return -ENOMEM in case of an allocation failure that was forgotten to change in commit `6c648035cb` ("dlm: switch to use rhashtable for rsbs"). Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202404200536.jGi6052v-lkp@intel.com/ Fixes: `6c648035cb` ("dlm: switch to use rhashtable for rsbs") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-04-23 16:08:55 -05:00
Linus Torvalds	9d1ddab261	smb3 client fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYm8oEACgkQiiy9cAdy T1HAVAv/T9NXYhTYQK8EK7DvzcGDV4dOcFI6GVsrinp1KHGyuxoPABKGctQXfou0 4DJ9ik0CYFWmVuz8CODxmHkJq8fclnkJtVPH4DjcOm8CqhcaMJNVgEWP6eMGevGo sQPEtFAYRPQrIm5X8u2uvARE490YiUD85Se+6LrYZdzt/BOUQQHtKodSTre1ZCAV F6GLEVVuncP9iqno9lNu1EI8ltcOW6i1i15s4HOmULNwtUKdwsYTWVqW6JDOy5gM 9YqXJhPobVcZnY/m8QVWfE/lEPOwJO5lbRF4Ktykz4PQkxZqD6t+Noesj73GKEgC 7jnt3L79s1zA51gHdn96Z1qWlaIruX4ugYhfzQAW7PfYmSUr3I8G09ofzwFc77aH osYoU6mZWWL+4/RIKK3DGYRe2ET68KmlNheG2OTOeCNfwjSoI5rWHSLMGqlegTc0 40gz62OHuwncMyKFLwGaf6ztzlOLHIdVU4uhRv1/9ptZuMl/2LoiN7G8ZddbtYIS JxL/kIEE =PaAG -----END PGP SIGNATURE----- Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - fscache fix - fix for case where we could use uninitialized lease - add tracepoint for debugging refcounting of tcon - fix mount option regression (e.g. forceuid vs. noforceuid when uid= specified) caused by conversion to the new mount API * tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: reinstate original behavior again for forceuid/forcegid smb: client: fix rename(2) regression against samba cifs: Add tracing for the cifs_tcon struct refcounting cifs: Fix reacquisition of volume cookie on still-live connection	2024-04-23 09:37:32 -07:00
Darrick J. Wong	5769aa41ee	xfs: add raw parent pointer apis to support repair Add a couple of utility functions to set or remove parent pointers from a file. These functions will be used by repair code, hence they skip the xattr logging that regular parent pointer updates use. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:04 -07:00
Darrick J. Wong	086e934fe9	xfs: salvage parent pointers when rebuilding xattr structures When we're salvaging extended attributes, make sure we validate the ones that claim to be parent pointers before adding them to the salvage pile. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:03 -07:00
Darrick J. Wong	bf61c36a45	xfs: make the reserved block permission flag explicit in xfs_attr_set Make the use of reserved blocks an explicit parameter to xfs_attr_set. Userspace setting XFS_ATTR_ROOT attrs should continue to be able to use it, but for online repairs we can back out and therefore do not care. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:03 -07:00
Darrick J. Wong	e7420e75ef	xfs: remove some boilerplate from xfs_attr_set In preparation for online/offline repair wanting to use xfs_attr_set, move some of the boilerplate out of this function into the callers. Repair can initialize the da_args completely, and the userspace flag handling/twisting goes away once we move it to xfs_attr_change. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:03 -07:00
Darrick J. Wong	59a2af9086	xfs: check parent pointer xattrs when scrubbing Check parent pointer xattrs as part of scrubbing xattrs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:03 -07:00
Darrick J. Wong	77ede5f44b	xfs: walk directory parent pointers to determine backref count If the filesystem has parent pointers enabled, walk the parent pointers of subdirectories to determine the true backref count. In theory each subdir should have a single parent reachable via dotdot, but in the case of (corrupt) subdirs with multiple parents, we need to keep the link counts high enough that the directory loop detector will be able to correct the multiple parents problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:03 -07:00
Darrick J. Wong	8ad345306d	xfs: deferred scrub of parent pointers If the trylock-based dirent check fails, retain those parent pointers and check them at the end. This may involve dropping the locks on the file being scanned, so yay. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:02 -07:00
Darrick J. Wong	0d29a20fbd	xfs: scrub parent pointers Actually check parent pointers now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:02 -07:00
Darrick J. Wong	b961c8bf1f	xfs: deferred scrub of dirents If the trylock-based parent pointer check fails, retain those dirents and check them at the end. This may involve dropping the locks on the file being scanned, so yay. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:02 -07:00
Darrick J. Wong	61b3f0df5c	xfs: check dirents have parent pointers If the fs has parent pointers, we need to check that each child dirent points to a file that has a parent pointer pointing back at us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:02 -07:00
Darrick J. Wong	2a009397eb	xfs: revert commit `44af6c7e59` In my haste to fix what I thought was a performance problem in the attr scrub code, I neglected to notice that the xfs_attr_get_ilocked also had the effect of checking that attributes can actually be looked up through the attr dabtree. Fix this. Fixes: `44af6c7e59` ("xfs: don't load local xattr values during scrub") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:01 -07:00
Darrick J. Wong	67ac7091e3	xfs: enable parent pointers Add parent pointers to the list of supported features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:01 -07:00
Darrick J. Wong	6ed858c7c6	xfs: drop compatibility minimum log size computations for reflink Let's also drop the oversized minimum log computations for reflink and rmap that were the result of bugs introduced many years ago. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:01 -07:00
Darrick J. Wong	7ea816ca40	xfs: fix unit conversion error in xfs_log_calc_max_attrsetm_res Dave and I were discussing some recent test regressions as a result of me turning on nrext64=1 on realtime filesystems, when we noticed that the minimum log size of a 32M filesystem jumped from 954 blocks to 4287 blocks. Digging through xfs_log_calc_max_attrsetm_res, Dave noticed that @size contains the maximum estimated amount of space needed for a local format xattr, in bytes, but we feed this quantity to XFS_NEXTENTADD_SPACE_RES, which requires units of blocks. This has resulted in an overestimation of the minimum log size over the years. We should nominally correct this, but there's a backwards compatibility problem -- if we enable it now, the minimum log size will decrease. If a corrected mkfs formats a filesystem with this new smaller log size, a user will encounter mount failures on an uncorrected kernel due to the larger minimum log size computations there. Therefore, turn this on for parent pointers because it wasn't merged at all upstream when this issue was discovered. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:01 -07:00
Allison Henderson	5f98ec1cb5	xfs: add a incompat feature bit for parent pointers Create an incompat feature bit and a fs geometry flag so that we can enable the feature in the ondisk superblock and advertise its existence to userspace. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-04-23 07:47:01 -07:00
Allison Henderson	7dafb449b7	xfs: don't remove the attr fork when parent pointers are enabled When an inode is removed, it may also cause the attribute fork to be removed if it is the last attribute. This transaction gets flushed to the log, but if the system goes down before we could inactivate the symlink, the log recovery tries to inactivate this inode (since it is on the unlinked list) but the verifier trips over the remote value and leaks it. Hence we ended up with a file in this odd state on a "clean" mount. The "obvious" fix is to prohibit erasure of the attr fork to avoid tripping over the verifiers when pptrs are enabled. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Darrick J. Wong	233f4e12bb	xfs: add parent pointer ioctls This patch adds a pair of new file ioctls to retrieve the parent pointer of a given inode. They both return the same results, but one operates on the file descriptor passed to ioctl() whereas the other allows the caller to specify a file handle for which the caller wants results. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Darrick J. Wong	b8c9d4253d	xfs: split out handle management helpers a bit Split out the functions that generate file/fs handles and map them back into dentries in preparation for the GETPARENTS ioctl next. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Darrick J. Wong	af69d852df	xfs: move handle ioctl code to xfs_handle.c Move the handle managemnet code (and the attrmulti code that uses it) to xfs_handle.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Allison Henderson	8f4b980ee6	xfs: pass the attr value to put_listent when possible Pass the attr value to put_listent when we have local xattrs or shortform xattrs. This will enable the GETPARENTS ioctl to use xfs_attr_list as its backend. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:47:00 -07:00
Allison Henderson	daf9f88490	xfs: don't return XFS_ATTR_PARENT attributes via listxattr Parent pointers are internal filesystem metadata. They're not intended to be directly visible to userspace, so filter them out of xfs_xattr_put_listent so that they don't appear in listxattr. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Inspired-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: change this to XFS_ATTR_PRIVATE_NSP_MASK per fsverity patchset] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:59 -07:00
Allison Henderson	1c12949e50	xfs: Add parent pointers to xfs_cross_rename Cross renames are handled separately from standard renames, and need different handling to update the parent attributes correctly. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:59 -07:00
Allison Henderson	5a8338c882	xfs: Add parent pointers to rename This patch removes the old parent pointer attribute during the rename operation, and re-adds the updated parent pointer. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: adjust to new ondisk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:59 -07:00
Allison Henderson	d2d18330f6	xfs: remove parent pointers in unlink This patch removes the parent pointer attribute during unlink Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: adjust to new ondisk format, minor rebase fixes] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:59 -07:00
Allison Henderson	5d31a85dcc	xfs: add parent attributes to symlink This patch modifies xfs_symlink to add a parent pointer to the inode. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor rebase fixups] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:58 -07:00
Allison Henderson	f1097be220	xfs: add parent attributes to link This patch modifies xfs_link to add a parent pointer to the inode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: minor rebase fixes] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:58 -07:00
Allison Henderson	b7c62d90c1	xfs: parent pointer attribute creation Add parent pointer attribute during xfs_create, and subroutines to initialize attributes. Note that the xfs_attr_intent object contains a pointer to the caller's xfs_da_args object, so the latter must persist until transaction commit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: shorten names, adjust to new format, set init_xattrs for parent pointers] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:58 -07:00
Darrick J. Wong	fb102fe7fe	xfs: create a hashname function for parent pointers Although directory entry and parent pointer recordsets look very similar (name -> ino), there's one major difference between them: a file can be hardlinked from multiple parent directories with the same filename. This is common in shared container environments where a base directory tree might be hardlink-copied multiple times. IOWs the same 'ls' program might be hardlinked to multiple /srv/*/bin/ls paths. We don't want parent pointer operations to bog down on hash collisions between the same dirent name, so create a special hash function that mixes in the parent directory inode number. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:58 -07:00
Allison Henderson	7dba4a5fe1	xfs: extend transaction reservations for parent attributes We need to add, remove or modify parent pointer attributes during create/link/unlink/rename operations atomically with the dirents in the parent directories being modified. This means they need to be modified in the same transaction as the parent directories, and so we need to add the required space for the attribute modifications to the transaction reservations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: fix indenting errors, adjust for new log format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:57 -07:00
Allison Henderson	a08d672963	xfs: add parent pointer validator functions The attr name of a parent pointer is a string, and the attr value of a parent pointer is (more or less) a file handle. So we need to modify attr_namecheck to verify the parent pointer name, and add a xfs_parent_valuecheck function to sanitize the handle. At the same time, we need to validate attr values during log recovery if the xattr is really a parent pointer. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: move functions to xfs_parent.c, adjust for new disk format] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:57 -07:00
Allison Henderson	297da63379	xfs: Expose init_xattrs in xfs_create_tmpfile Tmp files are used as part of rename operations and will need attr forks initialized for parent pointers. Expose the init_xattrs parameter to the calling function to initialize the fork. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:57 -07:00
Darrick J. Wong	ae673f534a	xfs: record inode generation in xattr update log intent items For parent pointer updates, record the i_generation of the file that is being updated so that we don't accidentally jump generations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:57 -07:00
Darrick J. Wong	5773f7f82b	xfs: create attr log item opcodes and formats for parent pointers Make the necessary alterations to the extended attribute log intent item ondisk format so that we can log parent pointer operations. This requires the creation of new opcodes specific to parent pointers, and a new four-argument replace operation to handle renames. At this point this part of the patchset has changed so much from what Allison original wrote that I no longer think her SoB applies. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:57 -07:00
Darrick J. Wong	a918f5f2cd	xfs: refactor xfs_is_using_logged_xattrs checks in attr item recovery Move this feature check down to the per-op checks so that we can ensure that we never see parent pointer attr items on non-pptr filesystems, and that logged xattrs are turned on for non-pptr attr items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:56 -07:00
Darrick J. Wong	f041455eb5	xfs: allow xattr matching on name and value for parent pointers If a file is hardlinked with the same name but from multiple parents, the parent pointers will all have the same dirent name (== attr name) but with different parent_ino/parent_gen values. To disambiguate, we need to be able to match on both the attr name and the attr value. This is in contrast to regular xattrs, which are matchtg edit d only on name. Therefore, plumb in the ability to match shortform and local attrs on name and value in the XFS_ATTR_PARENT namespace. Parent pointer attr values are never large enough to be stored in a remote attr, so we need can reject these cases as corruption. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:56 -07:00
Allison Henderson	8337d58ab2	xfs: define parent pointer ondisk extended attribute format We need to define the parent pointer attribute format before we start adding support for it into all the code that needs to use it. The EA format we will use encodes the following information: name={dirent name} value={parent inumber, parent inode generation} hash=xfs_dir2_hashname(dirent name) ^ (parent_inumber) The inode/gen gives all the information we need to reliably identify the parent without requiring child->parent lock ordering, and allows userspace to do pathname component level reconstruction without the kernel ever needing to verify the parent itself as part of ioctl calls. By using the name-value lookup mode in the extended attribute code to match parent pointers using both the xattr name and value, we can identify the exact parent pointer EA we need to modify/remove in rename/unlink operations without searching the entire EA space. By storing the dirent name, we have enough information to be able to validate and reconstruct damaged directory trees. Earlier iterations of this patchset encoded the directory offset in the parent pointer key, but this format required repair to keep that in sync across directory rebuilds, which is unnecessary complexity. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:56 -07:00
Allison Henderson	98493ff878	xfs: add parent pointer support to attribute code Add the new parent attribute type. XFS_ATTR_PARENT is used only for parent pointer entries; it uses reserved blocks like XFS_ATTR_ROOT. Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:56 -07:00
Darrick J. Wong	a64e013475	xfs: create a separate hashname function for extended attributes Create a separate function to compute name hashvalues for extended attributes. When we get to parent pointers we'll be altering the rules so that metadump obfuscation doesn't turn heinous. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:55 -07:00
Darrick J. Wong	9713dc8877	xfs: move xfs_attr_defer_add to xfs_attr_item.c Move the code that adds the incore xfs_attr_item deferred work data to a transaction live with the ATTRI log item code. This means that the upper level extended attribute code no longer has to know about the inner workings of the ATTRI log items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:55 -07:00
Christoph Hellwig	f49af061f4	xfs: check the flags earlier in xfs_attr_match Checking the flags match is much cheaper than a memcmp, so do it early on in xfs_attr_match, and also add a little helper to calculate the match mask right under the comment explaining the logic for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2024-04-23 07:46:55 -07:00
Darrick J. Wong	63211876ce	xfs: rearrange xfs_attr_match parameters Rearrange the parameters to this function so that they match the order of attr listent: attr_flags -> name -> namelen -> value -> valuelen. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:55 -07:00
Darrick J. Wong	ea0b3e8147	xfs: enforce one namespace per attribute Create a standardized helper function to enforce one namespace bit per extended attribute, and refactor all the open-coded hweight logic. This function is not a static inline to avoid porting hassles in userspace. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:54 -07:00
Darrick J. Wong	ffdcc3b8eb	xfs: refactor name/value iovec validation in xlog_recover_attri_commit_pass2 Hoist the code that checks the attr name and value iovecs into separate helpers so that we can add more callsites for the new parent pointer attr intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:54 -07:00
Darrick J. Wong	50855427c2	xfs: refactor name/length checks in xfs_attri_validate Move the name and length checks into the attr op switch statement so that we can perform more specific checks of the value length. Over the next few patches we're going to add new attr op flags with different validation requirements. While we're at it, remove the incorrect comment. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:54 -07:00
Darrick J. Wong	c07f018bc0	xfs: use local variables for name and value length in _attri_commit_pass2 We're about to start using tagged unions in the xattr log format, so create a bunch of local variables in the recovery function so we only have to decode the log item fields once. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:54 -07:00
Darrick J. Wong	0aeeeb7969	xfs: always set args->value in xfs_attri_item_recover Always set args->value to the recovered value buffer. This reduces the amount of code in the switch statement, and hence the amount of thinking that I have to do. We validated the recovered buffers, supposedly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:54 -07:00
Darrick J. Wong	1c7f09d210	xfs: validate recovered name buffers when recovering xattr items Strengthen the xattri log item recovery code by checking that we actually have the required name and newname buffers for whatever operation we're replaying. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:53 -07:00
Darrick J. Wong	2a2c05d013	xfs: use helpers to extract xattr op from opflags Create helper functions to extract the xattr op from the ondisk xattri log item and the incore attr intent item. These will get more use in the patches that follow. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:53 -07:00
Darrick J. Wong	992c3b5c3f	xfs: restructure xfs_attr_complete_op a bit Eliminate the local variable from this function so that we can streamline things a bit later when we add the PPTR_REPLACE op code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:53 -07:00
Darrick J. Wong	309dc9cbbb	xfs: check shortform attr entry flags specifically While reviewing flag checking in the attr scrub functions, we noticed that the shortform attr scanner didn't catch entries that have the LOCAL or INCOMPLETE bits set. Neither of these flags can ever be set on a shortform attr, so we need to check this narrower set of valid flags. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:53 -07:00
Darrick J. Wong	f660ec8eae	xfs: fix missing check for invalid attr flags The xattr scrubber doesn't check for undefined flags in shortform attr entries. Therefore, define a mask XFS_ATTR_ONDISK_MASK that has all possible XFS_ATTR_* flags in it, and use that to check for unknown bits in xchk_xattr_actor. Refactor the check in the dabtree scanner function to use the new mask as well. The redundant checks need to be in place because the dabtree check examines the hash mappings and therefore needs to decode the attr leaf entries to compute the namehash. This happens before the walk of the xattr entries themselves. Fixes: `ae0506eba7` ("xfs: check used space of shortform xattr structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-04-23 07:46:53 -07:00

... 10 11 12 13 14 ...

92046 Commits