Commit Graph

2615 Commits

Author SHA1 Message Date
Rob Norris
574eec2964 dnode: remove dn_dirtyctx and dnode_dirtycontext
Only used for a couple of debug assertions which had very little value.

Setting it required taking certain locks, so we can remove all that too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:38 -07:00
Rob Norris
aa6f0f878b dnode: remove dn_dirtyctx_firstset
Old debug param, not used for anything.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:36 -07:00
Rob Norris
eecff1b4a9 dnode: remove dn_dirty_txg and DNODE_IS_DIRTY
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only
existed to ensure that a dnode was clean before making it eligible for
removal from the array of cached dnodes attached to the object 0 L0
dbuf.

dn_dirtycnt is enough to check that now, so use it directly and remove
the rest.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:35 -07:00
Rob Norris
3abf72b251 dnode: add dn_dirtycnt, count of number of txgs this dnode is dirty on
Bumped when we take the dirty hold in dnode_setdirty(), dropped when the
dnode is finally cleaned up after sync in dnode_rele_task() or
userquota_updates_task().

This gives us a way to check if the dnode is dirty on any txg without
having to rely on outside information (eg presence on a dirty list),
which has been a rich source of bugs in the past.
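
A hedged sketch of the counter discipline described above (locking
and placement assumed from the message, not copied from the patch):

    /* dnode_setdirty(): one tick for each txg the dnode dirties */
    mutex_enter(&dn->dn_mtx);
    dn->dn_dirtycnt++;
    ASSERT3U(dn->dn_dirtycnt, <=, TXG_SIZE);
    mutex_exit(&dn->dn_mtx);

    /* dnode_rele_task()/userquota_updates_task(): drop after sync */
    mutex_enter(&dn->dn_mtx);
    ASSERT3U(dn->dn_dirtycnt, >, 0);
    dn->dn_dirtycnt--;
    mutex_exit(&dn->dn_mtx);

A dirtiness check then reduces to reading dn_dirtycnt != 0, with no
dirty-list lookup required.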

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Suggested-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16297
Closes #17652
Closes #17658
2025-08-21 06:05:29 -07:00
Rob Norris
dcd73069f0 zvol_remove_minors_impl: remove all async fallbacks
Since both the ZFS and OS sides of a zvol now take care of their own
locking and don't get in each other's way, there's no need for the
very complicated removal code to fall back to async tasks if the locks
needed at each stage can't be obtained right away.

Here we change it to a linear three-step process: select zvols of
interest and flag them for removal, then wait for them to shed
activity and remove them, and finally, free them.
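
In outline (a sketch only; the helpers below are invented for
illustration and are not the patch's actual functions):

    /* 1. select: flag matching zvols; no new opens after this */
    zvol_remove_select(name, &candidates);

    /* 2. drain: wait for holds to drop, then unpublish minors */
    zvol_remove_drain(&candidates);

    /* 3. free: nothing can reach the flagged zvols any more */
    zvol_remove_free(&candidates);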

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:47 -07:00
Rob Norris
96f9d271ea zvol: remove the OS-side minor before freeing the zvol
When destroying a zvol, it is not "unpublished" from the system (that
is, the /dev/zd* node removed) until zvol_os_free(). Under Linux, at
the time del_gendisk() and put_disk() are called, the device node may
still have an active hold, from a userspace program or something
inside the kernel (a partition probe). As it is currently, this can
lead to calls to zvol_open() or zvol_release() while the zvol_state_t
is partially or fully freed. zvol_open() has some protection against
this by checking that private_data is NULL, but zvol_release() does
not.

This implements a better ordering for all of this by adding a new
OS-side method, zvol_os_remove_minor(), which is responsible for fully
decoupling the "private" (OS-side) objects from the zvol_state_t. For
Linux, that means calling put_disk(), nulling private_data, and freeing
zv_zso.

This takes the place of zvol_os_clear_private(), which was a nod in that
direction but did not do enough, and did not do it early enough.

Equivalent changes are made on the FreeBSD side to follow the API
change.
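
A hedged sketch of the Linux-side shape (field names assumed from the
existing zvol code; the ordering is the point):

    void
    zvol_os_remove_minor(zvol_state_t *zv)
    {
        struct zvol_state_os *zso = zv->zv_zso;

        /* unpublish /dev/zd* so no new opens can begin */
        del_gendisk(zso->zvo_disk);

        /* decouple OS-side state before the zvol_state_t goes */
        zso->zvo_disk->private_data = NULL;
        put_disk(zso->zvo_disk);

        kmem_free(zso, sizeof (*zso));
        zv->zv_zso = NULL;
    }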

Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17625
2025-08-19 10:06:21 -07:00
Ameer Hamza
7b54567c1f
trace_zil.h: rename zcw_zio_error to zcw_error
Rename `zcw_zio_error` to `zcw_error` in `trace_zil.h`, which was
missed in commit f562e0f69. This fixes compilation errors exposed when
building with `--with-linux=`.

Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17654
2025-08-19 10:54:50 -04:00
Brian Behlendorf
5061f959d1
Retire zfs_autoimport_disable kmod option
Back in 2014 the zfs_autoimport_disable module option was added to
control whether the kmods should load the pool configs from the cache
file on module load.  The default value since that time has been for
the kernel to not process the cache file.

Detecting and importing pools during boot is now controlled outside
of the kmod on both Linux and FreeBSD.  By all accounts this has been
working well and we can remove this dormant code on the kernel side.

The spa_config_load() function has been moved to userspace; it is
now only used by libzpool.  Additionally, the spa_boot_init() hook
which was used by FreeBSD now looks to be unused and was removed.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17618
2025-08-14 14:58:58 -07:00
Alexander Motin
d151432073
ZIL: Make allocations more flexible
When the ZIL allocates space for new LWBs without knowing how much it
will require, it can use the new metaslab_alloc_range() function to
allocate slightly more or less than it predicted.  This allows it to
improve space efficiency by allocating bigger LWBs on RAIDZ/dRAID
instead of padding, possibly packing more ZIL records into them.
It may also reduce ganging in some cases by allowing smaller LWBs to
be allocated when we are not sure we'll need bigger ones.

On the opposite side, when we allocate space for already closed LWBs
and know precisely how much space we need, we may just allocate that
amount instead of relying on writing less than allocated, which does
not work for RAIDZ.

Space for LWBs in the open state (still being filled) is allocated
the same as before.
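
A hedged sketch of the open-LWB case (the metaslab_alloc_range()
signature here is an assumption, not a quote from the patch):

    uint64_t asize = 0;    /* actual size allocated */

    /* willing to accept anything between minsize and maxsize */
    error = metaslab_alloc_range(spa, mc, minsize, maxsize, bp, 1,
        txg, NULL, flags, &zal, allocator, &asize);
    if (error == 0)
        lwb->lwb_sz = asize;    /* fill toward what we got */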

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17613
2025-08-14 08:50:17 -07:00
Joel Low
bb9225ea86 Backport AVX2 AES-GCM implementation from BoringSSL
This uses the AVX2 versions of the AESENC and PCLMULQDQ instructions; on
Zen 3 this provides an up to 80% performance improvement.

Original source:
d5440dd2c2/gen/bcm/aes-gcm-avx2-x86_64-linux.S

See the original BoringSSL commit at
3b6e1be439.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
Signed-off-by: Joel Low <joel@joelsplace.sg>
Closes #17058
2025-08-13 14:51:20 -07:00
Alexander Motin
e0e60d319c
Better pack struct zio_prop
By using precisely sized fields it is possible to reduce the size
of this structure, and respectively of struct zio that includes it,
by 40 bytes (from 92 to 52).
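
Illustrative only (not the actual layout): the change is the usual
trick of replacing full-width enums and booleans with precisely
sized fields and single-bit flags, e.g.:

    typedef struct zio_prop_sketch {
        uint8_t    zp_checksum;       /* was: enum zio_checksum */
        uint8_t    zp_compress;       /* was: enum zio_compress */
        uint8_t    zp_complevel;
        uint8_t    zp_level;
        uint8_t    zp_copies;
        uint8_t    zp_dedup:1;        /* was: boolean_t (4 bytes) */
        uint8_t    zp_dedup_verify:1;
        uint8_t    zp_nopwrite:1;
    } zio_prop_sketch_t;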

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17619
2025-08-12 13:28:46 -07:00
Rob Norris
f562e0f691 ZIL: single zil_commit_waiter_done() function to complete a waiter
Just making it easier to not get the locking and broadcast wrong.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:22 -07:00
Rob Norris
92da3e18c8 ZIL: flag crashed LWBs so we know not to process them
If the ZIL crashed, any outstanding LWBs are no longer interesting, so
if they return, we need to just clean them up and return, not try to do
any work on them. This is true even if they return success, as that may
be long after the pool suspended and resumed, depending on when/if the
kernel decides to return the IO to us. In particular, we must not try to
get the "next" LWB from zl_lwb_list, since they're no longer on that
list.

So, we put a flag on in-flight LWBs in zil_crash() when we move them
from zl_lwb_list to zl_lwb_crash_list, so we know what's going on when
they return.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:24:16 -07:00
Rob Norris
508c546975 ZIL: use a bitfield for LWB "slog" and "slim" state flags
I'm soon about to need another LWB flag, and boolean_t is just so big
for only storing a single bit. Changing to a bitfield is far less
wasteful.
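
The shape of the change (the flag names below are assumed, matching
this series' usage, rather than quoted from the diff):

    /* before: two 4-byte booleans in struct lwb */
    boolean_t    lwb_slog;
    boolean_t    lwb_slim;

    /* after: a single flags word with room to grow */
    #define LWB_FLAG_SLOG    (1U << 0)    /* lwb on a SLOG device */
    #define LWB_FLAG_SLIM    (1U << 1)    /* lwb uses slim header */
    uint32_t     lwb_flags;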

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17622
2025-08-12 13:23:59 -07:00
Rob Norris
391e85f519 ZIL: add zil_commit_flags() to make honouring failmode= optional
The vast majority of calls to zil_commit() follow VFS ops, and should
honour the failmode= setting - either wait for sync, or return error.
Some calls however are part of a larger syncing op, and shouldn't ever
block if something goes wrong.

To allow this, we introduce zil_commit_flags(), with a flag
ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should
be honoured. zil_commit() is now a wrapper that always sets this flag,
but any caller wanting a different behaviour can request ZIL_COMMIT_NOW
instead to have the call return failure if the pool suspends, regardless
of the failmode= setting.
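
A hedged sketch of the resulting call pattern:

    /* VFS paths keep the old behaviour */
    int
    zil_commit(zilog_t *zilog, uint64_t oid)
    {
        return (zil_commit_flags(zilog, oid, ZIL_COMMIT_FAILMODE));
    }

    /* a caller inside a larger syncing op opts out of blocking */
    error = zil_commit_flags(zilog, oid, ZIL_COMMIT_NOW);
    if (error != 0)
        return (error);    /* pool suspended; unwind, don't wait */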

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:33 -07:00
Rob Norris
72602f6ad9 ZIL: "crash" the ZIL if the pool suspends during fallback
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks
on suspend. We want it to not block on suspend, instead returning an
error. On the surface, this is simple: change all calls to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error
return back to the zil_commit() caller.

Handling suspension means returning an error to all commit waiters. This
is relatively straightforward, as zil_commit_waiter_t already has
zcw_zio_error to hold the write IO error, which signals a fallback to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the
waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated
write (LWB) completes. If the pool has suspended, those IOs may not
return for some time, or maybe not at all. We still want to signal those
waiters so they can return from zil_commit(). We have a list of those
in-flight LWBs on zl_lwb_list, so we can run through those, detach them
and signal them. The LWB itself is still in-flight, but no longer has
attached waiters, so when it returns there will be nothing to do.

(As an aside, ITXs can also supply completion callbacks, which are
called when they are destroyed. These are directly connected to LWBs
though, so are passed the error code and destroyed there too).

At this point, all ZIL waiters have been ejected, so we only have to
consider the internal state. We potentially still have ITXs that have
not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL
is in an unknown state; some writes may have been written but not
returned to us. We really can't rely on any of it; the best thing to do
is abandon it entirely and start over when the pool returns to service.
But, since we may have IO out that won't return until the pool resumes,
we need something for it to return to.

The simplest solution I could find, implemented here, is to "crash" the
ZIL: accept no new ITXs, make no further updates, and let it empty out
on its normal schedule, that is, as txgs complete and zil_sync() and
zil_clean() are called. We set a "restart txg" to three txgs in the
future (syncing + TXG_CONCURRENT_STATES), at which point all the
internal state will have been cleared out, and the ZIL can resume
operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:
 - sets the restart txg
 - captures and signals all waiters
 - zeroes the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND)
returns because the pool suspended (ESHUTDOWN).

The rest of the commit is just threading the errors through, and related
housekeeping.
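
A hedged outline of zil_crash() assembled from the description above
(lock and field names are assumptions, not quotes):

    static void
    zil_crash(zilog_t *zilog)
    {
        mutex_enter(&zilog->zl_lock);

        /* no ZIL activity until syncing + TXG_CONCURRENT_STATES */
        zilog->zl_restart_txg =
            spa_syncing_txg(zilog->zl_spa) + TXG_CONCURRENT_STATES;

        /*
         * Move in-flight LWBs to zl_lwb_crash_list and complete
         * their waiters with an error; zero the in-memory header
         * so nothing is replayed when the pool returns.
         */

        mutex_exit(&zilog->zl_lock);
    }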

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:26 -07:00
Rob Norris
99a5f5d1ba ZIL: pass commit errors back to ITX callbacks
ITX callbacks are used to signal that something can be cleaned up after
an itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.

This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:20 -07:00
Rob Norris
967b15b888 ZIL: allow zil_commit() to fail with error
This changes zil_commit() to have an int return, and updates all callers
to check it. There are no corresponding internal changes yet; it will
always return 0.

Since zil_commit() is an indication that the caller _really_ wants the
associated data to be durably stored, I've annotated it with the
__warn_unused_result__ compiler attribute (via __must_check), to emit a
warning if it's ever used without doing something with the return code.
I hope this will mean we never misuse it in the future.
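
The annotation in practice (__must_check expands to
__attribute__((__warn_unused_result__))):

    extern int zil_commit(zilog_t *zilog, uint64_t oid) __must_check;

    zil_commit(zilog, oid);            /* now warns: result ignored */

    int error = zil_commit(zilog, oid);    /* callers must consume */
    if (error != 0)
        return (error);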

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:09 -07:00
Rob Norris
82d6f7b047 Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL)
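
For example (an illustrative call site, not taken from the diff):

    /* before */
    VERIFY3P(dn->dn_bonus, ==, NULL);

    /* after */
    VERIFY0P(dn->dn_bonus);
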
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:42 -07:00
Rob Norris
f7bdd84328 Prefer VERIFY0P(n) over VERIFY(n == NULL)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:37 -07:00
Rob Norris
c39e076f23 Prefer VERIFY0(n) over VERIFY(n == 0)
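
For example (an illustrative call site, not taken from the diff):

    /* before */
    VERIFY(nvlist_lookup_nvlist(nvl, name, &child) == 0);

    /* after */
    VERIFY0(nvlist_lookup_nvlist(nvl, name, &child));
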
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:40:59 -07:00
Attila Fülöp
03592417cb SIMD: Don't require definition of HAVE_XSAVE
Currently we fail the compilation via the #error directive if
`HAVE_XSAVE` isn't defined. This breaks i586 builds since we check
the toolchain's SIMD support only on i686 and onward.

Remove the requirement to fix the build on i586.
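
The shape of the fix, as a hedged sketch (the surrounding context is
the SIMD feature-detection headers; exact code assumed):

    #if defined(HAVE_XSAVE)
        /* xgetbv()-based OS support detection, as before */
    #else
        /* no toolchain XSAVE support (e.g. i586): report SIMD as
         * unavailable instead of failing the build with #error */
    #endif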

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #13303
Closes #17590
2025-08-06 14:34:53 -07:00
Alexander Motin
60f714e6e2 Implement physical rewrites
Based on the previous commit this implements the `zfs rewrite -P`
flag, making ZFS keep blocks' logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshot space usage will at the same
time reflect the additional space usage from newly allocated blocks.

Since this begins to use the new "rewrite" flag in the block
pointers, this commit introduces a new read-compatible per-dataset
feature, physical_rewrite.  It must be enabled for the command to
not fail; it is activated on first use and deactivated on deletion
of the last affected dataset.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:56 -07:00
Alexander Motin
4ae8bf406b Allow physical rewrite without logical
During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning the logical birth time is still set to the current TXG, but
the physical one may be copied from the original block that was used.
This represents the fact that logically the user data has changed,
but physically it is the same old block.

But block rewrite introduces a new situation, when a block is not
changed logically, but is stored in a different place in the pool.
From the ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  A somewhat similar thing happens during the remap phase
of device removal, but in that case blocks are still accounted as
allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing us to differentiate a physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device removal case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that the error log is now kept in terms of
physical birth times, rather than logical, since if a block with a
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in the
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to the lack of such
writes, but they will come in a following patch.
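
A hedged sketch of what a traverse consumer does with the new flag
(the birth-time accessors are the current macros; usage assumed):

    uint64_t birth;

    if (flags & TRAVERSE_LOGICAL)
        birth = BP_GET_LOGICAL_BIRTH(bp);   /* send/diff/redact */
    else
        birth = BP_GET_PHYSICAL_BIRTH(bp);  /* ARC/scrub view */

    if (birth <= from_txg)
        return (0);    /* unchanged from this perspective */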

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:07 -07:00
Mariusz Zaborski
894edd084e
Add TXG timestamp database
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of data
management and recovery operations.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16853
2025-08-06 10:31:21 -07:00
Rob Norris
a18c9edda6 Linux: sync: remove async/sync accounting
All this machinery is there to try to understand when there is an
async writeback waiting to complete because the intent log callbacks
are still outstanding, and to force them with a timely zil_commit().
The next commit fixes this properly, so there's no need for all this
extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-06 09:54:30 -07:00
Paul Dagnelie
31c4fa93bb Fix dynamic gang block headers on raidz and mirror devices
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17587
2025-08-06 09:50:58 -07:00
Fedor Uporov
0b6fd024a7
ZVOL: Unify zvol minors operations and improve error handling
Zvol minors creation logic is now passed through spa_zvol_taskq, as
is already done for the remove/rename zvol minors functions. The
zvol minors creation functions are refactored accordingly:
- zvol_create_minor()/zvol_minors_create_recursive() were removed.
- A single zvol_create_minors() is added instead.

It also becomes possible to collect the status of zvol minors
subtasks, to detect whether some subtask in the chain has failed. An
appropriate message is reported to the zfs_dbgmsg buffer in this case.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17575
2025-08-06 10:10:52 -04:00
khoang98
0f8a1105ee
Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy
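
A hedged sketch of the guard (the reclaim-context predicate is an
assumption; the real check may differ per-OS):

    static void
    dbuf_evict_notify(uint64_t size)
    {
        /* never evict directly from kswapd/pagedaemon context */
        if (current_is_reclaim_thread())
            return;

        if (size > dbuf_cache_target_bytes())
            dbuf_evict_one();
    }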

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes #17561
2025-08-01 16:47:41 -07:00
Rob Norris
1aec627c60
linux/atomic: fill out API for atomic pointer ops
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17580
2025-07-31 15:51:47 -07:00
Igor Ostapenko
cb5e7e097d
range_tree: Provide more debug details upon unexpected add/remove
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17581
2025-07-31 10:44:42 -04:00
rmacklem
2957eabbef
Add support for FreeBSD's Solaris style extended attribute interface
FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface
for extended attributes.  This patch wires this interface into the
FreeBSD ZFS port, since this style of extended attributes is supported
by OpenZFS internally when the "xattr" property is set to "dir".

Some specific changes:
- LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR
  for calls to zfs_zaccess(); V_NAMEDATTR indicates that the access
  checking does need to be done for FreeBSD.
- The access checking code for extended attributes was copy/pasted
  from the Linux port into zfs_zaccess() in the FreeBSD port.
- Most of the changes are in zfs_freebsd_lookup() and
  zfs_freebsd_create(); their semantics should remain unchanged unless
  named attributes are being manipulated.

All the code changes are enabled for __FreeBSD_version 1500040 and
newer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes #17540
2025-07-30 09:49:43 -07:00
Alexander Motin
f70c85086b
BRT: Fix ZAP entry endianness
During the original block cloning implementation a mistake was made,
making BRT ZAP entries an array of 8 1-byte entries instead of 1
entry of 8 bytes. This makes the pools non-endian-safe.

This commit introduces a new read-compatible pool feature
"com.truenas:block_cloning_endian", fixing the endianness issue
for new pools while maintaining compatibility with existing ones.

The feature is automatically activated when creating the first BRT
ZAP (ensuring we don't activate it on pools that already have BRT
entries in the old format).  When active, BRT entries are stored
as single 8-byte values.
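
The difference in ZAP terms (illustrative; BRT uses the uint64-keyed
ZAP variants, details elided here):

    uint64_t refcnt;

    /* old, endian-unsafe: eight 1-byte integers per entry */
    zap_update(os, brt_zapobj, key, 1, 8, &refcnt, tx);

    /* new, with block_cloning_endian active: one 8-byte integer,
     * byteswapped correctly on foreign-endian imports */
    zap_update(os, brt_zapobj, key, 8, 1, &refcnt, tx);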

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17572
2025-07-30 09:42:47 -07:00
Akash B
b6e8db509d
zpool/zfs: Add '-a|--all' option to scrub, trim, initialize
Add support for the '-a | --all' option to perform trim,
scrub, and initialize operations on all pools.
Previously, specifying a pool name was mandatory for
these operations. With this enhancement, users can now
execute these operations across all pools at once,
without needing to manually iterate over each pool
from the command line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Akash B <akash-b@hpe.com>
Closes #17524
2025-07-29 14:50:44 -07:00
Rob Norris
00ce064d8f
spa: update blkptr diagram to include vdev padding on encrypted blocks
Probably just an oversight in 4d044c4c1d. SPA_VDEVBITS is always 24,
regardless of whether or not the bp is for an encrypted block, and it
wouldn't make sense for it to be different anyway.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17564
2025-07-24 09:50:23 -04:00
Rob Norris
9292071565 linux/kmem: remove HAVE_ATOMIC64_T and kmem_alloc_used wrappers
Seems like we haven't set it since the SPL was pulled into the main ZFS
tree. In removing the define, I've taken the 64-bit version (ie the one
that _hasn't_ been running since back then) because it looks like it's
closer to the intended width by the way it's used.

Since the macros are no longer needed as a selector, pull those too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:08:07 -07:00
Rob Norris
1c483cf3d0 linux/kmem: remove long-obsolete __GFP compat flags
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:07:53 -07:00
Rob Norris
96d20d7d59 linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:07:36 -07:00
shodanshok
a7a144e655
enforce arc_dnode_limit
The Linux kernel shrinker, in the context of the null/root memcg,
does not scan dentry and inode caches added by a task running in a
non-root memcg. For ZFS this means that the dnode cache routinely
overflows, evicting valuable meta/data and putting additional memory
pressure on the system.

This patch restores zfs_prune_aliases as a fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes.
Moreover, it (indirectly) calls arc_evict when dnode_size >
dnode_limit.
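
A hedged sketch of the fallback path (names from the message; the
exact call site and signature are assumptions):

    if (arc_dnode_size() > arc_dnode_limit) {
        /*
         * The kernel shrinker made no progress; prune dentries
         * and inodes ourselves so their dnodes become evictable.
         */
        zfs_prune_aliases(zfsvfs, nr_to_scan);
    }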

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17487
Closes #17542
2025-07-21 10:32:01 -07:00
Alexander Motin
be1e991a1a
Allow and prefer special vdevs as ZIL
Before this change ZIL blocks were allocated only from normal or
SLOG vdevs.  In the typical situation where special vdevs are SSDs
and normal ones are HDDs, this could cause weird inversions where
data blocks are written to SSDs but the ZIL blocks referencing them
go to HDDs.

This change assumes that special vdevs typically have much better
(or at least not worse) latency than normal ones, and so in the
absence of SLOGs should store ZIL blocks.  It means, similar to
normal vdevs, the introduction of a special embedded log allocation
class and updating the allocation fallback order to: SLOG -> special
embedded log -> special -> normal embedded log -> normal.

The code tries to guess whether a data block is going to be written
to a normal or a special vdev (it can not be done precisely before
compression) and prefers indirect writes for blocks written to a
special vdev to avoid double-write.  For blocks that are going to be
written to a normal vdev, the special vdev by default plays as SLOG,
reducing write latency at the cost of higher special vdev wear, but
this is tunable via a module parameter.

This should allow HDD pools with a decent SSD as a special vdev to
work under synchronous workloads without requiring an additional
SLOG SSD, which is impractical in many scenarios.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17505
2025-07-18 18:44:14 -07:00
Chunwei Chen
2669b00f13
Define sops->free_inode() to prevent use-after-free during lookup
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.

However, zfs currently frees the inode in zfs_inode_destroy()
synchronously, and we can't use the GPL-only call_rcu() in zfs
directly. Fortunately, on Linux 5.2 and later, if we define
sops->free_inode(), the kernel will do the call_rcu() for us.
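
A minimal sketch of the hook (Linux >= 5.2; the body is an
assumption about where the znode allocation lives):

    static void
    zpl_free_inode(struct inode *ip)
    {
        /* reached via call_rcu(), after the RCU grace period */
        kmem_cache_free(znode_cache, ITOZ(ip));
    }

    const struct super_operations zpl_super_operations = {
        /* ... */
        .free_inode     = zpl_free_inode,
    };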

This issue may be triggered more easily with init_on_free=1 boot
parameter:

BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17546
2025-07-18 08:45:13 -07:00
Rob Norris
d323fbf49c FreeBSD: zfs_putpages: don't undirty pages until after write completes
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7dc1) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.
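
A hedged sketch of the sync-case completion (FreeBSD page KPIs; the
callback name and argument shape are assumptions):

    static void
    zfs_putpages_commit_cb(void *arg, int err)
    {
        vm_page_t pp = arg;

        /* the write is durable (or failed); finish what putpages
         * deferred when it returned VM_PAGER_PEND */
        vm_page_undirty(pp);
        vm_page_sunbusy(pp);
    }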

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-07-15 15:58:15 -07:00
Mark Johnston
ee2a2d941a Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes"
This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7dc1.

Reported by:    Ihor Antonov <ngor@hugpoint.tech>
Discussed with: Rob Norris <rob.norris@klarasystems.com>

References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Ported-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-07-15 15:58:11 -07:00
Rob Norris
fce18e04d5 libzpool: tunable-based option interface for zdb/ztest
Removes the old dlsym() based option setter and adds a new function
handle_tunable_option() that can set, get and list all the tunables
in the system, and then wires it up to zdb and ztest.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17537
2025-07-15 15:47:03 -07:00
Rob Norris
3a494c6d2a mod.h: make consistent across all three platforms
mod.h only exists to include the platform-specific mod_os.h, so we can
get rid of it and just call the platform header mod.h.

Then, create a libspl mod.h, and move the relevant items to it so we can
start building on it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17537
2025-07-15 15:46:14 -07:00
Paul Dagnelie
a981cb69e4 Implement dynamic gang header sizes
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:02:53 -07:00
Paul Dagnelie
e845be28e7 Add no-upgrade featureflag
Adds a featureflag that is not enabled during upgrades unless listed
explicitly. This is useful for features that could cause issues unless
applied carefully; for example, a feature that could make a root pool
unbootable if bootloaders don't yet have support for it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:01:59 -07:00
Rob Norris
6af8db61b1
metaslab: don't pass whole zio to throttle reserve APIs
They only need a couple of fields, and passing the whole thing just
invites fiddling around inside it, like modifying flags, which then
makes it much harder to understand the zio state from inside zio.c.

We move the flag update to just after a successful throttle in zio.c.

Also rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED: this
better describes what it means, and makes it look less like
IO_IS_ALLOCATING, which means something different.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17508
2025-07-04 23:22:22 -04:00
Rob Norris
92d3b4ee2c
zio: rename io_reexecute as io_post; use it for the direct IO checksum error flag
We're not supposed to modify someone else's io_flags, so we need another
way to propagate DIO_CHKSUM_ERR.

If we squint, we can see that io_reexecute is really just recording
exceptional events that a parent (or its parents) will need to do
something about. It just happens that the only things we've had
historically are two forms of reexecution: now or later (suspend).

So, rename it to io_post, as in, post-IO info/events/actions. And now we
have a few spare bits for other conditions.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17507
2025-07-04 23:16:14 -04:00
Alexander Motin
4e92aee233
Relax special_small_blocks restrictions
special_small_blocks is applied to blocks after compression, so it
makes no sense to demand its values be a power of 2.  At most they
could be a multiple of 512, but that would still buy us nothing, so
let's allow any value up to SPA_MAXBLOCKSIZE.

Also, special_small_blocks does not really need to depend on the set
recordsize, enabled pool features or presence of a special vdev.  At
worst in any of those cases it will just do nothing, so we should
not complicate users' lives with artificial limitations.

While there, polish comments for recordsize and volblocksize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17497
2025-07-02 11:11:37 -07:00