Proxmox-Port/zfs - zfs - Gitea: Git with a cup of tea

mirror of https://github.com/openzfs/zfs.git synced 2025-10-01 02:46:29 +00:00

Author	SHA1	Message	Date
Rob Norris	ef4058fcdc	FreeBSD: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC) writeback. The only thing we need to do when a page writeback fails is to skip marking the page clean ("undirty"), since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an async writeback on the vnode cache first, so any dirty data has been recorded in the ZIL, ready for the followup zfs_sync()->zil_commit() to find. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:44 -07:00
Rob Norris	3d6ee9a68c	Linux: zfs_putpage: handle page writeback errors Page writeback is considered completed when the associated itx callback completes. A syncing writeback will receive the error in its callback directly, but an in-flight async writeback that was promoted to sync by the ZIL may also receive an error. Writeback errors, even syncing writeback errors, are not especially serious on their own, because the error will ultimately be returned to the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg msync()) or to zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback (kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER)). The only thing we need to do when a page writeback fails is to re-mark the page dirty, since we don't know if it made it to disk yet. This will ensure that it gets written out again in the future, either some scheduled async writeback or another explicit syncing call. On the other side, we need to make sure that if a syncing op arrives, any changes on dirty pages are written back to the DMU and/or the ZIL first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the file mapping at the start of the sync op (fsync(), msync(), etc). An async op will get an async itx created and logged, ready for the followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit() call for every page in the range. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:38 -07:00
Rob Norris	391e85f519	ZIL: add zil_commit_flags() to make honouring failmode= optional The vast majority of calls to zil_commit() follow VFS ops, and should honour the failmode= setting - either wait for sync, or return error. Some calls however are part of a larger syncing op, and shouldn't ever block if something goes wrong. To allow this, we introduce zil_commit_flags(), with a flag ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should be honoured. zil_commit() is now a wrapper that always sets this flag, but any caller wanting a different behaviour can request ZIL_COMMIT_NOW instead to have the call return failure if the pool suspends, regardless of the failmode= setting. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:33 -07:00
Rob Norris	72602f6ad9	ZIL: "crash" the ZIL if the pool suspends during fallback If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks on suspend. We want it to not block on suspend, instead returning an error. On the surface, this is simple: change all calls to txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error return back to the zil_commit() caller. Handling suspension means returning an error to all commit waiters. This is relatively straightforward, as zil_commit_waiter_t already has zcw_zio_error to hold the write IO error, which signals a fallback to txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the waiter can now return an error from zil_commit(). However, commit waiters are normally signalled when their associated write (LWB) completes. If the pool has suspended, those IOs may not return for some time, or maybe not at all. We still want to signal those waiters so they can return from zil_commit(). We have a list of those in-flight LWBs on zl_lwb_list, so we can run through those, detach them and signal them. The LWB itself is still in-flight, but no longer has attached waiters, so when it returns there will be nothing to do. (As an aside, ITXs can also supply completion callbacks, which are called when they are destroyed. These are directly connected to LWBs though, so are passed the error code and destroyed there too). At this point, all ZIL waiters have been ejected, so we only have to consider the internal state. We potentially still have ITXs that have not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL is in an unknown state; some writes may have been written but not returned to us. We really can't rely on any of it; the best thing to do is abandon it entirely and start over when the pool returns to service. But, since we may have IO out that won't return until the pool resumes, we need something for it to return to. 
The simplest solution I could find, implemented here, is to "crash" the ZIL: accept no new ITXs, make no further updates, and let it empty out on its normal schedule, that is, as txgs complete and zil_sync() and zil_clean() are called. We set a "restart txg" to three txgs in the future (syncing + TXG_CONCURRENT_STATES), at which point all the internal state will have been cleared out, and the ZIL can resume operation (handled at the top of zil_clean()). This commit adds zil_crash(), which handles all of the above: - sets the restart txg - capture and signal all waiters - zero the header zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND) returns because the pool suspended (ESHUTDOWN). The rest of the commit is just threading the errors through, and related housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:26 -07:00
Rob Norris	99a5f5d1ba	ZIL: pass commit errors back to ITX callbacks ITX callbacks are used to signal that something can be cleaned up after an itx is committed. Presently that's only used when syncing out mapped pages (msync()) to mark dirty pages clean. This extends the callback interface so it can be passed an error, and take a different cleanup action if necessary. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:20 -07:00
Rob Norris	967b15b888	ZIL: allow zil_commit() to fail with error This changes zil_commit() to have an int return, and updates all callers to check it. There are no corresponding internal changes yet; it will always return 0. Since zil_commit() is an indication that the caller _really_ wants the associated data to be durably stored, I've annotated it with the __warn_unused_result__ compiler attribute (via __must_check), to emit a warning if it's ever used without doing something with the return code. I hope this will mean we never misuse it in the future. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17398	2025-08-08 16:43:09 -07:00
Rob Norris	b270663e8a	linux/zvol_os: fix crash with blk-mq on Linux 4.19 `03987f71e3` (#16069) added a workaround to get the blk-mq hardware context for older kernels that don't cache it in the struct request. However, this workaround appears to be incomplete. In 4.19, the rq data context is optional. If it's not initialised, then the cached rq->cpu will be -1, and so using it to index into mq_map causes a crash. Given that the upstream 4.19 is now in extended LTS and rarely seen, RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached hardware context has been available since 5.1, I'm not going to huge lengths to get queue selection correct for the very few people that are likely to feel it. To that end, we simply call raw_smp_processor_id() to get a valid CPU id and use that instead. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17597	2025-08-08 09:39:14 -07:00
Rob Norris	82d6f7b047	Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:42 -07:00
Rob Norris	f7bdd84328	Prefer VERIFY0P(n) over VERIFY(n == NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:37 -07:00
Rob Norris	611b95da18	Prefer VERIFY0(n) over VERIFY3S(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:32 -07:00
Rob Norris	5c7df3bcac	Prefer VERIFY0(n) over VERIFY3U(n, ==, 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:41:25 -07:00
Rob Norris	c39e076f23	Prefer VERIFY0(n) over VERIFY(n == 0) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17591	2025-08-07 11:40:59 -07:00
Rob Norris	e44e51f28d	zvol_task_report_status: gate behind ZFS_DEBUG dprintf() is a no-op in production builds, giving a compile warning. So, refactor it a little to keep all the strings inside the function, and then make the function a no-op when ZFS_DEBUG is not set. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Closes #17596	2025-08-07 11:36:15 -07:00
Rob Norris	e6eb03a991	zvol_check_volblocksize: fix spa ref leak Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Closes #17596	2025-08-07 11:36:09 -07:00
Rob Norris	3e671f2353	zvol: remove void return casts on void-returning functions Casting unused returns to (void) is already of dubious value, but it's entirely meaningless on functions that are defined as void return. Remove the clutter. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Railway Corporation Closes #17596	2025-08-07 11:34:20 -07:00
Alek P	3e004369f7	Removed unused zio_decompress_fail_fraction variable Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alek Pinchuk <alek.pinchuk@connectwise.com> Closes #17599	2025-08-06 17:10:03 -07:00
Alexander Motin	60f714e6e2	Implement physical rewrites Based on the previous commit, this implements the `zfs rewrite -P` flag, making ZFS keep blocks' logical birth times while rewriting files. It should exclude the rewritten blocks from incremental sends, snapshot diffs, etc. Snapshot space usage will at the same time reflect the additional space usage from newly allocated blocks. Since this begins to use the new "rewrite" flag in the block pointers, this commit introduces a new read-compatible per-dataset feature physical_rewrite. It must be enabled for the command to not fail; it is activated on first use and deactivated on deletion of the last affected dataset. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:56 -07:00
Alexander Motin	4ae8bf406b	Allow physical rewrite without logical During regular block writes ZFS sets both logical and physical birth times equal to the current TXG. During dedup and block cloning the logical birth time is still set to the current TXG, but the physical one may be copied from the original block that was used. This represents the fact that logically the user data has changed, but physically it is the same old block. But block rewrite introduces a new situation, when a block is not changed logically, but is stored in a different place of the pool. From ARC, scrub and some other perspectives this is a new block, but for example for user applications or incremental replication it is not. A somewhat similar thing happens during the remap phase of device removal, but in that case space blocks are still accounted as allocated at their logical birth times. This patch introduces a new "rewrite" flag in the block pointer structure, allowing to differentiate physical rewrite (when the block is actually reallocated at the physical birth time) from the device removal case (when the logical birth time is used). The new functionality is not used at this point, and the only expected change is that the error log is now kept in terms of physical birth times, rather than logical, since if a block with a logged error was somehow rewritten, then the previous error does not matter any more. This change also introduces a new TRAVERSE_LOGICAL flag to the traverse code, allowing zfs send, redact and diff to work in the context of logical birth times, ignoring physical-only rewrites. It also changes nothing at this point due to the lack of those writes, but they will come in a following patch. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17565	2025-08-06 10:36:07 -07:00
Mariusz Zaborski	894edd084e	Add TXG timestamp database This feature enables tracking of when TXGs are committed to disk, providing an estimated timestamp for each TXG. With this information, it becomes possible to perform scrubs based on specific date ranges, improving the granularity of data management and recovery operations. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #16853	2025-08-06 10:31:21 -07:00
Rob Norris	c3496b5cc6	Linux: zfs_putpage: document (and fix!) confusing sync/commit modes The structure of zfs_putpage() and its callers is tricky to follow. There's a lot more we could do to improve it, but at least now we have some description of one of the trickier bits. Writing this exposed a very subtle bug: most async pages pushed out through zpl_putpages() would go to the ZIL with commit=false, which can yield a less-efficient write policy. So this commit updates that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17584	2025-08-06 09:55:58 -07:00
Rob Norris	fb7a8503bc	Linux: zfs_putpage: complete async page writeback immediately For async page writeback, we do not need to wait for the page to be on disk before returning to the caller; it's enough that the data from the dirty page be on the DMU and in the in-memory ZIL, just like any other write. So, if this is not a syncing write, don't add a callback to the itx, and instead just unlock the page immediately. (This is effectively the same concept used for FreeBSD in `d323fbf49c`). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17584 Closes #14290	2025-08-06 09:55:50 -07:00
Rob Norris	a18c9edda6	Linux: sync: remove async/sync accounting All this machinery is there to try to understand when there is an async writeback waiting to complete because the intent log callbacks are still outstanding, and force them with a timely zil_commit(). The next commit fixes this properly, so there's no need for all this extra housekeeping. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17584	2025-08-06 09:54:30 -07:00
Paul Dagnelie	31c4fa93bb	Fix dynamic gang block headers on raidz and mirror devices Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #17587	2025-08-06 09:50:58 -07:00
Fedor Uporov	0b6fd024a7	ZVOL: Unify zvol minors operations and improve error handling Zvol minors creation logic is now passed through spa_zvol_taskq, as is already done for the remove/rename zvol minors functions. The zvol minors creation functions are refactored accordingly: - The zvol_create_minor()/zvol_minors_create_recursive() functions were removed. - A single zvol_create_minors() is added instead. Also, it becomes possible to collect the status of zvol minors subtasks, to detect whether some zvol minor subtask has failed in the subtasks chain. The appropriate message is reported to the zfs_dbgmsg buffer in this case. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17575	2025-08-06 10:10:52 -04:00
khoang98	0f8a1105ee	Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim threads waiting for the dbuf hash lock in the call sequence: dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Kaitlin Hoang <kthoang@amazon.com> Closes #17561	2025-08-01 16:47:41 -07:00
Fedor Uporov	92da9e0e93	ZVOL: Implement zvol_alloc() function on FreeBSD side Implement zvol_alloc() function on FreeBSD side to increase code base compatibility with Linux. Also, fix issue with late returning in case if volmode=none. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17482	2025-07-31 11:02:09 -04:00
Igor Ostapenko	cb5e7e097d	range_tree: Provide more debug details upon unexpected add/remove Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com> Closes #17581	2025-07-31 10:44:42 -04:00
rmacklem	2957eabbef	Add support for FreeBSD's Solaris style extended attribute interface FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface for extended attributes. This patch wires this interface into the FreeBSD ZFS port, since this style of extended attributes is supported by OpenZFS internally when the "xattr" property is set to "dir". Some specific changes: LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR for calls to zfs_zaccess(). V_NAMEDATTR indicates that the access checking does need to be done for FreeBSD. The access checking code for extended attributes was copy/pasted from the Linux port into zfs_zaccess() in the FreeBSD port. Most of the changes are in zfs_freebsd_lookup() and zfs_freebsd_create(). The semantics of these functions should remain unchanged unless named attributes are being manipulated. All the code changes are enabled for __FreeBSD_version 1500040 and newer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17540	2025-07-30 09:49:43 -07:00
Fedor Uporov	dea0fc969b	ZVOL: Return early, if volmode is ZFS_VOLMODE_NONE on FreeBSD side Return from zvol_os_create_minor() function immediately after dsl_prop_get_integer() call if volmode property value is set to 'none', like it is doing on Linux side. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17405	2025-07-30 09:46:34 -07:00
Alexander Motin	f70c85086b	BRT: Fix ZAP entry endianness During original block cloning implementation a mistake was made, making BRT ZAP entries an array of 8 1-byte entries instead of 1 entry of 8 bytes. This makes the pools non-endian-safe. This commit introduces a new read-compatible pool feature "com.truenas:block_cloning_endian", fixing the endianness issue for new pools while maintaining compatibility with existing ones. The feature is automatically activated when creating the first BRT ZAP (ensuring we don't activate it on pools that already have BRT entries in the old format). When active, BRT entries are stored as single 8-byte values. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <alexander.motin@TrueNAS.com> Closes #17572	2025-07-30 09:42:47 -07:00
Tino Reichardt	10a78e2647	Faster checksum benchmark on system boot While booting, only the needed 256KiB benchmarks are done now. The delay for checking all checksums occurs when requested via: - Linux: cat /proc/spl/kstat/zfs/chksum_bench - FreeBSD: sysctl kstat.zfs.misc.chksum_bench Reported by: Lahiru Gunathilake <gunathilakebllg@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Co-authored-by: Colin Percival <cperciva@tarsnap.com> Closes #17563 Closes #17560	2025-07-29 17:09:48 -07:00
Paul Dagnelie	fc885f308f	Don't use wrong weight when passivating group When we're passivating a metaslab group we start by passivating the metaslabs that have been activated for each of the allocators. To do that, we need to provide a weight. However, currently this erroneously always uses a segment-based weight, even if segment-based weighting is disabled. Use the normal weight function, which will decide which type of weight to use. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17566	2025-07-29 14:28:01 -07:00
Brian Behlendorf	cf146460c1	Default to zfs_bclone_wait_dirty=1 Update the default FICLONE and FICLONERANGE ioctl behavior to wait on dirty blocks. While this does remove some control from the application, in practice ZFS is better positioned to do the optimal thing and immediately force a TXG sync. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17455	2025-07-25 10:42:23 -04:00
Coleman Kane	5a9b9c7f87	linux: Fix out-of-src builds The linux kernel modules haven't been building successfully when the build occurs in a separate directory than the source code, which is a common build pattern in Linux. Was not able to determine the root cause, but the %.o targets in subdirectories are no longer being matched by the pattern targets in the Linux Kbuild system. This change fixes the issue by dynamically creating the missing ones inside our Kbuild. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #17517	2025-07-24 15:38:58 -07:00
Rob Norris	bf38c15071	everywhere: misc unnecessary var init/update These are all cases where we initialise or update a variable, and then never use it. None of them particularly matter, as the compiler should optimise them all away during dead store elimination, but some static analysers complain about them and they are extra work for casual readers to follow, so worth removing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:23:58 -07:00
Rob Norris	d2b9e66b88	vdev_raidz: asize/psize: remove unnecessary var initialisation It would have been optimised away anyway so it doesn't matter, but it does make things a little tougher to read. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:23:51 -07:00
Rob Norris	2755e2aa60	spa_activity_check: narrow scope of MMP vars They aren't used outside these very small blocks, and their initial values are never used at all. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:23:07 -07:00
Rob Norris	9292071565	linux/kmem: remove HAVE_ATOMIC64_T and kmem_alloc_used wrappers Seems like we haven't set it since the SPL was pulled into the main ZFS tree. In removing the define, I've taken the 64-bit version (ie the one that _hasn't_ been running since back then) because it looks like it's closer to the intended width by the way it's used. Since the macros are no longer needed as a selector, pull those too. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:08:07 -07:00
Rob Norris	1c483cf3d0	linux/kmem: remove long-obsolete __GFP compat flags Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:07:53 -07:00
Rob Norris	96d20d7d59	linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17551	2025-07-22 15:07:36 -07:00
shodanshok	a7a144e655	enforce arc_dnode_limit Linux kernel shrinker in the context of null/root memcg does not scan dentry and inode caches added by a task running in non-root memcg. For ZFS this means that dnode cache routinely overflows, evicting valuable meta/data and putting additional memory pressure on the system. This patch restores zfs_prune_aliases as fallback when the kernel shrinker does nothing, enabling zfs to actually free dnodes. Moreover, it (indirectly) calls arc_evict when dnode_size > dnode_limit. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17487 Closes #17542	2025-07-21 10:32:01 -07:00
Alexander Motin	be1e991a1a	Allow and prefer special vdevs as ZIL Before this change ZIL blocks were allocated only from normal or SLOG vdevs. In the typical situation where special vdevs are SSDs and normal vdevs are HDDs, this could cause weird inversions where data blocks are written to SSDs, but the ZIL blocks referencing them are written to HDDs. This change assumes that special vdevs typically have much better (or at least not worse) latency than normal ones, and so in the absence of SLOGs should store ZIL blocks. This means introducing a special embedded log allocation class, similar to the one for normal vdevs, and updating the allocation fallback order to: SLOG -> special embedded log -> special -> normal embedded log -> normal. The code tries to guess whether a data block is going to be written to a normal or a special vdev (it can not be done precisely before compression) and prefers indirect writes for blocks written to a special vdev to avoid double-write. For blocks that are going to be written to a normal vdev, the special vdev by default acts as a SLOG, reducing write latency at the cost of higher special vdev wear, but it is tunable via module parameter. This should allow HDD pools with a decent SSD as special vdev to work under synchronous workloads without requiring an additional SLOG SSD, impractical in many scenarios. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17505	2025-07-18 18:44:14 -07:00
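The fallback order named in the commit above can be pictured as a simple first-match walk. The following is only a sketch under stated assumptions: the class names are hypothetical labels taken from the commit message, and `pick_zil_class` and its `free_space` argument are invented for illustration. It is not the OpenZFS allocator.

```python
# Hypothetical labels for the allocation classes, in the fallback order
# quoted in the commit message:
# SLOG -> special embedded log -> special -> normal embedded log -> normal
FALLBACK_ORDER = [
    "slog",
    "special_embedded_log",
    "special",
    "normal_embedded_log",
    "normal",
]


def pick_zil_class(free_space):
    """Return the first class in fallback order that has free space.

    free_space maps a class label to its free bytes (illustrative only);
    a missing label means the pool has no vdev of that class.
    """
    for name in FALLBACK_ORDER:
        if free_space.get(name, 0) > 0:
            return name
    return None
```

With no SLOG present, a pool that has a special vdev would land on `special_embedded_log` or `special` before ever touching the normal classes, which is the behaviour the commit message describes.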
Chunwei Chen	2669b00f13	Define sops->free_inode() to prevent use-after-free during lookup On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can be dereferenced without refcounts and locks. For this reason, dentry and inode must only be freed after RCU grace period. However, zfs currently frees inode in zfs_inode_destroy synchronously and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on Linux 5.2 and after, if we define sops->free_inode(), the kernel will do call_rcu() for us. This issue may be triggered more easily with init_on_free=1 boot parameter: BUG: kernel NULL pointer dereference, address: 0000000000000020 RIP: 0010:selinux_inode_permission+0x10e/0x1c0 Call Trace: ? show_trace_log_lvl+0x1be/0x2d9 ? show_trace_log_lvl+0x1be/0x2d9 ? show_trace_log_lvl+0x1be/0x2d9 ? security_inode_permission+0x37/0x60 ? __die_body.cold+0x8/0xd ? no_context+0x113/0x220 ? exc_page_fault+0x6d/0x130 ? asm_exc_page_fault+0x1e/0x30 ? selinux_inode_permission+0x10e/0x1c0 security_inode_permission+0x37/0x60 link_path_walk.part.0.constprop.0+0xb5/0x360 ? path_init+0x27d/0x3c0 path_lookupat+0x3e/0x1a0 filename_lookup+0xc0/0x1d0 ? __check_object_size.part.0+0x123/0x150 ? strncpy_from_user+0x4e/0x130 ? getname_flags.part.0+0x4b/0x1c0 vfs_statx+0x72/0x120 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120 __do_sys_newlstat+0x39/0x70 ? __x64_sys_ioctl+0x8d/0xd0 do_syscall_64+0x30/0x40 entry_SYSCALL_64_after_hwframe+0x62/0xc7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Closes #17546	2025-07-18 08:45:13 -07:00
Alexander Motin	d7ab07dfb4	ZIL: Force writing of open LWB on suspend Under parallel workloads ZIL may delay writes of open LWBs that are not full enough. On suspend we do not expect anything new to appear since zil_get_commit_list() will not let it pass, only returning TXG number to wait for. But I suspect that waiting for the TXG commit without having the last LWB issued may not wait for its completion, resulting in panic described in #17509. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17521	2025-07-17 15:31:19 -07:00
Paul Dagnelie	c1e51c55f5	Correct weight recalculation of space-based metaslabs Currently, after a failed allocation, the metaslab code recalculates the weight for a metaslab. However, for space-based metaslabs, it uses the maximum free segment size instead of the normal weighting algorithm. This is presumably because the normal metaslab weight is (roughly) intended to estimate the size of the largest free segment, but it doesn't do that reliably at most fragmentation levels. This means that recalculated metaslabs are forced to a weight that isn't really using the same units as the rest of them, resulting in undesirable behaviors. We switch this to use the normal space-weighting function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Closes #17531	2025-07-16 10:20:57 -07:00
Rob Norris	d323fbf49c	FreeBSD: zfs_putpages: don't undirty pages until after write completes In syncing mode, zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework that case to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. The original version of this (`238eab7dc1`) copied the Linux model and did the cleanup in a ZIL callback for both async and sync. This was a mistake, as FreeBSD does not have a separate "busy for writeback" flag like Linux which keeps the page usable. The full sbusy flag locks the entire page out until the itx callback fires, which for async is after txg sync, which could be literal seconds in the future. For the async case, the data is already on the DMU and the in-memory ZIL, which is sufficient for async writeback, so the old method of logging it without a callback, undirtying the page and returning is more than sufficient and reclaims that lost performance. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17533	2025-07-15 15:58:15 -07:00
Mark Johnston	ee2a2d941a	Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes" This causes async putpages to leave the pages sbusied for a long time, which hurts concurrency. Revert for now until we have a better approach. This reverts commit `238eab7dc1`. Reported by: Ihor Antonov <ngor@hugpoint.tech> Discussed with: Rob Norris <rob.norris@klarasystems.com> References: freebsd/freebsd-src@738a9a7 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mark Johnston <markj@FreeBSD.org> Ported-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17533	2025-07-15 15:58:11 -07:00
Attila Fülöp	8de8e0df9f	objtool wrapper: use absolute path to call the wrapper Older kernel versions run make outside of the build directory. This works since all paths are absolute. Relative paths will fail in such a scenario. Use an absolute path to the objtool wrapper as well, since the relative path breaks the build on older kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #17541	2025-07-14 15:10:02 -07:00
Paul Dagnelie	a981cb69e4	Implement dynamic gang header sizes ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17004	2025-07-09 14:02:53 -07:00
rmacklem	4c2a7f85d5	FreeBSD: Add support for _PC_HAS_HIDDENSYSTEM In FreeBSD there is now a pathconf name _PC_HAS_HIDDENSYSTEM. This patch adds support for it to OpenZFS. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca> Closes #17518	2025-07-08 22:11:22 -04:00
Rob Norris	6af8db61b1	metaslab: don't pass whole zio to throttle reserve APIs They only need a couple of fields, and passing the whole thing just invites fiddling around inside it, like modifying flags, which then makes it much harder to understand the zio state from inside zio.c. We move the flag update to just after a successful throttle in zio.c. Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED Better describes what it means, and makes it look less like IO_IS_ALLOCATING, which means something different. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17508	2025-07-04 23:22:22 -04:00
Rob Norris	92d3b4ee2c	zio: rename `io_reexecute` as `io_post`; use it for the direct IO checksum error flag We're not supposed to modify someone else's io_flags, so we need another way to propagate DIO_CHKSUM_ERR. If we squint, we can see that io_reexecute is really just recording exceptional events that a parent (or its parents) will need to do something about. It just happens that the only things we've had historically are two forms of reexecution: now or later (suspend). So, rename it to io_post, as in, post-IO info/events/actions. And now we have a few spare bits for other conditions. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17507	2025-07-04 23:16:14 -04:00
Alexander Motin	4e92aee233	Relax special_small_blocks restrictions special_small_blocks is applied to blocks after compression, so it makes no sense to demand that its values be a power of 2. At most they could be a multiple of 512, but that would still buy us nothing, so let's allow them to be any value within SPA_MAXBLOCKSIZE. Also special_small_blocks does not really need to depend on the set recordsize, enabled pool features or presence of special vdev. At worst in any of those cases it will just do nothing, so we should not complicate users' lives with artificial limitations. While there, polish comments for recordsize and volblocksize. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17497	2025-07-02 11:11:37 -07:00
Olivier Certner	dee62e074a	spa: ZIO_TASKQ_ISSUE: Use symbolic priority This allows changing the meaning of priority differences in FreeBSD without requiring code changes in ZFS. This upstreams commit fd141584cf89d7d2 from FreeBSD src. Sponsored-by: The FreeBSD Foundation Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Olivier Certner <olce@FreeBSD.org> Closes #17489	2025-06-30 10:24:23 -04:00
Paul Dagnelie	69ee01aa4b	Fix bug caused by rounding in vdev_raidz_asize_to_psize When an allocation is happening on a raidz vdev, the number of sectors allocated is rounded up to a multiple of nparity + 1. If this results in the allocation spilling into an extra row, then the corresponding call to vdev_raidz_asize_to_psize will incorrectly assume that parity sectors were allocated for that spilled row, even though no data is stored there. If we determine that happened, we need to subtract out those extra sectors before performing the rest of the capacity calculation. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17490	2025-06-27 14:54:20 -04:00
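To make the rounding described above concrete, here is a small sketch of the widely documented raidz size calculation (data sectors, plus parity per row, rounded up to a multiple of nparity + 1). The function names and the brute-force inverse are assumptions written for illustration only; the inverse is deliberately naive and is not how vdev_raidz_asize_to_psize is implemented.

```python
def raidz_asize(psize, ashift, cols, nparity):
    """Allocated bytes for psize bytes of data on a raidz vdev (sketch)."""
    data = ((psize - 1) >> ashift) + 1           # data sectors, rounded up
    ndata = cols - nparity                       # data columns per row
    parity = nparity * ((data + ndata - 1) // ndata)  # parity per row used
    total = data + parity
    rem = total % (nparity + 1)
    if rem:
        total += (nparity + 1) - rem             # pad to multiple of nparity+1
    return total << ashift


def max_psize_for(asize, ashift, cols, nparity):
    """Largest psize whose allocation fits in asize (brute force, sketch)."""
    best = 0
    for sectors in range(1, (asize >> ashift) + 1):
        if raidz_asize(sectors << ashift, ashift, cols, nparity) <= asize:
            best = sectors
    return best << ashift
```

For a 5-wide raidz1 with 4 KiB sectors, four data sectors need one parity sector, and the total of five is padded to six. Six sectors no longer fit in one 5-column row, so the padding spills into a second row that holds no data, which is the situation the commit describes.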
Rob Norris	ea076d6921	vdev_raidz_asize_to_psize: return psize, not asize Since `246e588`, gang blocks written to raidz vdevs will write past the end of their allocation, corrupting themselves, other data, or both. The reason is simple - when allocating the gang children, we call vdev_psize_to_asize() to find out how much data we should load into the allocation we just did. vdev_raidz_asize_to_psize() had a bug; it computed the psize, but returned the original asize. The raidz layer dutifully writes that much out, into space beyond the end of the allocation. If there's existing data there, it gets overwritten, causing checksum errors when that data is read. Even if there's no data there (unlikely, given that gang blocks are in play at all), that area is not considered allocated, so can be allocated and overwritten later. The fix is simple: return the psize we just computed. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17488	2025-06-26 10:19:59 -04:00
Mark Johnston	0a2163d194	FreeBSD: Ensure that z_pflags is initialized for new znodes The field is subsequently accessed in zfs_mknode(), in zfs_inherit_projid(). The Linux implementation of zfs_create_fs() has this initialization already; there is no counterpart to zfs_create_share_dir() that I can see. Reported-by: KMSAN Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #17486	2025-06-25 12:07:17 -04:00
Paul Dagnelie	d461a67d0a	Ensure that gang_copies is always at least as large as copies As discussed in the comments of PR #17004, you can theoretically run into a case where a gang child has more copies than the gang header, which can lead to some odd accounting behavior (and even trip a VERIFY). While the accounting code could be changed to handle this, it fundamentally doesn't seem to make a lot of sense to allow this to happen. If the data is supposed to have a certain level of reliability, that isn't actually achieved unless the gang_copies property is set to match it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17484	2025-06-25 12:05:36 -04:00
Rob Norris	46a4075100	Linux 6.16: remove writepage and readahead_page Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17443	2025-06-23 15:51:02 -04:00
Brian Behlendorf	48ce292ea0	Clarify and restrict dmu_tx_assign() errors There are three possible cases where dmu_tx_assign() may encounter a fatal error. When there is a true lack of free space (ENOSPC), when there is a lack of quota space (EDQUOT), or when data required to perform the transaction cannot be read from disk (EIO). See the dmu_tx_check_ioerr() function for additional details on the motivation for checking for I/O errors early. Prior to this change dmu_tx_assign() would return the contents of tx->tx_err which covered a wide range of possible error codes (EIO, ECKSUM, ESRCH, etc). In practice, none of the callers could do anything useful with this level of detail and simply returned the error. Therefore, this change converts all tx->tx_err errors to EIO, adds ASSERTs to dmu_tx_assign() to cover the only possible errors, and clarifies the function comment to include EIO as a possible fatal error. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov> Closes #17463	2025-06-23 15:48:30 -04:00
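The error narrowing described in the commit above can be sketched as follows. This is a rough illustration, not the dmu_tx_assign() source: `assign_result` and its two parameters are hypothetical, with `tx_err` standing in for tx->tx_err and `space_err` for whatever the space and quota checks report.

```python
import errno


def assign_result(tx_err, space_err=0):
    """Collapse the possible fatal errors to ENOSPC, EDQUOT or EIO (sketch).

    tx_err: any error recorded while preparing the transaction
            (stands in for tx->tx_err; hypothetical parameter).
    space_err: result of the space/quota checks (hypothetical parameter).
    """
    if tx_err != 0:
        # Callers could not act on the finer detail, so every recorded
        # error is reported as a plain I/O error.
        return errno.EIO
    # Mirrors the idea of the added ASSERTs: nothing else is expected here.
    assert space_err in (0, errno.ENOSPC, errno.EDQUOT)
    return space_err
```

A caller then only has to distinguish three fatal outcomes, which is the simplification the commit message argues for.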
Alexander Motin	5e5253be84	FreeBSD: Wire projects support While FreeBSD itself does not support projects, there is no reason why it can't be controlled via `zfs project` and other subcommands. Most of the code is actually already there and just needs some revival and sync with Linux, plus enabling some tests not depending on the OS support. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17423	2025-06-19 14:39:20 -07:00
Paul Dagnelie	717213d431	Fix other nonrot bugs There are still a variety of bugs involving the vdev_nonrot property that will cause problems if you try to run the test suite with segment-based weighting disabled, and with other things in the weighting code. Parents' nonrot property needs to be updated when children are added. When vdevs are expanded and more metaslabs are added, the weights have to be recalculated (since the number of metaslabs is an input to the lba bias function). When opening, faulted or unopenable children should not be considered for whether a vdev is nonrot or not (since the nonrot property is determined during a successful open, this can cause false negatives). And draid spares need to have the nonrot property set correctly. Sponsored-by: Eshtek, creators of HexOS Sponsored-by: Klara, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17469	2025-06-19 09:25:58 -04:00
Attila Fülöp	6cf17f6538	Linux build: handle CONFIG_OBJTOOL_WERROR=y Linux 5.16 by default fails the build on objtool warnings. We have known and understood objtool warnings we can't fix without involving Linux maintainers. To work around this we introduce an objtool wrapper script which removes the `--Werror` flag. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #17456	2025-06-16 08:12:09 -07:00
Alexander Motin	bd27b75401	ZIL: Relax parallel write ZIOs processing ZIL introduced dependencies between its write ZIOs to permit flush defer, when we flush vdev caches only once all the write ZIOs have completed. But it was recently spotted that it serializes not only ZIO completions handling, but also their ready stage. It means the ZIO pipeline can't calculate checksums for the following ZIOs until all the previous ones are checksummed, even though it is not required. On systems where memory throughput of a single CPU core is limited, it creates a single-core CPU bottleneck, which is difficult to see due to the ZIO pipeline design with many taskqueue threads. While it would be great to bypass the ready stage waits, it would require changes to ZIO code, and I haven't found a clean way to do it. But I've noticed that we don't need any dependency between the write ZIOs if the previous one has some waiters, which means it won't defer any flushes and works as a barrier for the earlier ones. Bypassing it won't help large single-thread writes, since all the write ZIOs except the last in that case won't have waiters, and so will be dependent. But in that case the ZIO processing might not be a bottleneck, since there will be only one thread populating the write buffers, that will likely be the bottleneck. But bypassing the ZIO dependency on multi-threaded write workloads really allows them to scale beyond the checksumming throughput of one CPU core. My tests with writing 12 files on the same dataset on a pool with 4 striped NVMes as SLOGs from 12 threads with 1MB blocks on a system with Xeon Silver 4114 CPU show total throughput increase from 4.3GB/s to 8.5GB/s, increasing the SLOGs busy from ~30% to ~70%. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17458	2025-06-14 09:37:18 -04:00
Rob Norris	238eab7dc1	FreeBSD: zfs_putpages: don't undirty pages until after write completes zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-12 14:45:18 -07:00
Rob Norris	aa964ce61b	zfs_log_write: only put the callback on the last itx If a write is split across multiple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <markj@FreeBSD.org> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17445	2025-06-12 14:44:33 -07:00
Rob Norris	d1c88cbd4c	zpl_sync_fs: work around kernels that ignore sync_fs errors If the kernel will honour our error returns, use them. If not, fool it by setting a writeback error on the superblock, if available. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-12 14:42:32 -07:00
Rob Norris	e3f5e317e0	zfs_sync: return error when pool suspends If the pool is suspended, we'll just block in zil_commit(). If the system is shutting down, blocking wouldn't help anyone. So, we should keep this test for now, but at least return an error for anyone who is actually interested. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-12 14:42:27 -07:00
Rob Norris	52352dd748	zfs_sync: remove support for impossible scenarios The superblock pointer will always be set, as will z_log, so remove code supporting cases that can't occur (on Linux at least). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17420	2025-06-12 14:42:21 -07:00
Alexander Motin	e0ef4d2768	Improve block cloning transactions accounting Previous dmu_tx_count_clone() was broken, stating that cloning is similar to free. While they might be similar in some respects, cloning is not net-free. It will likely consume space and memory, and unlike free it will do it no matter whether the destination has the blocks or not (usually not, so previous code did nothing). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17431	2025-06-11 11:59:16 -07:00
Alexander Motin	66ec7fb269	Reduce zfs_dmu_offset_next_sync penalty Looking at txg_wait_synced(, 0) I've noticed that it always syncs 5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE. But in case of dmu_offset_next() we do not care about deferred frees. And even for concurrent TXGs we might not need to sync all 3 if the dnode was not dirtied in the last few TXGs. This patch makes dmu_offset_next() sync one TXG at a time until the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times. My tests with random simultaneous writes and seeks over many files on HDD pool show 7-14% performance increase. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17434	2025-06-11 11:50:49 -07:00
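The bounded wait described in the commit above boils down to a short loop. The sketch below only models the control flow: `FakeDnode`, `sync_until_clean` and the two callbacks are hypothetical stand-ins, and only the constant value of 3 for TXG_CONCURRENT_STATES comes from the commit message.

```python
TXG_CONCURRENT_STATES = 3  # value quoted in the commit message


class FakeDnode:
    """Toy dnode that stays dirty for a fixed number of TXG syncs."""

    def __init__(self, dirty_txgs):
        self.dirty_txgs = dirty_txgs

    def is_dirty(self):
        return self.dirty_txgs > 0

    def sync_one(self):
        # Pretend one TXG has synced; the dnode is dirty in one fewer TXG.
        self.dirty_txgs = max(0, self.dirty_txgs - 1)


def sync_until_clean(is_dirty, wait_one_txg, max_syncs=TXG_CONCURRENT_STATES):
    """Wait one TXG at a time until clean, at most max_syncs times."""
    syncs = 0
    while syncs < max_syncs and is_dirty():
        wait_one_txg()
        syncs += 1
    return syncs
```

A dnode that was dirtied only in the most recent TXG costs one wait instead of the five TXGs that the commit message attributes to the unconditional txg_wait_synced() call.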
Alexander Motin	4ae931aa93	Polish db_rwlock scope dbuf_verify(): Don't need the lock, since we only compare pointers. dbuf_findbp(): Don't need the lock, since aside of unneeded assert we only produce the pointer, but don't de-reference it. dnode_next_offset_level(): When working on top level indirection should lock dnode buffer's db_rwlock, since it is our parent. If dnode has no buffer, then it is meta-dnode or one of quotas and we should lock the dataset's ds_bp_rwlock instead. Reviewed-by: Alan Somers <asomers@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17441	2025-06-11 11:13:48 -07:00
Rob Norris	fbfda270d5	zcp_synctask: add zfs.sync.clone() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17426	2025-06-10 14:53:10 -07:00
Rob Norris	560e3170ef	dsl_dataset: rename dmu_objset_clone* to dsl_dataset_clone* And make its check and sync functions visible, so I can hook them up to zcp_synctask. Rename not strictly necessary, but it definitely looks more like a dsl_dataset thing than a dmu_objset thing, to the extent that those things even have a meaningful distinction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes #17426	2025-06-10 14:52:43 -07:00
Alexander Motin	ba227e2cc2	Make TX abort after assign safer It is not right, but there are a few examples where a TX is aborted after being assigned in case of error. To handle it better on production systems, add extra cleanup steps. While here, replace a couple of dmu_tx_abort() calls in simple cases. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17438	2025-06-10 09:30:06 -07:00
Alexander Motin	bcd0430236	Allow zero compression if dedup is enabled Having high-refcount dedup entries for zero blocks is inefficient when they could be recorded as holes instead. Normally, zero compression is not done if compression is disabled to not confuse naive benchmarks. But with dedup enabled, it is expected that the write will be skipped anyway, so we are just optimizing the way it is skipped. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17435	2025-06-10 09:28:14 -07:00
Mariusz Zaborski	46b82de618	scrub: generate scrub_finish event The `scn_min_txg` can now be used not only with resilver. Instead of checking `scn_min_txg` to determine whether it’s a resilver or a scrub, simply check which function is defined. Thanks to this change, a scrub_finish event is generated when performing a scrub from the saved txg. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Closes #17432	2025-06-06 22:43:10 -04:00
Rob Norris	af7d609592	zpl: handle suspend from two remaining calls to `txg_wait_synced()` * zfs_link: allow tempfile sync to fail if pool suspends `4653e2f7d3` (#17355) allows dmu_tx_assign() to fail if the pool suspends when failmode=continue, but zfs_link() can fall back to txg_wait_synced() if it has to wait for a tempfile to be fully created before continuing, which will block if the pool suspends. Handle this by requesting an error return if the pool suspends when failmode=continue, and if that happens, return EIO. * zfs_clone_range: allow dirty wait to fail if pool suspends `4653e2f7d3` (#17355) allows dmu_tx_assign() to fail if the pool suspends when failmode=continue, but zfs_clone_range() can fall back to txg_wait_synced() if it has to wait for a dirty block to be written out, which will block if the pool suspends. Handle this by requesting an error return if the pool suspends when failmode=continue, and if that happens, return EIO. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17413	2025-06-05 15:38:26 -04:00
Attila Fülöp	b96f1a4b1f	Linux build: silence objtool warnings After #17401 the Linux build produces some stack related warnings. Silence them with the `STACK_FRAME_NON_STANDARD` macro. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Attila Fülöp <attila@fueloep.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #17410	2025-06-04 17:40:09 -07:00
Alexander Motin	b7f919d228	Relax zfs_vnops_read_chunk_size limitations It makes no sense to limit read size below the block size, since DMU will anyway consume resources for the whole block, while the current zfs_vnops_read_chunk_size is only 1MB, which is smaller than the maximum block size of 16MB. Plus in case of misaligned Uncached I/O the buffer may get evicted between the chunks, requiring repeating I/Os. On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB. It allows to less depend on speculative prefetcher if application requests specific size, first not waiting for prefetcher to start and later not prefetching more than needed. Also while there, we don't need to align reads to the chunk size, but only to a block size, which is smaller and so more forgiving. My profiles show ~4% of CPU time saving when reading 16MB blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17415	2025-06-04 11:24:15 -04:00
Alexander Motin	68817d28c5	Include class name into struct metaslab_class With increasing number of metaslab classes it can be helpful for debugging to know what we are looking at. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17409	2025-06-03 11:12:59 -04:00
Alexander Motin	108562344c	Improve allocation fallback handling Before this change in case of any allocation error ZFS always fell back to normal class. But with more of different classes available we might want more sophisticated logic. For example, it makes sense to fall back from dedup first to special class (if it is allowed to put DDT there) and only then to normal, since in a pool with dedup and special classes populated normal class likely has performance characteristics unsuitable for dedup. This change implements general mechanism where fallback order is controlled by the same spa_preferred_class() as the initial class selection. And as first application it implements the mentioned dedup->special->normal fallbacks. I have more plans for it later. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17391	2025-05-31 19:12:16 -04:00
Fedor Uporov	e0edfcbd4e	ZVOL: Make zvol_volmode module parameter platform-independent The module parameter name was not changed in FreeBSD sysctls list: 'vfs.zfs.vol.mode'. Also, on Linux side the name is: /sys/module/zfs/parameters/zvol_volmode. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17386	2025-05-31 19:09:50 -04:00
Fedor Uporov	e1677d9ee1	ZVOL: Make zvol_prefetch_bytes module parameter platform-independent The module parameter now is represented in FreeBSD sysctls list with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072, same as on Linux side. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17385	2025-05-31 09:58:54 -04:00
Rob Norris	e8e602d987	zio_add_child: collapse into a single function The child locking difference is simple enough to handle with a boolean. The actual work is more involved, and it's easy to forget to change things in both places when experimenting. Just collapse them. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17382	2025-05-30 21:18:10 -04:00
Alexander Motin	2d33c8edb6	Make rewrite use Uncached I/O Rewrite is a one-time/rare bulk administrative operation, which should minimally affect payload caching. Plus some avoided memory copies in its data path allow to significantly increase its speed. My tests show reduction of time to rewrite 28GB of uncompressed files on NVMe pool from 17 to 9 seconds and minimal ARC usage. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17407	2025-05-30 21:13:49 -04:00
Fedor Uporov	a38376b37a	Rename zvol kernel module parameters sysctls on FreeBSD side Make 'zvol_threads', 'zvol_num_taskqs' and 'zvol_request_sync' names compatible with FreeBSD sysctl naming convention. Now the sysctls have the following form: $ sysctl vfs.zfs.vol.threads vfs.zfs.vol.threads: 0 $ sysctl vfs.zfs.vol.num_taskqs vfs.zfs.vol.num_taskqs: 0 $ sysctl vfs.zfs.vol.request_sync vfs.zfs.vol.request_sync: 0 Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17406	2025-05-30 13:41:15 -07:00
Alexander Motin	008c9666ef	Set spa_final_txg in spa_unload() I've noticed that after some dedup tests system reboot ends up in assertion about ms_defer tree not free. It seems to be caused by DDT flushing still freeing some blocks while ZFS trying to reach a final steady state due to spa_final_txg, while being set by spa_export_common() on pool export, is not set when spa_unload() is called by spa_evict_all() on system reboot/shutdown. Setting spa_final_txg in spa_unload() fixes this issue. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17395	2025-05-30 14:44:45 -04:00
Ameer Hamza	b3b3cd1e4f	vdev: skip faulting disks pending removal This patch fixes a race where vdev_remove_wanted may be set after probe initiation, which could otherwise trigger redundant fault and removal. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17400	2025-05-30 09:14:37 -07:00
Rob Norris	5764e218ba	vdev_disk: remove classic IO submission Since it was disabled for 2.3, there's been no confirmed sightings of strange IO errors, misalignments or related shenanigans. Absence of evidence and all that, but I'd rather fix bugs in the new code than in the old. "It isn't hubris until he's failed." -- Chrisjen Avasarala Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17399	2025-05-30 10:31:02 -04:00
Rob Norris	44e3266894	events: include zio type in IO error reports Usually the IO type can be inferred from the other fields (in particular, priority and flags), but sometimes it's not easy to see. This is just another little debug helper. May 27 2025 00:54:54.024110493 ereport.fs.zfs.data class = "ereport.fs.zfs.data" ena = 0x1f5ecfae600801 ... zio_delta = 0x0 zio_type = 0x2 [WRITE] zio_priority = 0x3 [ASYNC_WRITE] zio_objset = 0x0 Document zio_type and zio_priority. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17381	2025-05-30 10:29:29 -04:00
Rob Norris	0c94d3838d	linux/zvol_os: don't try to set disk ops if alloc fails If the kernel fails to allocate the gendisk, zvo_disk will be NULL, and dereferencing it will explode. So don't do that. Sponsored-by: Klara, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17396	2025-05-30 10:25:09 -04:00
Attila Fülöp	3084336ae4	Linux build: always use objtool We silence `objtool` warnings on some object files using `OBJECT_FILES_NON_STANDARD_some_file.o`. Nowadays `objtool` is needed for CPU vulnerability mitigations and a lot more functionality so its use is desirable. Just remove the `OBJECT_FILES_NON_STANDARD` definitions. A follow-up commit is needed to make the offending files standard and address the compile time warnings. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #17401 Closes #17364	2025-05-29 18:04:20 -07:00
Fedor Uporov	3dfa98d013	ZVOL: Make zvol_inhibit_dev module parameter platform-independent The module parameter now is represented in FreeBSD sysctls list with name: 'vfs.zfs.vol.inhibit_dev'. The default value is '0', same as on Linux side. Sponsored-by: vStack, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rob Norris <rob.norris@klarasystems.com> Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com> Closes #17384	2025-05-29 09:37:41 -04:00
Alexander Motin	fa697b94e6	FreeBSD: Add posix_fadvise(POSIX_FADV_WILLNEED) support As commit `320f0c6` did for Linux, connect POSIX_FADV_WILLNEED up to dmu_prefetch() on FreeBSD. While there, fix portability problems in tests/functional/fadvise. 1. Instead of relying on the numerical values of POSIX_FADV_XXX macros, accept macro names as arguments to the file_fadvise program. (The numbers happen to match on Linux and FreeBSD, but future systems may vary and it seems a little strange/raw to count on that.) 2. For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD VFS currently (perhaps something that should be investigated in FreeBSD). Since on Linux we're treating SEQUENTIAL and WILLNEED the same, it doesn't really matter which one we use, so switch the test over to WILLNEED to exercise the new prefetch code on both OSes the same way. Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org> Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Thomas Munro <tmunro@FreeBSD.org> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Closes #17379	2025-05-29 09:34:07 -04:00
Rob Norris	00360efa35	tunables: fix spelling Three occurences with an 'e', and all of them mine. Maybe it's an British thing? Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	589d99171f	tunables: use Linux ullong param ops for u64 Since 3.17 Linux has provided param ops for 64-bit ints, so we don't need to use our own anymore. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	6e7e7ea7ef	tunables: remove support for s64 tunables Nothing uses them now. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	58235f52af	tunables: remove direct use of module_param_cb The use for spl_taskq_kick was the only use, and the comment that module_param_call is obsolete is no longer true - it's still very much used even in recent kernels. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00
Rob Norris	7b183f1918	tunables: remove FreeBSD compat macros for Linux module params Nothing in any FreeBSD code uses them. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Pavel Snajdr <snajpa@snajpa.net> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #17377	2025-05-28 16:50:22 -07:00

5049 Commits