Commit Graph

5049 Commits

Author SHA1 Message Date
Rob Norris
ef4058fcdc FreeBSD: zfs_putpage: handle page writeback errors
Page writeback is considered completed when the associated itx callback
completes. A syncing writeback will receive the error in its callback
directly, but an in-flight async writeback that was promoted to sync by
the ZIL may also receive an error.

Writeback errors, even syncing writeback errors, are not especially
serious on their own, because the error will ultimately be returned to
the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg
msync()) or to zfs_putpage() itself for a syncing (VM_PAGER_PUT_SYNC)
writeback.

The only thing we need to do when a page writeback fails is to skip
marking the page clean ("undirty"), since we don't know if it made it to
disk yet. This will ensure that it gets written out again in the future,
either some scheduled async writeback or another explicit syncing call.

On the other side, we need to make sure that if a syncing op arrives,
any changes on dirty pages are written back to the DMU and/or the ZIL
first. We do this by starting an async writeback on the vnode cache
first, so any dirty data has been recorded in the ZIL, ready for the
followup zfs_sync()->zil_commit() to find.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:44 -07:00
Rob Norris
3d6ee9a68c Linux: zfs_putpage: handle page writeback errors
Page writeback is considered completed when the associated itx callback
completes. A syncing writeback will receive the error in its callback
directly, but an in-flight async writeback that was promoted to sync by
the ZIL may also receive an error.

Writeback errors, even syncing writeback errors, are not especially
serious on their own, because the error will ultimately be returned to
the zil_commit() caller, either zfs_fsync() for an explicit sync op (eg
msync()) or to zfs_putpage() itself for a syncing (WB_SYNC_ALL) writeback
(kernel housekeeping or sync_file_range(SYNC_FILE_RANGE_WAIT_AFTER).

The only thing we need to do when a page writeback fails is to re-mark
the page dirty, since we don't know if it made it to disk yet. This will
ensure that it gets written out again in the future, either some
scheduled async writeback or another explicit syncing call.

On the other side, we need to make sure that if a syncing op arrives,
any changes on dirty pages are written back to the DMU and/or the ZIL
first. We do this by starting an _async_ (WB_SYNC_NONE) writeback on the
file mapping at the start of the sync op (fsync(), msync(), etc). An
async op will get an async itx created and logged, ready for the
followup zfs_fsync()->zil_commit() to find, while avoiding a zil_commit()
call for every page in the range.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:38 -07:00
Rob Norris
391e85f519 ZIL: add zil_commit_flags() to make honouring failmode= optional
The vast majority of calls to zil_commit() follow VFS ops, and should
honour the failmode= setting - either wait for sync, or return error.
Some calls however are part of a larger syncing op, and shouldn't ever
block if something goes wrong.

To allow this, we introduce zil_commit_flags(), with a flag
ZIL_COMMIT_FAILMODE to indicate whether or not the pool failmode should
be honoured. zil_commit() is now a wrapper that always sets this flag,
but any caller wanting a different behaviour can request ZIL_COMMIT_NOW
instead to have the call return failure if the pool suspends, regardless
of the failmode= setting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:33 -07:00
Rob Norris
72602f6ad9 ZIL: "crash" the ZIL if the pool suspends during fallback
If the ZIL runs into trouble, it calls txg_wait_synced(), which blocks
on suspend. We want it to not block on suspend, instead returning an
error. On the surface, this is simple: change all calls to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), and then thread the error
return back to the zil_commit() caller.

Handling suspension means returning an error to all commit waiters. This
is relatively straightforward, as zil_commit_waiter_t already has
zcw_zio_error to hold the write IO error, which signals a fallback to
txg_wait_synced_flags(TXG_WAIT_SUSPEND), which will fail, and so the
waiter can now return an error from zil_commit().

However, commit waiters are normally signalled when their associated
write (LWB) completes. If the pool has suspended, those IOs may not
return for some time, or maybe not at all. We still want to signal those
waiters so they can return from zil_commit(). We have a list of those
in-flight LWBs on zl_lwb_list, so we can run through those, detach them
and signal them. The LWB itself is still in-flight, but no longer has
attached waiters, so when it returns there will be nothing to do.

(As an aside, ITXs can also supply completion callbacks, which are
called when they are destroyed. These are directly connected to LWBs
though, so are passed the error code and destroyed there too).

At this point, all ZIL waiters have been ejected, so we only have to
consider the internal state. We potentially still have ITXs that have
not been committed, LWBs still open, and LWBs in-flight. The on-disk ZIL
is in an unknown state; some writes may have been written but not
returned to us. We really can't rely on any of it; the best thing to do
is abandon it entirely and start over when the pool returns to service.
But, since we may have IO out that won't return until the pool resumes,
we need something for it to return to.

The simplest solution I could find, implemented here, is to "crash" the
ZIL: accept no new ITXs, make no further updates, and let it empty out
on its normal schedule, that is, as txgs complete and zil_sync() and
zil_clean() are called. We set a "restart txg" to three txgs in the
future (syncing + TXG_CONCURRENT_STATES), at which point all the
internal state will have been cleared out, and the ZIL can resume
operation (handled at the top of zil_clean()).

This commit adds zil_crash(), which handles all of the above:
 - sets the restart txg
 - capture and signal all waiters
 - zero the header

zil_crash() is called when txg_wait_synced_flags(TXG_WAIT_SUSPEND)
returns because the pool suspended (ESHUTDOWN).

The rest of the commit is just threading the errors through, and related
housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:26 -07:00
Rob Norris
99a5f5d1ba ZIL: pass commit errors back to ITX callbacks
ITX callbacks are used to signal that something can be cleaned up after
a itx is committed. Presently that's only used when syncing out mapped
pages (msync()) to mark dirty pages clean.

This extends the callback interface so it can be passed an error, and
take a different cleanup action if necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:20 -07:00
Rob Norris
967b15b888 ZIL: allow zil_commit() to fail with error
This changes zil_commit() to have an int return, and updates all callers
to check it. There are no corresponding internal changes yet; it will
always return 0.

Since zil_commit() is an indication that the caller _really_ wants the
associated data to be durability stored, I've annotated it with the
__warn_unused_result__ compiler attribute (via __must_check), to emit a
warning if it's ever ussd without doing something with the return code.
I hope this will mean we never misuse it in the future.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17398
2025-08-08 16:43:09 -07:00
Rob Norris
b270663e8a
linux/zvol_os: fix crash with blk-mq on Linux 4.19
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
03987f71e3 (#16069) added a workaround to get the blk-mq hardware
context for older kernels that don't cache it in the struct request.
However, this workaround appears to be incomplete.

In 4.19, the rq data context is optional. If its not initialised, then
the cached rq->cpu will be -1, and so using it to index into mq_map
causes a crash.

Given that the upstream 4.19 is now in extended LTS and rarely seen,
RHEL8 4.18+ has long carried "modern" blk-mq support, and the cached
hardware context has been available since 5.1, I'm not going to huge
lengths to get queue selection correct for the very few people that are
likely to feel it. To that end, we simply call raw_smp_processor_id() to
get a valid CPU id and use that instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17597
2025-08-08 09:39:14 -07:00
Rob Norris
82d6f7b047 Prefer VERIFY0P(n) over VERIFY3P(n, ==, NULL)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:42 -07:00
Rob Norris
f7bdd84328 Prefer VERIFY0P(n) over VERIFY(n == NULL)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:37 -07:00
Rob Norris
611b95da18 Prefer VERIFY0(n) over VERIFY3S(n, ==, 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:32 -07:00
Rob Norris
5c7df3bcac Prefer VERIFY0(n) over VERIFY3U(n, ==, 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:41:25 -07:00
Rob Norris
c39e076f23 Prefer VERIFY0(n) over VERIFY(n == 0)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17591
2025-08-07 11:40:59 -07:00
Rob Norris
e44e51f28d zvol_task_report_status: gate behind ZFS_DEBUG
dprintf() is a no-op in production builds, giving a compile warning. So,
refactor it a little to keep all the strings inside the function, and
then make the function a no-op when ZFS_DEBUG is not set.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:36:15 -07:00
Rob Norris
e6eb03a991 zvol_check_volblocksize: fix spa ref leak
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:36:09 -07:00
Rob Norris
3e671f2353 zvol: remove void return casts on void-returning functions
Casting unused returns to (void) is already of dubious value, but it's
entirely meaningless on functions that are defined as void return.
Remove the clutter.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Railway Corporation
Closes #17596
2025-08-07 11:34:20 -07:00
Alek P
3e004369f7
Removed unused zio_decompress_fail_fraction variable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alek Pinchuk <alek.pinchuk@connectwise.com>
Closes #17599
2025-08-06 17:10:03 -07:00
Alexander Motin
60f714e6e2 Implement physical rewrites
Based on previous commit this implements `zfs rewrite -P` flag,
making ZFS to keep blocks logical birth times while rewriting
files.  It should exclude the rewritten blocks from incremental
sends, snapshot diffs, etc.  Snapshots space usage same time will
reflect the additional space usage from newly allocated blocks.

Since this begins to use new "rewrite" flag in the block pointers,
this commit introduces a new read-compatible per-dataset feature
physical_rewrite.  It must be enabled for the command to not fail,
it is activated on first use and deactivated on deletion of the
last affected dataset.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:56 -07:00
Alexander Motin
4ae8bf406b Allow physical rewrite without logical
During regular block writes ZFS sets both logical and physical
birth times equal to the current TXG.  During dedup and block
cloning logical birth time is still set to the current TXG, but
physical may be copied from the original block that was used.
This represents the fact that logically user data has changed,
but the physically it is the same old block.

But block rewrite introduces a new situation, when block is not
changed logically, but stored in a different place of the pool.
From ARC, scrub and some other perspectives this is a new block,
but for example for user applications or incremental replication
it is not.  Somewhat similar thing happen during remap phase of
device removal, but in that case space blocks are still acounted
as allocated at their logical birth times.

This patch introduces a new "rewrite" flag in the block pointer
structure, allowing to differentiate physical rewrite (when the
block is actually reallocated at the physical birth time) from
the device reval case (when the logical birth time is used).

The new functionality is not used at this point, and the only
expected change is that error log is now kept in terms of physical
physical birth times, rather than logical, since if a block with
logged error was somehow rewritten, then the previous error does
not matter any more.

This change also introduces a new TRAVERSE_LOGICAL flag to the
traverse code, allowing zfs send, redact and diff to work in
context of logical birth times, ignoring physical-only rewrites.
It also changes nothing at this point due to lack of those writes,
but they will come in a following patch.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17565
2025-08-06 10:36:07 -07:00
Mariusz Zaborski
894edd084e
Add TXG timestamp database
This feature enables tracking of when TXGs are committed to disk,
providing an estimated timestamp for each TXG.

With this information, it becomes possible to perform scrubs based
on specific date ranges, improving the granularity of data
management and recovery operations.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16853
2025-08-06 10:31:21 -07:00
Rob Norris
c3496b5cc6 Linux: zfs_putpage: document (and fix!) confusing sync/commit modes
The structure of zfs_putpage() and its callers is tricky to follow.
There's a lot more we could do to improve it, but at least now we have
some description of one of the trickier bits.

Writing this exposed a very subtle bug: most async pages pushed out
through zpl_putpages() would go to the ZIL with commit=false, which can
yield a less-efficient write policy. So this commit updates that too.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-06 09:55:58 -07:00
Rob Norris
fb7a8503bc Linux: zfs_putpage: complete async page writeback immediately
For async page writeback, we do not need to wait for the page to be on
disk before returning to the caller; it's enough that the data from the
dirty page be on the DMU and in the in-memory ZIL, just like any other
write.

So, if this is not a syncing write, don't add a callback to the itx, and
instead just unlock the page immediately.

(This is effectively the same concept used for FreeBSD in d323fbf49c).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
Closes #14290
2025-08-06 09:55:50 -07:00
Rob Norris
a18c9edda6 Linux: sync: remove async/sync accounting
All this machinery is there to try to understand when there an async
writeback waiting to complete because the intent log callbacks are still
outstanding, and force them with a timely zil_commit(). The next commit
fixes this properly, so there's no need for all this extra housekeeping.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17584
2025-08-06 09:54:30 -07:00
Paul Dagnelie
31c4fa93bb Fix dynamic gang block headers on raidz and mirror devices
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17587
2025-08-06 09:50:58 -07:00
Fedor Uporov
0b6fd024a7
ZVOL: Unify zvol minors operations and improve error handling
Now zvol minors creation logic is passed thru spa_zvol_taskq, like it
is doing for remove/rename zvol minors functions. Appropriate
zvol minors creation functions are refactored:
- The zvol_create_minor()/zvol_minors_create_recursive() were removed.
- The single zvol_create_minors() is added instead.

Also, it become possible to collect zvol minors subtasks status, to
detect, if some zvol minor subtask is failed in the subtasks chain.
The appropriate message is reported to zfs_dbgmsg buffer in this case.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17575
2025-08-06 10:10:52 -04:00
khoang98
0f8a1105ee
Skip dbuf_evict_one() from dbuf_evict_notify() for reclaim thread
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Kaitlin Hoang <kthoang@amazon.com>
Closes #17561
2025-08-01 16:47:41 -07:00
Fedor Uporov
92da9e0e93
ZVOL: Implement zvol_alloc() function on FreeBSD side
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Implement zvol_alloc() function on FreeBSD side to increase code base
compatibility with Linux. Also, fix issue with late returning in case
if volmode=none.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17482
2025-07-31 11:02:09 -04:00
Igor Ostapenko
cb5e7e097d
range_tree: Provide more debug details upon unexpected add/remove
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Igor Ostapenko <igor.ostapenko@klarasystems.com>
Closes #17581
2025-07-31 10:44:42 -04:00
rmacklem
2957eabbef
Add support for FreeBSD's Solaris style extended attribute interface
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface
for extended attributes.  This patch wires this interface into the
FreeBSD ZFS port, since this style of extended attributes is supported
by OpenZFS internally when the "xattr" property is set to "dir".

Some specific changes:
LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR
for calls to zfs_zaccess().
V_NAMEDATTR indicates that the access checking does need to be done
for FreeBSD.

The access checking code for extended attributes was copy/pasted from
the Linux port into zfs_zaccess() in the FreeBSD port.

Most of the changes are in zfs_freebsd_lookup() and
zfs_freebsd_create().
The semantics of these functions should remain unchanged unless named
attributes are being manipulated.

All the code changes are enabled for __FreeBSD_version 1500040 and
newer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes #17540
2025-07-30 09:49:43 -07:00
Fedor Uporov
dea0fc969b
ZVOL: Return early, if volmode is ZFS_VOLMODE_NONE on FreeBSD side
Return from zvol_os_create_minor() function immediately after
dsl_prop_get_integer() call if volmode property value is set to
'none', like it is doing on Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17405
2025-07-30 09:46:34 -07:00
Alexander Motin
f70c85086b
BRT: Fix ZAP entry endianness
During original block cloning implementation a mistake was made,
making BRT ZAP entries an array of 8 1-byte entries instead of 1
entry of 8 bytes. This makes the pools non-endian-safe.

This commit introduces a new read-compatible pool feature
"com.truenas:block_cloning_endian", fixing the endianness issue
for new pools while maintaining compatibility with existing ones.

The feature is automatically activated when creating the first BRT
ZAP (ensuring we don't activate it on pools that already have BRT
entries in the old format).  When active, BRT entries are stored
as single 8-byte values.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <alexander.motin@TrueNAS.com>
Closes #17572
2025-07-30 09:42:47 -07:00
Tino Reichardt
10a78e2647
Faster checksum benchmark on system boot
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
While booting, only the needed 256KiB benchmarks are done now.

The delay for checking all checksums occurs when requested via:
- Linux: cat /proc/spl/kstat/zfs/chksum_bench
- FreeBSD: sysctl kstat.zfs.misc.chksum_bench

Reported by: Lahiru Gunathilake <gunathilakebllg@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Colin Percival <cperciva@tarsnap.com>
Closes #17563
Closes #17560
2025-07-29 17:09:48 -07:00
Paul Dagnelie
fc885f308f
Don't use wrong weight when passivating group
When we're passivating a metaslab group we start by passivating the 
metaslabs that have been activated for each of the allocators.  To do 
that, we need to provide a weight. However, currently this erroneously 
always uses a segment-based weight, even if segment-based weighting is 
disabled.

Use the normal weight function, which will decide which type of weight 
to use.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17566
2025-07-29 14:28:01 -07:00
Brian Behlendorf
cf146460c1
Default to zfs_bclone_wait_dirty=1
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to the optimial
thing and immediately force a TXG sync.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17455
2025-07-25 10:42:23 -04:00
Coleman Kane
5a9b9c7f87
linux: Fix out-of-src builds
The linux kernel modules haven't been building successfully when the
build occurs in a separate directory than the source code, which is a
common build pattern in Linux. Was not able to determine the root cause,
but the %.o targets in subdirectories are no longer being matched by the
pattern targets in the Linux Kbuild system. This change fixes the issue
by dynamically creating the missing ones inside our Kbuild.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Coleman Kane <ckane@colemankane.org>
Closes #17517
2025-07-24 15:38:58 -07:00
Rob Norris
bf38c15071 everywhere: misc unnecessary var init/update
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
These are all cases where we initialise or update a variable, and then
never use it. None of them particularly matter, as the compiler should
optimise them all away during dead store elimination, but some static
analysers complain about them and they are extra work for casual readers
to follow, so worth removing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:58 -07:00
Rob Norris
d2b9e66b88 vdev_raidz: asize/psize: remove unnecessary var initialisation
It would have been optimised away anyway so it doesn't matter, but it
does make things a little tougher to read.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:51 -07:00
Rob Norris
2755e2aa60 spa_activity_check: narrow scope of MMP vars
They aren't used outside these very small blocks, and their initial
values are never used at all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:23:07 -07:00
Rob Norris
9292071565 linux/kmem: remove HAVE_ATOMIC64_T and kmem_alloc_used wrappers
Seems like we haven't set it since the SPL was pulled into the main ZFS
tree. In removing the define, I've taken the 64-bit version (ie the one
that _hasn't_ been running since back then) because it looks like its
closer to the intended width by the way its used.

Since the macros ar eno longer needed as a selector, pull those too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:08:07 -07:00
Rob Norris
1c483cf3d0 linux/kmem: remove long-obsolete __GFP compat flags
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:07:53 -07:00
Rob Norris
96d20d7d59 linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
2025-07-22 15:07:36 -07:00
shodanshok
a7a144e655
enforce arc_dnode_limit
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Linux kernel shrinker in the context of null/root memcg does not scan
dentry and inode caches added by a task running in non-root memcg. For
ZFS this means that dnode cache routinely overflows, evicting valuable
meta/data and putting additional memory pressure on the system.

This patch restores zfs_prune_aliases as fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes. Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17487
Closes #17542
2025-07-21 10:32:01 -07:00
Alexander Motin
be1e991a1a
Allow and prefer special vdevs as ZIL
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Before this change ZIL blocks were allocated only from normal or
SLOG vdevs.  In typical situation when special vdevs are SSDs and
normal are HDDs it could cause weird inversions when data blocks
are written to SSDs, but ZIL referencing them to HDDs.

This change assumes that special vdevs typically have much better
(or at least not worse) latency than normal, and so in absence of
SLOGs should store ZIL blocks.  It means similar to normal vdevs
introduction of special embedded log allocation class and updating
the allocation fallback order to: SLOG -> special embedded log ->
special -> normal embedded log -> normal.

The code tries to guess whether data block is going to be written
to normal or special vdev (it can not be done precisely before
compression) and prefer indirect writes for blocks written to a
special vdev to avoid double-write.  For blocks that are going to
be written to normal vdev, special vdev by default plays as SLOG,
reducing write latency by the cost of higher special vdev wear,
but it is tunable via module parameter.

This should allow HDD pools with decent SSD as special vdev to
work under synchronous workloads without requiring additional
SLOG SSD, impractical in many scenarios.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17505
2025-07-18 18:44:14 -07:00
Chunwei Chen
2669b00f13
Define sops->free_inode() to prevent use-after-free during lookup
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.

However, zfs currently frees inode in zfs_inode_destroy synchronously
and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on
Linux 5.2 and after, if we define sops->free_inode(), the kernel will do
call_rcu() for us.

This issue may be triggered more easily with init_on_free=1 boot
parameter:

BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #17546
2025-07-18 08:45:13 -07:00
Alexander Motin
d7ab07dfb4
ZIL: Force writing of open LWB on suspend
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Under parallel workloads ZIL may delay writes of open LWBs that
are not full enough.  On suspend we do not expect anything new to
appear since zil_get_commit_list() will not let it pass, only
returning TXG number to wait for.  But I suspect that waiting for
the TXG commit without having the last LWB issued may not wait for
its completion, resulting in panic described in #17509.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17521
2025-07-17 15:31:19 -07:00
Paul Dagnelie
c1e51c55f5
Correct weight recalculation of space-based metaslabs
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes #17531
2025-07-16 10:20:57 -07:00
Rob Norris
d323fbf49c FreeBSD: zfs_putpages: don't undirty pages until after write completes
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.

So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.

The original version of this (238eab7dc1) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.

For the async case, the data is already on the DMU and the in-memory
ZIL, which is sufficient for async writeback, so the old method of
logging it without a callback, undirtying the page and returning is more
than sufficient and reclaims that lost performance.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-07-15 15:58:15 -07:00
Mark Johnston
ee2a2d941a Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes"
This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency.  Revert for now until we have a better
approach.

This reverts commit 238eab7dc1.

Reported by:    Ihor Antonov <ngor@hugpoint.tech>
Discussed with: Rob Norris <rob.norris@klarasystems.com>

References: freebsd/freebsd-src@738a9a7
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Ported-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17533
2025-07-15 15:58:11 -07:00
Attila Fülöp
8de8e0df9f
objtool wrapper: use absolute path to call the wrapper
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Older kernel versions run make outside of the build directory. This
works since all paths are absolute. Relative paths will fail in such
a scenario.

Use an absolute path to the objtool wrapper as well, since the
relative path breaks the build on older kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #17541
2025-07-14 15:10:02 -07:00
Paul Dagnelie
a981cb69e4 Implement dynamic gang header sizes
ZFS gang block headers are currently fixed at 512 bytes. This is
increasingly wasteful in the era of larger disk sector sizes. This PR
allows any size allocation to work as a gang header. It also contains
supporting changes to ZDB to make gang headers easier to work with.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17004
2025-07-09 14:02:53 -07:00
rmacklem
4c2a7f85d5
FreeBSD: Add support for _PC_HAS_HIDDENSYSTEM
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
In FreeBSD there is now a pathconf name _PC_HAS_HIDDENSYSTEM.
This patch adds support for it to OpenZFS.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closes #17518
2025-07-08 22:11:22 -04:00
Rob Norris
6af8db61b1
metaslab: don't pass whole zio to throttle reserve APIs
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
They only need a couple of fields, and passing the whole thing just
invites fiddling around inside it, like modifying flags, which then
makes it much harder to understand the zio state from inside zio.c.

We move the flag update to just after a successful throttle in zio.c.

Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED
Better describes what it means, and makes it look less like
IO_IS_ALLOCATING, which means something different.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17508
2025-07-04 23:22:22 -04:00
Rob Norris
92d3b4ee2c
zio: rename io_reexecute as io_post; use it for the direct IO checksum error flag
We're not supposed to modify someone else's io_flags, so we need another
way to propagate DIO_CHKSUM_ERR.

If we squint, we can see that io_reexecute is really just recording
exceptional events that a parent (or its parents) will need to do
something about. It just happens that the only things we've had
historically are two forms of reexecution: now or later (suspend).

So, rename it to io_post, as in, post-IO info/events/actions. And now we
have a few spare bits for other conditions.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17507
2025-07-04 23:16:14 -04:00
Alexander Motin
4e92aee233
Relax special_small_blocks restrictions
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
special_small_blocks is applied to blocks after compression, so it
makes no sense to demand its values to be power of 2.  At most
they could be multiple of 512, but that would still buy us nothing,
so lets allow them be any within SPA_MAXBLOCKSIZE.

Also special_small_blocks does not really need to depend on the
set recordsize, enabled pool features or presence of special vdev.
At worst in any of those cases it will just do nothing, so we
should not complicate users lives by artificial limitations.

While there, polish comments for recordsize and volblocksize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17497
2025-07-02 11:11:37 -07:00
Olivier Certner
dee62e074a
spa: ZIO_TASKQ_ISSUE: Use symbolic priority
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
This allows to change the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

This upstreams commit fd141584cf89d7d2 from FreeBSD src.

Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17489
2025-06-30 10:24:23 -04:00
Paul Dagnelie
69ee01aa4b
Fix bug caused by rounding in vdev_raidz_asize_to_psize
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
When an allocation is happening on a raidz vdev, the number of sectors
allocated is rounded up to a multiple of nparity + 1. If this results in
the allocation spilling into an extra row, then the corresponding call
to vdev_raidz_asize_to_psize will incorrectly assume that parity sectors
were allocated for that spilled row, even though no data is stored
there.

If we determine that happened, we need to subtract out those extra
sectors before performing the rest of the capacity calculation.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17490
2025-06-27 14:54:20 -04:00
Rob Norris
ea076d6921
vdev_raidz_asize_to_psize: return psize, not asize
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Since 246e588, gang blocks written to raidz vdevs will write past the
end of their allocation, corrupting themselves, other data, or both.

The reason is simple - when allocating the gang children, we call
vdev_psize_to_asize() to find out how much data we should load into the
allocation we just did. vdev_raidz_asize_to_psize() had a bug; it
computed the psize, but returned the original asize. The raidz layer
dutifully writes that much out, into space beyond the end of the
allocation.

If there's existing data there, it gets overwritten, causing checksum
errors when that data is read. Even there's not data there (unlikely,
given that gang blocks are in play at all), that area is not considered
allocated, so can be allocated and overwritten later.

The fix is simple: return the psize we just computed.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17488
2025-06-26 10:19:59 -04:00
Mark Johnston
0a2163d194
FreeBSD: Ensure that z_pflags is initialized for new znodes
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The field is subsequently accessed in zfs_mknode(), in
zfs_inherit_projid().  The Linux implementation of zfs_create_fs() has
this initialization already; there is no counterpart to
zfs_create_share_dir() that I can see.

Reported-by: KMSAN
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #17486
2025-06-25 12:07:17 -04:00
Paul Dagnelie
d461a67d0a
Ensure that gang_copies is always at least as large as copies
As discussed in the comments of PR #17004, you can theoretically run
into a case where a gang child has more copies than the gang header,
which can lead to some odd accounting behavior (and even trip a
VERIFY). While the accounting code could be changed to handle this, it
fundamentally doesn't seem to make a lot of sense to allow this to
happen. If the data is supposed to have a certain level of reliability,
that isn't actually achieved unless the gang_copies property is set to
match it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17484
2025-06-25 12:05:36 -04:00
Rob Norris
46a4075100
Linux 6.16: remove writepage and readahead_page
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17443
2025-06-23 15:51:02 -04:00
Brian Behlendorf
48ce292ea0
Clarify and restrict dmu_tx_assign() errors
There are three possible cases where dmu_tx_assign() may
encounter a fatal error.  When there is a true lack of free
space (ENOSPC), when there is a lack of quota space (EDQUOT),
or when data required to perform the transaction cannot be
read from disk (EIO).  See the dmu_tx_check_ioerr() function
for additional details of on the motivation for check for
I/O error early.

Prior to this change dmu_tx_assign() would return the
contents of tx->tx_err which covered a wide range of possible
error codes (EIO, ECKSUM, ESRCH, etc).  In practice, none
of the callers could do anything useful with this level of
detail and simply returned the error.

Therefore, this change converts all tx->tx_err errors to EIO,
adds ASSERTs to dmu_tx_assign() to cover the only possible
errors, and clarifies the function comment to include EIO as
a possible fatal error.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Closes #17463
2025-06-23 15:48:30 -04:00
Alexander Motin
5e5253be84
FreeBSD: Wire projects support
While FreeBSD itself does not support projects, there is no reason
why it can't be controlled via `zfs project` and other subcommands.
Most of the code is actually already there and just needs some
revival and sync with Linux, plus enabling some tests not depending
on the OS support.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17423
2025-06-19 14:39:20 -07:00
Paul Dagnelie
717213d431
Fix other nonrot bugs
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
There are still a variety of bugs involving the vdev_nonrot property
that will cause problems if you try to run the test suite with
segment-based weighting disabled, and with other things in the weighting
code. Parents' nonrot property need to be updated when children are
added. When vdevs are expanded and more metaslabs are added, the weights
have to be recalculated (since the number of metaslabs is an input to
the lba bias function). When opening, faulted or unopenable children
should not be considered for whether a vdev is nonrot or not (since the
nonrot property is determined during a successful open, this can cause
false negatives). And draid spares need to have the nonrot property set
correctly.

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17469
2025-06-19 09:25:58 -04:00
Attila Fülöp
6cf17f6538
Linux build: handle CONFIG_OBJTOOL_WERROR=y
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Linux 5.16 by default fails the build on objtool warnings. We have
known and understood objtool warnings we can't fix without
involving Linux maintainers.

To work around this we introduce an objtool wrapper script which
removes the `--Werror` flag.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #17456
2025-06-16 08:12:09 -07:00
Alexander Motin
bd27b75401
ZIL: Relax parallel write ZIOs processing
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
ZIL introduced dependencies between its write ZIOs to permit flush
defer, when we flush vdev caches only once all the write ZIOs has
completed.  But it was recently spotted that it serializes not only
ZIO completions handling, but also their ready stage.  It means ZIO
pipeline can't calculate checksums for the following ZIOs until all
the previous are checksumed, even though it is not required.  On a
systems where memory throughput of a single CPU core is limited,
it creates single-core CPU bottleneck, which is difficult to see
due to ZIO pipeline design with many taskqueue threads.

While it would be great to bypass the ready stage waits, it would
require changes to ZIO code, and I haven't found a clean way to do
it.  But I've noticed that we don't need any dependency between
the write ZIOs if the previous one has some waiters, which means
it won't defer any flushes and work as a barrier for the earlier
ones.

Bypassing it won't help large single-thread writes, since all the
write ZIOs except the last in that case won't have waiters, and
so will be dependent.  But in that case the ZIO processing might
not be a bottleneck, since there will be only one thread populating
the write buffers, that will likely be the bottleneck.

But bypassing the ZIO dependency on multi-threaded write workloads
really allows them to scale beyond the checksuming throughput of
one CPU core.

My tests with writing 12 files on a same dataset on a pool with
4 striped NVMes as SLOGs from 12 threads with 1MB blocks on a
system with Xeon Silver 4114 CPU show total throughput increase
from 4.3GB/s to 8.5GB/s, increasing the SLOGs busy from ~30% to
~70%.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17458
2025-06-14 09:37:18 -04:00
Rob Norris
238eab7dc1 FreeBSD: zfs_putpages: don't undirty pages until after write completes
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17445
2025-06-12 14:45:18 -07:00
Rob Norris
aa964ce61b zfs_log_write: only put the callback on the last itx
If a write is split across mutliple itxs, we only want the callback on
the last one, otherwise it will be called for every itx associated with
this single write, which makes it very hard to know what to clean up.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17445
2025-06-12 14:44:33 -07:00
Rob Norris
d1c88cbd4c zpl_sync_fs: work around kernels that ignore sync_fs errors
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:32 -07:00
Rob Norris
e3f5e317e0 zfs_sync: return error when pool suspends
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:27 -07:00
Rob Norris
52352dd748 zfs_sync: remove support for impossible scenarios
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:21 -07:00
Alexander Motin
e0ef4d2768
Improve block cloning transactions accounting
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free.  While they might be from some points, cloning
is not net-free.  It will likely consume space and memory, and
unlike free it will do it no matter whether the destination has
the blocks or not (usually not, so previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-11 11:59:16 -07:00
Alexander Motin
66ec7fb269
Reduce zfs_dmu_offset_next_sync penalty
Looking on txg_wait_synced(, 0) I've noticed that it always syncs
5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE.  But in case
of dmu_offset_next() we do not care about deferred frees. And even
concurrent TXGs we might not need sync all 3 if the dnode was not
dirtied in last few TXGs.

This patch makes dmu_offset_next() to sync one TXG at a time until
the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times.
My tests with random simultaneous writes and seeks over many files
on HDD pool show 7-14% performance increase.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17434
2025-06-11 11:50:49 -07:00
Alexander Motin
4ae931aa93
Polish db_rwlock scope
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
dbuf_verify(): Don't need the lock, since we only compare pointers.

dbuf_findbp(): Don't need the lock, since aside of unneeded assert
we only produce the pointer, but don't de-reference it.

dnode_next_offset_level(): When working on top level indirection
should lock dnode buffer's db_rwlock, since it is our parent.  If
dnode has no buffer, then it is meta-dnode or one of quotas and we
should lock the dataset's ds_bp_rwlock instead.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17441
2025-06-11 11:13:48 -07:00
Rob Norris
fbfda270d5 zcp_synctask: add zfs.sync.clone()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17426
2025-06-10 14:53:10 -07:00
Rob Norris
560e3170ef dsl_dataset: rename dmu_objset_clone* to dsl_dataset_clone*
And make its check and sync functions visible, so I can hook them up to
zcp_synctask. Rename not strictly necessary, but it definitely looks
more like a dsl_dataset thing than a dmu_objset thing, to the extent
that those things even have a meaningful distinction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17426
2025-06-10 14:52:43 -07:00
Alexander Motin
ba227e2cc2
Make TX abort after assign safer
It is not right, but there are few examples when TX is aborted
after being assigned in case of error.  To handle it better on
production systems add extra cleanup steps.

While here, replace couple dmu_tx_abort() in simple cases.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17438
2025-06-10 09:30:06 -07:00
Alexander Motin
bcd0430236
Allow zero compression if dedup is enabled
Having high-refcount dedup entries for zero blocks is inefficient
when they could be recorded as a holes instead.  Normally, zero
compression is not done if compression is disabled to not confuse
naive benchmarks.  But with dedup enabled, it is expected that the
write will be skipped anyway, so we are just optimizing the way it
is skipped.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17435
2025-06-10 09:28:14 -07:00
Mariusz Zaborski
46b82de618
scrub: generate scrub_finish event
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
The `scn_min_txg` can now be used not only with resilver. Instead
of checking `scn_min_txg` to determine whether it’s a resilver or
a scrub, simply check which function is defined. Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #17432
2025-06-06 22:43:10 -04:00
Rob Norris
af7d609592
zpl: handle suspend from two remaining calls to txg_wait_synced()
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
* zfs_link: allow tempfile sync to fail if pool suspends

4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_link() can fall back to
txg_wait_synced() if it has to wait for a tempfile to be fully created
before continuing, which will block if the pool suspends.

Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.

* zfs_clone_range: allow dirty wait to fail if pool suspends

4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_clone_range() can fall back to
txg_wait_synced() if it has to wait for a dirty block to be written out,
which will block if the pool suspends.

Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17413
2025-06-05 15:38:26 -04:00
Attila Fülöp
b96f1a4b1f
Linux build: silence objtool warnings
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-04 17:40:09 -07:00
Alexander Motin
b7f919d228
Relax zfs_vnops_read_chunk_size limitations
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It makes no sense to limit read size below the block size, since
DMU will any way consume resources for the whole block, while the
current zfs_vnops_read_chunk_size is only 1MB, which is smaller
that maximum block size of 16MB.  Plus in case of misaligned
Uncached I/O the buffer may get evicted between the chunks,
requiring repeating I/Os.

On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB.
It allows to less depend on speculative prefetcher if application
requests specific size, first not waiting for prefetcher to start
and later not prefetching more than needed.

Also while there, we don't need to align reads to the chunk size,
but only to a block size, which is smaller and so more forgiving.

My profiles show ~4% of CPU time saving when reading 16MB blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17415
2025-06-04 11:24:15 -04:00
Alexander Motin
68817d28c5
Include class name into struct metaslab_class
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
With increasing number of metaslab classes it can be helpful for
debugging to know what we are looking at.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17409
2025-06-03 11:12:59 -04:00
Alexander Motin
108562344c
Improve allocation fallback handling
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Before this change in case of any allocation error ZFS always fallen
back to normal class.  But with more of different classes available
we migth want more sophisticated logic.  For example, it makes sense
to fall back from dedup first to special class (if it is allowed to
put DDT there) and only then to normal, since in a pool with dedup
and special classes populated normal class likely has performance
characteristics unsuitable for dedup.

This change implements general mechanism where fallback order is
controlled by the same spa_preferred_class() as the initial class
selection.  And as first application it implements the mentioned
dedup->special->normal fallbacks.  I have more plans for it later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17391
2025-05-31 19:12:16 -04:00
Fedor Uporov
e0edfcbd4e
ZVOL: Make zvol_volmode module parameter platform-independent
The module parameter name was not changed in FreeBSD sysctls
list: 'vfs.zfs.vol.mode'. Also, on Linux side the name is:
/sys/module/zfs/parameters/zvol_volmode.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17386
2025-05-31 19:09:50 -04:00
Fedor Uporov
e1677d9ee1
ZVOL: Make zvol_prefetch_bytes module parameter platform-independent
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The module parameter now is represented in FreeBSD sysctls list
with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072,
same as on Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17385
2025-05-31 09:58:54 -04:00
Rob Norris
e8e602d987
zio_add_child: collapse into a single function
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The child locking difference is simple enough to handle with a boolean.
The actual work is more involved, and it's easy to forget to change
things in both places when experimenting. Just collapse them.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17382
2025-05-30 21:18:10 -04:00
Alexander Motin
2d33c8edb6
Make rewrite use Uncached I/O
Rewrite is a one-time/rare bulk administrative operation, which
should minimally affect payload caching.  Plus some avoided memory
copies in its data path allow to significantly increase its speed.

My tests show reduction of time to rewrite 28GB of uncompressed
files on NVMe pool from 17 to 9 seconds and minimal ARC usage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17407
2025-05-30 21:13:49 -04:00
Fedor Uporov
a38376b37a
Rename zvol kernel module parameters sysctls on FreeBSD side
Make 'zvol_threads', 'zvol_num_taskqs' and 'zvol_request_sync' names
compatible with FreeBSD sysctl naming convention. Now the sysctls are
have a next form:

$ sysctl vfs.zfs.vol.threads
vfs.zfs.vol.threads: 0

$ sysctl vfs.zfs.vol.num_taskqs
vfs.zfs.vol.num_taskqs: 0

$ sysctl vfs.zfs.vol.request_sync
vfs.zfs.vol.request_sync: 0

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17406
2025-05-30 13:41:15 -07:00
Alexander Motin
008c9666ef
Set spa_final_txg in spa_unload()
I've noticed that after some dedup tests system reboot ends up in
assertion about ms_defer tree not free.  It seems to be caused by
DDT flushing still freeing some blocks while ZFS trying to reach
a final steady state due to spa_final_txg, while being set by
spa_export_common() on pool export, is not set when spa_unload()
is called by spa_evict_all() on system reboot/shutdown.  Setting
spa_final_txg in spa_unload() fixes this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17395
2025-05-30 14:44:45 -04:00
Ameer Hamza
b3b3cd1e4f vdev: skip faulting disks pending removal
This patch fixes a race where vdev_remove_wanted may be set after probe
initiation, which could otherwise trigger redundant fault and removal.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17400
2025-05-30 09:14:37 -07:00
Rob Norris
5764e218ba
vdev_disk: remove classic IO submission
Since it was disabled for 2.3, there's been no confirmed sightings of
strange IO errors, misalignments or related shenanigans. Absence of
evidence and all that, but I'd rather fix bugs in the new code than in
the old.

  "It isn't hubris until he's failed."
    -- Chrisjen Avasarala

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17399
2025-05-30 10:31:02 -04:00
Rob Norris
44e3266894
events: include zio type in IO error reports
Usually the IO type can be inferred from the other fields (in
particular, priority and flags) sometimes it's not easy to see. This is
just another little debug helper.

    May 27 2025 00:54:54.024110493 ereport.fs.zfs.data
            class = "ereport.fs.zfs.data"
            ena = 0x1f5ecfae600801
            ...
            zio_delta = 0x0
            zio_type = 0x2 [WRITE]
            zio_priority = 0x3 [ASYNC_WRITE]
            zio_objset = 0x0

Document zio_type and zio_priority.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17381
2025-05-30 10:29:29 -04:00
Rob Norris
0c94d3838d
linux/zvol_os: don't try to set disk ops if alloc fails
If the kernel fails to allocate the gendisk, zvo_disk will be NULL, and
derefencing it will explode. So don't do that.

Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17396
2025-05-30 10:25:09 -04:00
Attila Fülöp
3084336ae4
Linux build: always use objtool
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
We silence `objtool` warnings on some object files using
`OBJECT_FILES_NON_STANDARD_some_file.o`. Nowadays `objtool` is
needed for CPU vulnerability mitigations and a lot more
functionality so its use is desirable.

Just remove the `OBJECT_FILES_NON_STANDARD` definitions. A follow-up
commit is needed to make the offending files standard and address
the compile time warnings.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #17401
Closes #17364
2025-05-29 18:04:20 -07:00
Fedor Uporov
3dfa98d013
ZVOL: Make zvol_inhibit_dev module parameter platform-independent
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The module parameter now is represented in FreeBSD sysctls list with
name: 'vfs.zfs.vol.inhibit_dev'. The default value is '0', same as on
Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17384
2025-05-29 09:37:41 -04:00
Alexander Motin
fa697b94e6
FreeBSD: Add posix_fadvise(POSIX_FADV_WILLNEED) support
As commit 320f0c6 did for Linux, connect POSIX_FADV_WILLNEED
up to dmu_prefetch() on FreeBSD.

While there, fix portability problems in tests/functional/fadvise.

1.  Instead of relying on the numerical values of POSIX_FADV_XXX macros,
    accept macro names as arguments to the file_fadvise program.  (The
    numbers happen to match on Linux and FreeBSD, but future systems may
    vary and it seems a little strange/raw to count on that.)

2.  For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD
    VFS currently (perhaps something that should be investigated in
    FreeBSD).  Since on Linux we're treating SEQUENTIAL and WILLNEED the
    same, it doesn't really matter which one we use, so switch the test
    over to WILLNEED exercise the new prefetch code on both OSes the
    same way.

Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Thomas Munro <tmunro@FreeBSD.org>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #17379
2025-05-29 09:34:07 -04:00
Rob Norris
00360efa35 tunables: fix spelling
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Three occurences with an 'e', and all of them mine. Maybe it's an
British thing?

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
589d99171f tunables: use Linux ullong param ops for u64
Since 3.17 Linux has provided param ops for 64-bit ints, so we don't
need to use our own anymore.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
6e7e7ea7ef tunables: remove support for s64 tunables
Nothing uses them now.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
58235f52af tunables: remove direct use of module_param_cb
The use for spl_taskq_kick was the only use, and the comment that
module_param_call is obsolete is no longer true - it's still very much
used even in recent kernels.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
7b183f1918 tunables: remove FreeBSD compat macros for Linux module params
Nothing in any FreeBSD code uses them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00