Commit Graph

5049 Commits

Author SHA1 Message Date
Rob Norris
6af8db61b1
metaslab: don't pass whole zio to throttle reserve APIs
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
They only need a couple of fields, and passing the whole thing just
invites fiddling around inside it, like modifying flags, which then
makes it much harder to understand the zio state from inside zio.c.

We move the flag update to just after a successful throttle in zio.c.

Rename ZIO_FLAG_IO_ALLOCATING to ZIO_FLAG_ALLOC_THROTTLED
Better describes what it means, and makes it look less like
IO_IS_ALLOCATING, which means something different.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17508
2025-07-04 23:22:22 -04:00
Rob Norris
92d3b4ee2c
zio: rename io_reexecute as io_post; use it for the direct IO checksum error flag
We're not supposed to modify someone else's io_flags, so we need another
way to propagate DIO_CHKSUM_ERR.

If we squint, we can see that io_reexecute is really just recording
exceptional events that a parent (or its parents) will need to do
something about. It just happens that the only things we've had
historically are two forms of reexecution: now or later (suspend).

So, rename it to io_post, as in, post-IO info/events/actions. And now we
have a few spare bits for other conditions.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17507
2025-07-04 23:16:14 -04:00
Alexander Motin
4e92aee233
Relax special_small_blocks restrictions
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
special_small_blocks is applied to blocks after compression, so it
makes no sense to demand its values to be power of 2.  At most
they could be multiple of 512, but that would still buy us nothing,
so lets allow them be any within SPA_MAXBLOCKSIZE.

Also special_small_blocks does not really need to depend on the
set recordsize, enabled pool features or presence of special vdev.
At worst in any of those cases it will just do nothing, so we
should not complicate users lives by artificial limitations.

While there, polish comments for recordsize and volblocksize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17497
2025-07-02 11:11:37 -07:00
Olivier Certner
dee62e074a
spa: ZIO_TASKQ_ISSUE: Use symbolic priority
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
This allows to change the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

This upstreams commit fd141584cf89d7d2 from FreeBSD src.

Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17489
2025-06-30 10:24:23 -04:00
Paul Dagnelie
69ee01aa4b
Fix bug caused by rounding in vdev_raidz_asize_to_psize
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
When an allocation is happening on a raidz vdev, the number of sectors
allocated is rounded up to a multiple of nparity + 1. If this results in
the allocation spilling into an extra row, then the corresponding call
to vdev_raidz_asize_to_psize will incorrectly assume that parity sectors
were allocated for that spilled row, even though no data is stored
there.

If we determine that happened, we need to subtract out those extra
sectors before performing the rest of the capacity calculation.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17490
2025-06-27 14:54:20 -04:00
Rob Norris
ea076d6921
vdev_raidz_asize_to_psize: return psize, not asize
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Since 246e588, gang blocks written to raidz vdevs will write past the
end of their allocation, corrupting themselves, other data, or both.

The reason is simple - when allocating the gang children, we call
vdev_psize_to_asize() to find out how much data we should load into the
allocation we just did. vdev_raidz_asize_to_psize() had a bug; it
computed the psize, but returned the original asize. The raidz layer
dutifully writes that much out, into space beyond the end of the
allocation.

If there's existing data there, it gets overwritten, causing checksum
errors when that data is read. Even there's not data there (unlikely,
given that gang blocks are in play at all), that area is not considered
allocated, so can be allocated and overwritten later.

The fix is simple: return the psize we just computed.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17488
2025-06-26 10:19:59 -04:00
Mark Johnston
0a2163d194
FreeBSD: Ensure that z_pflags is initialized for new znodes
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The field is subsequently accessed in zfs_mknode(), in
zfs_inherit_projid().  The Linux implementation of zfs_create_fs() has
this initialization already; there is no counterpart to
zfs_create_share_dir() that I can see.

Reported-by: KMSAN
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #17486
2025-06-25 12:07:17 -04:00
Paul Dagnelie
d461a67d0a
Ensure that gang_copies is always at least as large as copies
As discussed in the comments of PR #17004, you can theoretically run
into a case where a gang child has more copies than the gang header,
which can lead to some odd accounting behavior (and even trip a
VERIFY). While the accounting code could be changed to handle this, it
fundamentally doesn't seem to make a lot of sense to allow this to
happen. If the data is supposed to have a certain level of reliability,
that isn't actually achieved unless the gang_copies property is set to
match it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17484
2025-06-25 12:05:36 -04:00
Rob Norris
46a4075100
Linux 6.16: remove writepage and readahead_page
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17443
2025-06-23 15:51:02 -04:00
Brian Behlendorf
48ce292ea0
Clarify and restrict dmu_tx_assign() errors
There are three possible cases where dmu_tx_assign() may
encounter a fatal error.  When there is a true lack of free
space (ENOSPC), when there is a lack of quota space (EDQUOT),
or when data required to perform the transaction cannot be
read from disk (EIO).  See the dmu_tx_check_ioerr() function
for additional details of on the motivation for check for
I/O error early.

Prior to this change dmu_tx_assign() would return the
contents of tx->tx_err which covered a wide range of possible
error codes (EIO, ECKSUM, ESRCH, etc).  In practice, none
of the callers could do anything useful with this level of
detail and simply returned the error.

Therefore, this change converts all tx->tx_err errors to EIO,
adds ASSERTs to dmu_tx_assign() to cover the only possible
errors, and clarifies the function comment to include EIO as
a possible fatal error.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian D Behlendorf <behlendo@slag12.llnl.gov>
Closes #17463
2025-06-23 15:48:30 -04:00
Alexander Motin
5e5253be84
FreeBSD: Wire projects support
While FreeBSD itself does not support projects, there is no reason
why it can't be controlled via `zfs project` and other subcommands.
Most of the code is actually already there and just needs some
revival and sync with Linux, plus enabling some tests not depending
on the OS support.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17423
2025-06-19 14:39:20 -07:00
Paul Dagnelie
717213d431
Fix other nonrot bugs
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
There are still a variety of bugs involving the vdev_nonrot property
that will cause problems if you try to run the test suite with
segment-based weighting disabled, and with other things in the weighting
code. Parents' nonrot property need to be updated when children are
added. When vdevs are expanded and more metaslabs are added, the weights
have to be recalculated (since the number of metaslabs is an input to
the lba bias function). When opening, faulted or unopenable children
should not be considered for whether a vdev is nonrot or not (since the
nonrot property is determined during a successful open, this can cause
false negatives). And draid spares need to have the nonrot property set
correctly.

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17469
2025-06-19 09:25:58 -04:00
Attila Fülöp
6cf17f6538
Linux build: handle CONFIG_OBJTOOL_WERROR=y
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Linux 5.16 by default fails the build on objtool warnings. We have
known and understood objtool warnings we can't fix without
involving Linux maintainers.

To work around this we introduce an objtool wrapper script which
removes the `--Werror` flag.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #17456
2025-06-16 08:12:09 -07:00
Alexander Motin
bd27b75401
ZIL: Relax parallel write ZIOs processing
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
ZIL introduced dependencies between its write ZIOs to permit flush
defer, when we flush vdev caches only once all the write ZIOs has
completed.  But it was recently spotted that it serializes not only
ZIO completions handling, but also their ready stage.  It means ZIO
pipeline can't calculate checksums for the following ZIOs until all
the previous are checksumed, even though it is not required.  On a
systems where memory throughput of a single CPU core is limited,
it creates single-core CPU bottleneck, which is difficult to see
due to ZIO pipeline design with many taskqueue threads.

While it would be great to bypass the ready stage waits, it would
require changes to ZIO code, and I haven't found a clean way to do
it.  But I've noticed that we don't need any dependency between
the write ZIOs if the previous one has some waiters, which means
it won't defer any flushes and work as a barrier for the earlier
ones.

Bypassing it won't help large single-thread writes, since all the
write ZIOs except the last in that case won't have waiters, and
so will be dependent.  But in that case the ZIO processing might
not be a bottleneck, since there will be only one thread populating
the write buffers, that will likely be the bottleneck.

But bypassing the ZIO dependency on multi-threaded write workloads
really allows them to scale beyond the checksuming throughput of
one CPU core.

My tests with writing 12 files on a same dataset on a pool with
4 striped NVMes as SLOGs from 12 threads with 1MB blocks on a
system with Xeon Silver 4114 CPU show total throughput increase
from 4.3GB/s to 8.5GB/s, increasing the SLOGs busy from ~30% to
~70%.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17458
2025-06-14 09:37:18 -04:00
Rob Norris
238eab7dc1 FreeBSD: zfs_putpages: don't undirty pages until after write completes
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17445
2025-06-12 14:45:18 -07:00
Rob Norris
aa964ce61b zfs_log_write: only put the callback on the last itx
If a write is split across mutliple itxs, we only want the callback on
the last one, otherwise it will be called for every itx associated with
this single write, which makes it very hard to know what to clean up.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <markj@FreeBSD.org>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17445
2025-06-12 14:44:33 -07:00
Rob Norris
d1c88cbd4c zpl_sync_fs: work around kernels that ignore sync_fs errors
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:32 -07:00
Rob Norris
e3f5e317e0 zfs_sync: return error when pool suspends
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:27 -07:00
Rob Norris
52352dd748 zfs_sync: remove support for impossible scenarios
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17420
2025-06-12 14:42:21 -07:00
Alexander Motin
e0ef4d2768
Improve block cloning transactions accounting
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Previous dmu_tx_count_clone() was broken, stating that cloning is
similar to free.  While they might be from some points, cloning
is not net-free.  It will likely consume space and memory, and
unlike free it will do it no matter whether the destination has
the blocks or not (usually not, so previous code did nothing).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17431
2025-06-11 11:59:16 -07:00
Alexander Motin
66ec7fb269
Reduce zfs_dmu_offset_next_sync penalty
Looking on txg_wait_synced(, 0) I've noticed that it always syncs
5 TXGs: 3 TXG_CONCURRENT_STATES + 2 TXG_DEFER_SIZE.  But in case
of dmu_offset_next() we do not care about deferred frees. And even
concurrent TXGs we might not need sync all 3 if the dnode was not
dirtied in last few TXGs.

This patch makes dmu_offset_next() to sync one TXG at a time until
the dnode is clean, but no more than 3 TXG_CONCURRENT_STATES times.
My tests with random simultaneous writes and seeks over many files
on HDD pool show 7-14% performance increase.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17434
2025-06-11 11:50:49 -07:00
Alexander Motin
4ae931aa93
Polish db_rwlock scope
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
dbuf_verify(): Don't need the lock, since we only compare pointers.

dbuf_findbp(): Don't need the lock, since aside of unneeded assert
we only produce the pointer, but don't de-reference it.

dnode_next_offset_level(): When working on top level indirection
should lock dnode buffer's db_rwlock, since it is our parent.  If
dnode has no buffer, then it is meta-dnode or one of quotas and we
should lock the dataset's ds_bp_rwlock instead.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17441
2025-06-11 11:13:48 -07:00
Rob Norris
fbfda270d5 zcp_synctask: add zfs.sync.clone()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17426
2025-06-10 14:53:10 -07:00
Rob Norris
560e3170ef dsl_dataset: rename dmu_objset_clone* to dsl_dataset_clone*
And make its check and sync functions visible, so I can hook them up to
zcp_synctask. Rename not strictly necessary, but it definitely looks
more like a dsl_dataset thing than a dmu_objset thing, to the extent
that those things even have a meaningful distinction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17426
2025-06-10 14:52:43 -07:00
Alexander Motin
ba227e2cc2
Make TX abort after assign safer
It is not right, but there are few examples when TX is aborted
after being assigned in case of error.  To handle it better on
production systems add extra cleanup steps.

While here, replace couple dmu_tx_abort() in simple cases.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17438
2025-06-10 09:30:06 -07:00
Alexander Motin
bcd0430236
Allow zero compression if dedup is enabled
Having high-refcount dedup entries for zero blocks is inefficient
when they could be recorded as a holes instead.  Normally, zero
compression is not done if compression is disabled to not confuse
naive benchmarks.  But with dedup enabled, it is expected that the
write will be skipped anyway, so we are just optimizing the way it
is skipped.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17435
2025-06-10 09:28:14 -07:00
Mariusz Zaborski
46b82de618
scrub: generate scrub_finish event
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
The `scn_min_txg` can now be used not only with resilver. Instead
of checking `scn_min_txg` to determine whether it’s a resilver or
a scrub, simply check which function is defined. Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Closes #17432
2025-06-06 22:43:10 -04:00
Rob Norris
af7d609592
zpl: handle suspend from two remaining calls to txg_wait_synced()
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
* zfs_link: allow tempfile sync to fail if pool suspends

4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_link() can fall back to
txg_wait_synced() if it has to wait for a tempfile to be fully created
before continuing, which will block if the pool suspends.

Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.

* zfs_clone_range: allow dirty wait to fail if pool suspends

4653e2f7d3 (#17355) allows dmu_tx_assign() to fail if the pool suspends
when failmode=continue, but zfs_clone_range() can fall back to
txg_wait_synced() if it has to wait for a dirty block to be written out,
which will block if the pool suspends.

Handle this by requesting an error return if the pool suspends when
failmode=continue, and if that happens, return EIO.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17413
2025-06-05 15:38:26 -04:00
Attila Fülöp
b96f1a4b1f
Linux build: silence objtool warnings
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
After #17401 the Linux build produces some stack related warnings.

Silence them with the `STACK_FRAME_NON_STANDARD` macro.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #17410
2025-06-04 17:40:09 -07:00
Alexander Motin
b7f919d228
Relax zfs_vnops_read_chunk_size limitations
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It makes no sense to limit read size below the block size, since
DMU will any way consume resources for the whole block, while the
current zfs_vnops_read_chunk_size is only 1MB, which is smaller
that maximum block size of 16MB.  Plus in case of misaligned
Uncached I/O the buffer may get evicted between the chunks,
requiring repeating I/Os.

On 64-bit platforms increase zfs_vnops_read_chunk_size to 32MB.
It allows to less depend on speculative prefetcher if application
requests specific size, first not waiting for prefetcher to start
and later not prefetching more than needed.

Also while there, we don't need to align reads to the chunk size,
but only to a block size, which is smaller and so more forgiving.

My profiles show ~4% of CPU time saving when reading 16MB blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17415
2025-06-04 11:24:15 -04:00
Alexander Motin
68817d28c5
Include class name into struct metaslab_class
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
With increasing number of metaslab classes it can be helpful for
debugging to know what we are looking at.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17409
2025-06-03 11:12:59 -04:00
Alexander Motin
108562344c
Improve allocation fallback handling
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Before this change in case of any allocation error ZFS always fallen
back to normal class.  But with more of different classes available
we migth want more sophisticated logic.  For example, it makes sense
to fall back from dedup first to special class (if it is allowed to
put DDT there) and only then to normal, since in a pool with dedup
and special classes populated normal class likely has performance
characteristics unsuitable for dedup.

This change implements general mechanism where fallback order is
controlled by the same spa_preferred_class() as the initial class
selection.  And as first application it implements the mentioned
dedup->special->normal fallbacks.  I have more plans for it later.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17391
2025-05-31 19:12:16 -04:00
Fedor Uporov
e0edfcbd4e
ZVOL: Make zvol_volmode module parameter platform-independent
The module parameter name was not changed in FreeBSD sysctls
list: 'vfs.zfs.vol.mode'. Also, on Linux side the name is:
/sys/module/zfs/parameters/zvol_volmode.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17386
2025-05-31 19:09:50 -04:00
Fedor Uporov
e1677d9ee1
ZVOL: Make zvol_prefetch_bytes module parameter platform-independent
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The module parameter now is represented in FreeBSD sysctls list
with name: 'vfs.zfs.vol.prefetch_bytes'. The default value is 131072,
same as on Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17385
2025-05-31 09:58:54 -04:00
Rob Norris
e8e602d987
zio_add_child: collapse into a single function
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The child locking difference is simple enough to handle with a boolean.
The actual work is more involved, and it's easy to forget to change
things in both places when experimenting. Just collapse them.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17382
2025-05-30 21:18:10 -04:00
Alexander Motin
2d33c8edb6
Make rewrite use Uncached I/O
Rewrite is a one-time/rare bulk administrative operation, which
should minimally affect payload caching.  Plus some avoided memory
copies in its data path allow to significantly increase its speed.

My tests show reduction of time to rewrite 28GB of uncompressed
files on NVMe pool from 17 to 9 seconds and minimal ARC usage.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17407
2025-05-30 21:13:49 -04:00
Fedor Uporov
a38376b37a
Rename zvol kernel module parameters sysctls on FreeBSD side
Make 'zvol_threads', 'zvol_num_taskqs' and 'zvol_request_sync' names
compatible with FreeBSD sysctl naming convention. Now the sysctls are
have a next form:

$ sysctl vfs.zfs.vol.threads
vfs.zfs.vol.threads: 0

$ sysctl vfs.zfs.vol.num_taskqs
vfs.zfs.vol.num_taskqs: 0

$ sysctl vfs.zfs.vol.request_sync
vfs.zfs.vol.request_sync: 0

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17406
2025-05-30 13:41:15 -07:00
Alexander Motin
008c9666ef
Set spa_final_txg in spa_unload()
I've noticed that after some dedup tests system reboot ends up in
assertion about ms_defer tree not free.  It seems to be caused by
DDT flushing still freeing some blocks while ZFS trying to reach
a final steady state due to spa_final_txg, while being set by
spa_export_common() on pool export, is not set when spa_unload()
is called by spa_evict_all() on system reboot/shutdown.  Setting
spa_final_txg in spa_unload() fixes this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17395
2025-05-30 14:44:45 -04:00
Ameer Hamza
b3b3cd1e4f vdev: skip faulting disks pending removal
This patch fixes a race where vdev_remove_wanted may be set after probe
initiation, which could otherwise trigger redundant fault and removal.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17400
2025-05-30 09:14:37 -07:00
Rob Norris
5764e218ba
vdev_disk: remove classic IO submission
Since it was disabled for 2.3, there's been no confirmed sightings of
strange IO errors, misalignments or related shenanigans. Absence of
evidence and all that, but I'd rather fix bugs in the new code than in
the old.

  "It isn't hubris until he's failed."
    -- Chrisjen Avasarala

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17399
2025-05-30 10:31:02 -04:00
Rob Norris
44e3266894
events: include zio type in IO error reports
Usually the IO type can be inferred from the other fields (in
particular, priority and flags) sometimes it's not easy to see. This is
just another little debug helper.

    May 27 2025 00:54:54.024110493 ereport.fs.zfs.data
            class = "ereport.fs.zfs.data"
            ena = 0x1f5ecfae600801
            ...
            zio_delta = 0x0
            zio_type = 0x2 [WRITE]
            zio_priority = 0x3 [ASYNC_WRITE]
            zio_objset = 0x0

Document zio_type and zio_priority.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17381
2025-05-30 10:29:29 -04:00
Rob Norris
0c94d3838d
linux/zvol_os: don't try to set disk ops if alloc fails
If the kernel fails to allocate the gendisk, zvo_disk will be NULL, and
derefencing it will explode. So don't do that.

Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17396
2025-05-30 10:25:09 -04:00
Attila Fülöp
3084336ae4
Linux build: always use objtool
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
We silence `objtool` warnings on some object files using
`OBJECT_FILES_NON_STANDARD_some_file.o`. Nowadays `objtool` is
needed for CPU vulnerability mitigations and a lot more
functionality so its use is desirable.

Just remove the `OBJECT_FILES_NON_STANDARD` definitions. A follow-up
commit is needed to make the offending files standard and address
the compile time warnings.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #17401
Closes #17364
2025-05-29 18:04:20 -07:00
Fedor Uporov
3dfa98d013
ZVOL: Make zvol_inhibit_dev module parameter platform-independent
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The module parameter now is represented in FreeBSD sysctls list with
name: 'vfs.zfs.vol.inhibit_dev'. The default value is '0', same as on
Linux side.

Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17384
2025-05-29 09:37:41 -04:00
Alexander Motin
fa697b94e6
FreeBSD: Add posix_fadvise(POSIX_FADV_WILLNEED) support
As commit 320f0c6 did for Linux, connect POSIX_FADV_WILLNEED
up to dmu_prefetch() on FreeBSD.

While there, fix portability problems in tests/functional/fadvise.

1.  Instead of relying on the numerical values of POSIX_FADV_XXX macros,
    accept macro names as arguments to the file_fadvise program.  (The
    numbers happen to match on Linux and FreeBSD, but future systems may
    vary and it seems a little strange/raw to count on that.)

2.  For implementation reasons, SEQUENTIAL doesn't reach ZFS via FreeBSD
    VFS currently (perhaps something that should be investigated in
    FreeBSD).  Since on Linux we're treating SEQUENTIAL and WILLNEED the
    same, it doesn't really matter which one we use, so switch the test
    over to WILLNEED exercise the new prefetch code on both OSes the
    same way.

Reviewed-by: Mateusz Guzik <mjg@FreeBSD.org>
Reviewed-by: Fedor Uporov <fuporov.vstack@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Thomas Munro <tmunro@FreeBSD.org>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #17379
2025-05-29 09:34:07 -04:00
Rob Norris
00360efa35 tunables: fix spelling
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Three occurences with an 'e', and all of them mine. Maybe it's an
British thing?

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
589d99171f tunables: use Linux ullong param ops for u64
Since 3.17 Linux has provided param ops for 64-bit ints, so we don't
need to use our own anymore.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
6e7e7ea7ef tunables: remove support for s64 tunables
Nothing uses them now.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
58235f52af tunables: remove direct use of module_param_cb
The use for spl_taskq_kick was the only use, and the comment that
module_param_call is obsolete is no longer true - it's still very much
used even in recent kernels.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
7b183f1918 tunables: remove FreeBSD compat macros for Linux module params
Nothing in any FreeBSD code uses them.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
b0e053a10d tunables: ensure tunable and variable have same define gate
If a variable is only available in the kernel, then the tunable should
also only be available there.

This matters very little so long as we don't have userspace tunables,
but its still good hygeine.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
d1724b59dc tunables: don't assert initialisation in impl getters
It actually doesn't matter if it's not initialised when we first query
the current value; it just returns empty-string. A crash is quite
obnoxious even if it is a rare case.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Rob Norris
5ef91c2bee zfs_log: make zfs_immediate_write_sz uint
Likely it's only int64 for comparison with ssize_t, which is signed.
However, it would make no sense for it to be less than 0 or greater than
4G, so making it a regular uint will make it safe for comparison and
remove the only S64 tunable in core.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17377
2025-05-28 16:50:22 -07:00
Paul Dagnelie
c464f1d014
Only interrupt active disk I/Os in failmode=continue
failmode=continue is in a sorry state. Originally designed to fix a very
specific problem, it causes crashes and panics for most people who end
up trying to use it. At this point, we should either remove it entirely,
or try to make it more usable.

With this patch, I choose the latter. While the feature is fundamentally
unpredictable and prone to race conditions, it should be possible to get
it to the point where it can at least sometimes be useful for some
users. This patch fixes one of the major issues with failmode=continue:
it interrupts even ZIOs that are patiently waiting in line behind stuck
IOs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17372
2025-05-28 15:31:32 -07:00
Rob Norris
0372def8c9 vdev_geom: converted injected EIO errors to ENXIO
By the assertion, vdev_geom_io_done() only expects ENXIO on an error
when the geom is a top-level (allocating) vdev[1][2]. However, zinject
currently can't insert ENXIO directly, possibly because on Solaris
outright disk failures were reported with EIO[2][3].

This is a narrow workaround to convert EIO to ENXIO when injections are
enabled, to avoid the assertion and allow the test suite to test
behaviour related to probe failure on FreeBSD.

1. freebsd/freebsd-src@37ec52ca7a
2. freebsd/freebsd-src@cd730bd6b2
3. illumos/illumos-gate@ea8dc4b6d2

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:29:11 -07:00
Rob Norris
55d035e866 dmu_tx_assign: make all VERIFY0 calls use DMU_TX_SUSPEND
This is the cheap way to keep non-user functions working after
break-on-suspend becomes default.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:28:59 -07:00
Rob Norris
4653e2f7d3 dmu_tx: break tx assign/wait when pool suspends
This adjusts dmu_tx_assign/dmu_tx_wait to be interruptable if the pool
suspends while they're waiting, rather than just on the initial check
before falling back into a wait.

Since that's not always wanted, add a DMU_TX_SUSPEND flag to ignore
suspend entirely, effectively returning to the previous behaviour.

With that, it shouldn't be possible for anything with a standard
dmu_tx_assign/wait/abort loop to block under failmode=continue.

Also should be a bit tighter than the old behaviour, where a
VERIFY0(dmu_tx_assign(DMU_TX_WAIT)) could technically fail if the pool
is already suspended and failmode=continue.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:28:51 -07:00
Rob Norris
ac2e579521 dmu_tx: make DMU_TX_* flags an enum
Mostly for a little more type checking and debugging visibility.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:28:46 -07:00
Rob Norris
468d22d60c txg_wait_synced_flags: add TXG_WAIT_SUSPEND flag to not wait if pool suspended
This allows a caller to request a wait for txg sync, with an appropriate
error return if the pool is suspended or becomes suspended during the
wait.

To support this, txg_wait_kick() is added to signal the sync condvar,
which wakes up the waiters, causing them to loop and reconsider their
wait conditions again. zio_suspend() now calls this to trigger the break
if the pool suspends while waiting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17355
2025-05-28 10:27:46 -07:00
Pavel Snajdr
8487945034
zcp: get_prop: fix encryptionroot and encryption
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It was reported that channel programs' zfs.get_prop doesn't work for
dataset properties encryption and encryptionroot.

They are handled in get_special_prop due to the need to call
dsl_dataset_crypt_stats to load those dsl props.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Graham Christensen <graham@grahamc.com>
Closes #17280
2025-05-27 20:04:37 -04:00
Fedor Uporov
087d7d80c7
ZVOL: Comment platform-specific empty functions bodies on FreeBSD side
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17383
2025-05-27 20:00:25 -04:00
Rob Norris
fc617645a3 vdev_disk: remove zfs_vdev_scheduler option
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It has existed as a warning since 0.8.3, 5+ years ago. I think people
have had enough time.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17376
2025-05-27 15:06:15 -07:00
Rob Norris
284580c878 dmu_traverse: remove 'ignore_hole_birth' tunable alias
It's been many years, we can probably do without.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17376
2025-05-27 15:05:09 -07:00
Ameer Hamza
2a91d577b1
Expose dataset encryption status via fast stat path
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
In truenas_pylibzfs, we query list of encrypted datasets several times,
which is expensive. This commit exposes a public API zfs_is_encrypted()
to get encryption status from fast stat path without having to refresh
the properties.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17368
2025-05-26 22:11:03 -04:00
Don Brady
b048bfa9c1
Allow opt-in of zvol blocks in special class
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Don Brady <dev.fs.zfs@gmail.com>
Closes #14876
2025-05-24 16:44:26 -04:00
Alexander Motin
9d76950d67
ZIL: Improve write log size accounting
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Before this change write log size TXG throttling mechanism was
accounting only user payload bytes.  But the actual ZIL both on
disk and especially in memory include headers of hundred(s) of
bytes.  Not accouting those may allow applications like
bonnie++, in their wisdom writing one byte at a time, to consume
excessive amount of memory and ZIL/SLOG in one TXG.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17373
2025-05-23 21:48:46 -04:00
Paul Dagnelie
ddf28f27c5
Fix off-by-one bug in range tree code
Without this fix, zfs_range_tree_find_in could return an overlap when
the found range starts immediately after the searched range, with no
actual overlap.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17363
2025-05-23 10:33:33 -04:00
Alexander Motin
5c30b24381
Fix null dereference in spa_vdev_remove_cancel_sync()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
We don't really need to access space map to know where the metaslab
ends, while msp->ms_sm might be NULL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes #17164
Fixes #17359
Closes #17361
2025-05-22 10:47:43 -04:00
Richard Yao
83fa80a550
dmu_objset_hold_flags() should call dsl_dataset_rele_flags() on error
This was caught when doing a manual check to see if #17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in #17340.

Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes #17353
2025-05-20 08:35:45 -07:00
Rob Norris
841be1d049 Linux 6.2/6.15: del_timer_sync() renamed to timer_delete_sync()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or
functional change apart from that, so a very minimal update for us.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17229
2025-05-19 11:13:00 -07:00
Rob Norris
bb740d66de Linux 6.15: mkdir now returns struct dentry *
The intent is that the filesystem may have a reference to an "old"
version of the new directory, eg if it was keeping it alive because a
remote NFS client still had it open.

We don't need anything like that, so this really just changes things so
we return error codes encoded in pointers.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17229
2025-05-19 11:12:49 -07:00
Richard Yao
d8a33bc0a5
icp: Use explicit_memset() exclusively in gcm_clear_ctx()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
d634d20d1b had been intended to fix a
potential information leak issue where the compiler's optimization
passes appeared to remove `memset()` operations that sanitize sensitive
data before memory is freed for use by the rest of the kernel.

When I wrote it, I had assumed that the compiler would not remove the
other `memset()` operations, but upon reflection, I have realized that
this was a bad assumption to make. I would rather have a very slight
amount of additional overhead when calling `gcm_clear_ctx()` than risk a
future compiler remove `memset()` calls. This is likely to happen if
someone decides to try doing link time optimization and the person will
not think to audit the assembly output for issues like this, so it is
best to preempt the possibility before it happens.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <richard@ryao.dev>
Closes #17343
2025-05-19 10:04:05 -07:00
George Amanakis
ea74cdedda
Fix 2 bugs in non-raw send with encryption
Bisecting identified the redacted send/receive as the source of the bug
for issue #12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) has been replaced by
dsl_dataset_hold_obj_flags() which passes a DECRYPT flag and creates
a key mapping. A subsequent dsl_dataset_rele_flag(&fromds) is missing
and the key mapping is not cleared. This may be inadvertedly used, which
results in arc_untransform failing with ECKSUM in:
 arc_untransform+0x96/0xb0 [zfs]
 dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
 dbuf_read+0x56/0x770 [zfs]
 dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
 zap_lockdir+0x87/0xf0 [zfs]
 zap_lookup_norm+0x5c/0xd0 [zfs]
 zap_lookup+0x16/0x20 [zfs]
 zfs_get_zplprop+0x8d/0x1d0 [zfs]
 setup_featureflags+0x267/0x2e0 [zfs]
 dmu_send_impl+0xe7/0xcb0 [zfs]
 dmu_send_obj+0x265/0x360 [zfs]
 zfs_ioc_send+0x10c/0x280 [zfs]

Fix this by restoring the call to dsl_dataset_hold_obj().

The same applies for to_ds: here replace dsl_dataset_rele(&to_ds) with
dsl_dataset_rele_flags().

Both leaked key mappings will cause a panic when exporting the
sending pool or unloading the zfs module after a non-raw send from
an encrypted filesystem.

Contributions-by: Hank Barta <hbarta@gmail.com>
Contributions-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #12014 
Closes #17340
2025-05-19 09:55:00 -07:00
Alexander Motin
e55225be3e
Add explicit DMU_DIRECTIO checks
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
UIO_DIRECT means we can do Direct I/O, while DMU_DIRECTIO we want
to do it.  First does not automatically means second.  Add few
checks to not use Direct I/O in few cases we don't want it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17342
2025-05-16 23:27:05 -04:00
Alexander Motin
d5616ad34a
Increase meta-dnode redundancy in "some" mode
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Loss of one indirect block of the meta dnode likely means loss of
the whole dataset.  It is worse than one file that the man page
promises, and in my opinion is not much better than "none" mode.

This change restores redundancy of the meta-dnode indirect blocks,
while same time still corrects expectations in the man page.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17339
2025-05-16 13:23:32 -04:00
Paul Dagnelie
086105f4c4
Cause zpool scan resume commands to get logged in history
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.

There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dmu_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero.  However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.

The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
facility to the common ioctl code that allows each command to specify
whether or not success happened, regardless of the return values. I am
open to feedback on which option people think would be better.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #17301
2025-05-16 13:19:04 -04:00
Allan Jude
b6916f995e
ARC: parallel eviction
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.

This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.

A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Youzhong Yang <youzhong@gmail.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Signed-off-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Alexander Stetsenko <alex.stetsenko@klarasystems.com>
Closes #16486
2025-05-14 10:38:32 -04:00
Alexander Motin
89a8a91582
ARC: Notify dbuf cache about target size reduction
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too.  Aside of memory from the dbuf eviction
itself, it might help ARC by making more buffers evictable.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17314
2025-05-14 10:34:14 -04:00
Alexander Motin
734eba251d
Wire O_DIRECT also to Uncached I/O (#17218)
Before Direct I/O was implemented, I've implemented lighter version
I called Uncached I/O.  It uses normal DMU/ARC data path with some
optimizations, but evicts data from caches as soon as possible and
reasonable.  Originally I wired it only to a primarycache property,
but now completing the integration all the way up to the VFS.

While Direct I/O has the lowest possible memory bandwidth usage,
it also has a significant number of limitations.  It require I/Os
to be page aligned, does not allow speculative prefetch, etc.  The
Uncached I/O does not have those limitations, but instead require
additional memory copy, though still one less than regular cached
I/O.  As such it should fill the gap in between.  Considering this
I've disabled annoying EINVAL errors on misaligned requests, adding
a tunable for those who wants to test their applications.

To pass the information between the layers I had to change a number
of APIs.  But as side effect upper layers can now control not only
the caching, but also speculative prefetch.  I haven't wired it to
VFS yet, since it require looking on some OS specifics.  But while
there I've implemented speculative prefetch of indirect blocks for
Direct I/O, controllable via all the same mechanisms.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes #17027
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-05-13 14:26:55 -07:00
Rob Norris
b2284aedab
metaslab_alloc: make hint BP and DVA const (#17324)
Nothing modifies them, and nothing should, so lets try to enforce that.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
2025-05-12 10:52:46 -07:00
Alexander Motin
49fbdd4533
Introduce zfs rewrite subcommand (#17246)
This allows to rewrite content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also since it is
protected by normal range range locks, it can be done under any
other load.  Also it does not affect file's modification time or
other properties.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-05-12 10:22:17 -07:00
Mariusz Zaborski
8b9c4e643b
spa: clear checkpoint information during retry
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes #17319
2025-05-11 12:49:38 -04:00
Rob Norris
2ee5b51a57
linux/uio: remove "skip" offset for UIO_ITER
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17298
2025-05-11 12:46:40 -04:00
Alan Somers
c17bdc4914
More aggressively assert that db_mtx protects db.db_data
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
db.db_mtx must be held any time that db.db_data is accessed.  All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.

See https://github.com/openzfs/zfs/discussions/17118

* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.

* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.

* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.

Sponsored by:	ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17209
2025-05-09 09:02:26 -04:00
Fedor Uporov
1a8f5ad3b0
zvol: Enable zvol threading functionality on FreeBSD
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Make zvol I/O requests processing asynchronous on FreeBSD side in some
cases. Clone zvol threading logic and required module parameters from
Linux side. Make zvol threadpool creation/destruction logic shared for
both Linux and FreeBSD.
The IO requests are processed asynchronously in next cases:
- volmode=geom: if IO request thread is geom thread or cannot sleep.
- volmode=cdev: if IO request passed thru struct cdevsw .d_strategy
routine, mean is AIO request.
In all other cases the IO requests are processed synchronously. The
volthreading zvol property is ignored on FreeBSD side.

Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17169
2025-05-08 15:25:40 -04:00
Alan Somers
f13d760aa8
Delete dead code: dbuf_loan_arcbuf
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It's been dead ever since 5fa356ea44

Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17119
2025-05-08 10:34:11 -04:00
Rob Norris
4800181b3b
ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat
These are only required to support these ioctls on Linux <4.5. Since
4.18 is our cutoff, we don't need this code anymore.

Also removing related test things that will never match again.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17308
2025-05-08 10:32:52 -04:00
Olivier Certner
78628a5c15
FreeBSD: Use new SYSCTL_SIZEOF()
SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9):
Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in
branch 'main'.  It will soon be backported to 'stable/14'.  We will thus
be able to remove the old, alternate version left in the '#else' branch
as soon as 'stable/13' goes out of support (April 30, 2026).

Sponsored-by:   The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17309
2025-05-08 10:31:43 -04:00
Alexander Motin
b1ccab1721
ARC: Avoid overflows in arc_evict_adj() (#17255)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
With certain combinations of target ARC states balance and ghost
hit rates it was possible to get the fractions outside of allowed
range.  This patch limits maximum balance adjustment speed, which
should make it impossible, and also asserts it.

Fixes #17210
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-05-06 09:31:38 -07:00
Rob Norris
c5a6b4417d
zfs_valstr: update zio_flag strings for ZIO_FLAG_PREALLOCATED
Missed in 246e5883bb.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17306
2025-05-06 11:35:30 -04:00
Paul Dagnelie
246e5883bb
Implement allocation size ranges and use for gang leaves (#17111)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.

We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-05-02 15:32:18 -07:00
Rob Norris
a7de203c86
txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2025-05-02 15:29:50 -07:00
Alexander Motin
f86d9af16b Fix race between resilver wait and offline/detach
We should not clear scn_state and notify waiters until we call
vdev_dtl_reassess(), otherwise following offline/detach request
may fail with "no valid replicas".

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
2025-05-02 15:19:24 -07:00
Tony Hutter
f40ab9e399
Fix double spares for failed vdev
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk.  This commit checks for that condition and disallows it.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #16547
Closes: #17231
2025-05-02 09:03:11 -07:00
Rob Norris
c8fa39b46c
cred: properly pass and test creds on other threads (#17273)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-29 16:27:48 -07:00
Brian Atkinson
7031a48c70
Export correct symbols for Lustre Direct I/O
Originally the Lustre ZFS OSD code was going to use zfs_uio_t structs
for supporting Direct I/O with ZFS. However, this has changed to using
abd_t structs instead. This exports the proper symbols that will be used
by the Lustre ZFS OSD code.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #17256
2025-04-24 13:55:21 -04:00
Paul Dagnelie
5f5321effa
Handle interaction between gang blocks, copies, and FDT.
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes: #17123
2025-04-21 11:26:30 -04:00
Tony Hutter
8d1489735b
nvlist: Add nvlist_snprintf() and zfs_dbgmsg_nvlist()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Add nvlist_snprintf() to print a nvlist to a buffer.  This is basically
the snprintf() version of dump_nvlist().  Along with that, add a
zfs_dbgmsg_nvlist() to print out an nvlist to dbgmsg.  This will aid in
debugging.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #17215
2025-04-18 09:22:16 -04:00
Tony Hutter
155847c72d
GCC 15: Fix unterminated-string-initialization (#17244)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Fix build errors on Fedora 42 like:

  module/zcommon/zfs_valstr.c:193:16: error: initializer-string for
  array of 'char' truncates NUL terminator but destination lacks
  'nonstring' attribute (3 chars into 2 available)

The arrays in zpool_vdev_os.c and zfs_valstr.c don't need to be
NULL terminated, but we do so to make GCC happy.

Closes: #17242

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:33:29 -07:00
Tony Hutter
ab9bb193f9
Linux 6.0 compat: Check for migratepage VFS (#17217)
The 6.0 kernel removes the 'migratepage' VFS op. Check for
migratepage.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org
2025-04-10 17:36:31 -07:00
Alexander Motin
a497c5fc8b
Improve L2 caching control for prefetched indirects
dbuf_prefetch_impl() should look on level of current indirect, not
the target prefetch level.  dbuf_prefetch_indirect_done() should
call dnode_level_is_l2cacheable() if we have dpa_dnode to pass it.
It should fix some both false positive and negative L2ARC caching.

While there, fix redacted feature activation assertions.  One was
always true, while another could give false positive if dpa_dnode
is NULL.

George Amanakis <gamanakis@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17204
2025-04-08 19:43:32 -04:00
Martin Matuška
88e3885cf4
freebsd: unbreak module/Makefile.bsd build on 15-CURRENT-arm64
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
- don't include foreign machine assembly files
- reduce diff to FreeBSD module Makefile

Discovered in FreeBSD port filesystems/openzfs-kmod

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #17219
2025-04-05 19:43:41 -04:00
Paul Dagnelie
b14b3e3985
Fix FDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-04 11:10:44 -07:00
Ameer Hamza
6f6c504700 Show default quotas in zfs userspace tools
Update zfs userspace, groupspace, and projectspace to display the
default quotas when no per-ID specific quota is configured. This
ensures tool outputs align with enforced limits.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:45 -07:00
Ameer Hamza
9cb9a59e1c Report default quotas via kernel interfaces
Ensure default user/group/project quotas are visible through quota
tools and filesystem stats when no per-ID quota is configured. This
maintains consistency between quota visibility and configured defaults.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:38 -07:00
Ameer Hamza
20705a8430 Enforce default quotas when no per-ID quota is set
Update zfs_id_overobjquota() and zfs_id_overblockquota() to enforce
default user/group/project quotas (block and object-based) when no
per-user, per-group, or per-project quota exists. If a specific quota
is not configured for an ID, the default quota value is applied.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:25 -07:00
Ameer Hamza
2a8d9d9607 Add default user/group/project quota properties
This adds default userquota, groupquota, and projectquota properties to
MASTER_NODE_OBJ to make them accessible during zfsvfs_init() (regular
DSL properties require dsl_config_lock, which cannot be safely acquired
in this context). The zfs_fill_zplprops_impl() logic is updated to read
these default properties directly from MASTER_NODE_OBJ.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:35:22 -07:00
Paul Dagnelie
7be9fa259e
Fix nonrot property being incorrectly unset (#17206)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
When opening a vdev and setting the nonrot property, we used to wait for
each child to be opened before examining its nonrot property. When the
change was made to open vdevs asynchronously, we didn't move the nonrot
check out of the main loop. As a result, the nonrot property is almost
always set to false, regardless of the actual type of the underlying
disks. The fix is simply to move the nonrot check to a separate loop
after the taskq has been waited for.

Sponsored-by: Klara, Inc.
Sponsored-by: Eshtek, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2025-04-02 12:11:33 -07:00
Alexander Motin
301da593ad
Fix lock reversal on device removal cancel
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
FreeBSD kernel's WITNESS code detected lock ordering violation in
spa_vdev_remove_cancel_sync().  It took svr_lock while holding
ms_lock, which is opposite to other places.  I was thinking to
resolve it similar to #17145, but looking closer I don't think
we even need svr_lock at that point, since we already asserted
svr_allocd_segs is empty, and we don't need to add there segments
we are going to call free_mapped_segment_cb for.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17164
2025-04-01 09:31:24 -04:00
Paul Dagnelie
367d34b3aa
Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-01 09:23:43 -04:00
Rob Norris
75e921da6f
kstat: silence "maybe uninitialized" warnings
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Firmly in the "shouldn't happen" camp, but at least GCC 7.4 (Ubuntu
18.04) complained about them, and it's easy to shut up, so do so.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17189
2025-03-28 22:25:47 -04:00
Alexander Motin
5b29e70ae1
Remove mg_allocators (#17192)
Previous code allowed each metaslab group to have different number
of allocators.  But in practice it worked only for embedded SLOGs,
relying on a number of conditions and creating a significant mine
field if any of those change.  I just stepped on one myself.

This change makes all groups to have spa_alloc_count allocators.
It may cost us extra 192 bytes of memory per normal top-level vdev
on large systems, but I find it a small price for cleaner and more
reliable code.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes #17188
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
2025-03-28 13:11:10 -07:00
Ameer Hamza
30cc2331f4
zed: Ensure spare activation after kernel-initiated device removal
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
2025-03-28 15:48:38 -04:00
Alexander Motin
4abc21b28c
Block remap for cloned blocks on device removal
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
When after device removal we handle block pointers remap, skip blocks
that might be cloned.  BRTs are indexed by vdev id and offset from
block pointer's DVA[0].  So if we start addressing the same block by
some different DVA, we won't get the proper reference counter.  As
result, we might either remap the block twice, that may result in
assertion during indirect mapping condense, or free it prematurely,
that may result in data overwrite, or free it twice, that may result
in assertion in spacemap code.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15604
Closes #17180
2025-03-26 16:45:34 -07:00
Pavel Snajdr
a0e62718cf
Linux: Fix zfs_prune panics v2 (#17121)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It turns out that approach taken in the original version of the patch
was wrong. So now, we're taking approach in-line with how kernel
actually does it - when sb is being torn down, access to it
is serialized via sb->s_umount rwsem, only when that lock is taken
is it okay to work with s_flags - and the other mistake I was doing
was trying to make SB_ACTIVE work, but apparently the kernel checks
the negative variant - not SB_DYING and not SB_BORN.

Kernels pre-6.6 don't have SB_DYING, but check if sb is hashed
instead.

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-25 15:20:16 -07:00
Alexander Motin
94a3fabcb0
Unified allocation throttling (#17020)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Existing allocation throttling had a goal to improve write speed
by allocating more data to vdevs that are able to write it faster.
But in the process it completely broken the original mechanism,
designed to balance vdev space usage.  With severe vdev space use
imbalance it is possible that some with higher use start growing
fragmentation sooner than others and after getting full will stop
any writes at all.  Also after vdev addition it might take a very
long time for pool to restore the balance, since the new vdev does
not have any real preference, unless the old one is already much
slower due to fragmentation.  Also the old throttling was request-
based, which was unpredictable with block sizes varying from 512B
to 16MB, neither it made much sense in case of I/O aggregation,
when its 32-100 requests could be aggregated into few, leaving
device underutilized, submitting fewer and/or shorter requests,
or in opposite try to queue up to 1.6GB of writes per device.

This change presents a completely new throttling algorithm. Unlike
the request-based old one, this one measures allocation queue in
bytes.  It makes possible to integrate with the reworked allocation
quota (aliquot) mechanism, which is also byte-based.  Unlike the
original code, balancing the vdevs amounts of free space, this one
balances their free/used space fractions.  It should result in a
lower and more uniform fragmentation in a long run.

This algorithm still allows to improve write speed by allocating
more data to faster vdevs, but does it in more controllable way.
On top of space-based allocation quota, it also calculates minimum
queue depth that vdev is allowed to maintain, and respectively the
amount of extra allocations it can receive if it appear faster.
That amount is based on vdev's capacity and space usage, but also
applied only when the pool is busy.  This way the code can choose
between faster writes when needed and better vdev balance when not,
with the choice gradually reducing together with the free space.

This change also makes allocation queues per-class, allowing them
to throttle independently and in parallel.  Allocations that are
bounced between classes due to allocation errors will be able to
properly throttle in the new class.  Allocations that should not
be throttled (ZIL, gang, copies) are not, but may still follow
the rotor and allocation quota mechanism of the class without
disrupting it.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
2025-03-24 09:25:01 -07:00
Rob Norris
45e9b54e9e freebsd/kstat: allow multi-level module names
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
This extends the existing special-case for zfs/poolname to split and
create any number of intermediate sysctl names, so that multi-level
module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-20 16:25:32 -07:00
Rob Norris
d28d2e3007 linux/kstat: allow multi-level module names
Module names are mapped directly to directory names in procfs, but
nothing is done to create the intermediate directories, or remove them.
This makes it impossible to sensibly present kstats about sub-objects.

This commit loops through '/'-separated names in the full module name,
creates a separate module for each, and hooks them up with a parent
pointer and child counter, and then unrolls this on the other side when
deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-20 16:24:50 -07:00
Paul Dagnelie
9250403ba6
Make ganging redundancy respect redundant_metadata property (#17073)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.

We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-19 15:58:29 -07:00
Alexander Motin
676b7ef104
Fix deadlock on I/O errors during device removal
spa_vdev_remove_thread() should not hold svr_lock while loading a
metaslab.  It may block ZIO threads, required to handle metaslab
loading, at least in case of read errors causing recovery writes.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17145
2025-03-19 14:48:47 -04:00
aokblast
83fa051ceb
spl_vfs: fix vrele task runner signature mismatch
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17101
2025-03-19 11:26:45 -04:00
Alan Somers
d033f26765
Always perform bounds-checking in metaslab_free_concrete
The vd->vdev_ms access can overflow due to on-disk corruption, not just
due to programming bugs.  So it makes sense to check its boundaries even
in production builds.

Sponsored by:	ConnectWise
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17136
2025-03-19 11:24:43 -04:00
Alexander Motin
3cd9934a48
Some arc_release() cleanup
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
- Don't drop L2ARC header if we have more buffers in this header.
Since we leave them the header, leave them the L2ARC header also.
Honestly we are not required to drop it even if there are no other
buffers, but then we'd need to allocate it a separate header, which
we might drop soon if the old block is really deleted.  Multiple
buffers in a header likely mean active snapshots or dedup, so we
know that the block in L2ARC will remain valid.  It might be rare,
but why not?
 - Remove some impossible assertions and conditions.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17126
2025-03-18 21:25:50 -04:00
Rob Norris
f69631992d
dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143)
This helps to avoids confusion with the similarly-named
txg_wait_synced().

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-18 16:04:22 -07:00
Rob Norris
a8847a7e4f SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain
SPDX have repeatedly rejected the creation of a tag for a public domain
dedication, as not all dedications are clear and unambiguious in their
meaning and not all jurisdictions permit relinquishing a copyright
anyway.

A reasonably common workaround appears to be to create a local
(project-specific) identifier to convey whatever meaning the project
wishes it to. To cover OpenZFS' use of third-party code with a public
domain dedication, we use this custom tag.

Further reading:
- https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md
- https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/
- https://cr.yp.to/spdx.html

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:31 -07:00
Rob Norris
bf754d6010 SPDX: license tags: OpenSSL-standalone
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:27 -07:00
Rob Norris
44dc09614e SPDX: license tags: Brian-Gladman-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:24 -07:00
Rob Norris
9fd7401d75 SPDX: license tags: BSD-2-Clause OR GPL-2.0-only
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:21 -07:00
Rob Norris
ff28d9fc6f SPDX: license tags: BSD-3-Clause OR GPL-2.0-only
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:17 -07:00
Rob Norris
f83431b3bd SPDX: license tags: GPL-2.0-or-later
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:09 -07:00
Rob Norris
c404ffb446 SPDX: license tags: Apache-2.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:05 -07:00
Rob Norris
7d8dd8d9a5 SPDX: license tags: MIT
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:54 -07:00
Rob Norris
4eafa9e5e8 SPDX: license tags: BSD-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:50 -07:00
Rob Norris
137045be98 SPDX: license tags: BSD-2-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:46 -07:00
Rob Norris
eb9098ed47 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:27 -07:00
shodanshok
201d262949
Add receive:append permission for limited receive
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means an user having the receive permission can effectively
delete data on receiving side, even if such user does not have explicit
rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.

Fixes #16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17015
2025-03-13 13:54:14 -04:00
Paul Dagnelie
1b495eeab3
FDT dedup log sync -- remove incremental
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Don Brady <don.brady@klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17038
2025-03-13 13:47:03 -04:00
Alexander Motin
0ea44e576b
Fix deduplication of overridden blocks
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup() to not by mistake free previous
pruned incarnation of the entry.  But when writing a new block in
zio_ddt_write() we might have the DVAs only from override pointer,
which may never have "D" flag to be confused with pruned DDT entry,
and we'll abandon those DVAs if we find a matching entry in DDT.

This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17120
2025-03-13 13:27:57 -04:00
Alexander Motin
09f4dd06c3
Prefer embedded blocks to dedup
Since embedded blocks introduction 11 years ago, their writing was
blocked if dedup is enabled.  After searching through the modern
code I see no reason for this restriction to exist.  Same time
embedded blocks are dramatically cheaper.  Even regular write of
so small blocks would likely be cheaper than deduplication, even
if the last is successful, not mentioning otherwise.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17113
2025-03-13 13:27:15 -04:00
Rob Norris
13ec35ce3b
Linux/vnops: implement STATX_DIOALIGN
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16972
2025-03-13 13:15:14 -04:00
Alan Somers
0433523ca2
Verify every block pointer is either embedded, hole, or has a valid DVA
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: b24ab67627

Fixes		#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17078
2025-03-13 13:07:48 -04:00
Alexander Motin
fe674998bb
Check portable objset MAC even if local is zeroed
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
PR #14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors
if local MAC can not be calculated at the time.  But it does not
mean we should also ignore portable MAC errors there.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17122
2025-03-08 21:15:11 -05:00
Rob Norris
7f05fface3 gcm_avx_init: zero the ghash state after hashing the IV
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
IVs != 96 bits get hashed with GHASH to bring them to 96 bits. Any call
to GHASH will mix the ghash state in gcm_ghash. This is expected to be
zero at first use in an encrypt or decrypt operation, so it needs to be
zeroed after using GHASH in setup.

gcm_init() does this, but gcm_avx_init() zeroed it before setup, not
after, resulting in incorrect encrypt/decrypt results when using AVX GCM
with an IV != 96 bits.

OpenZFS _always_ uses a 96 bit IV (ZIO_DATA_IV_LEN) so this will never
have been hit in any real-world use, which is extremely fortunate, as we
would have incorrectly-encrypted data on-disk. Still, as long as we have
this code here we should make sure it's correct.

Thanks-to: Joel Low <joel@joelsplace.sg>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
2025-02-25 17:31:08 -08:00
Ameer Hamza
ab3db6d15d
arc: avoid possible deadlock in arc_read
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
2025-02-25 14:32:12 -05:00
Paul Dagnelie
701093c44f
Don't try to get mg of hole vdev in removal
Don't try to get mg of hole vdev in removal

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17080
2025-02-25 14:30:51 -05:00
aokblast
a5fb5c55be
spa: fix signature mismatch for spa_boot_init as eventhandler required
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17088
2025-02-25 14:28:57 -05:00
Alexander Motin
d7d2744711
Better fill empty metaslabs
Before this change zfs_metaslab_switch_threshold tunable switched
metaslabs each time ones index reduced by two (which means biggest
contiguous chunk reduced to 1/4).  It is a good idea to balance
metaslabs fragmentation.  But for empty metaslabs (having power-
of-2 sizes) this means switching when they get just below the half
of their capacity.  Inspection with zdb after filling new pool to
half capacity shown most of its metaslabs filled to half capacity.
I consider this sub-optimal for pool fragmentation in a long run.

This change blocks the metaslabs switching if most of the metaslab
free space (15/16) is represented by a single contiguous range.
Such metaslab should not be considered fragmented until it actually
fail some big allocation.  More contiguous filling should improve
data locality and increase time before previously filled and
partially freed metaslab is touched again, giving it more time to
free more contiguous chunks for lower fragmentation.  It should
also slightly reduce spacemap traffic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17081
2025-02-25 14:26:34 -05:00
Rob Norris
ee8803adc2
vdev_file: make FLUSH and TRIM asynchronous
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the
zio_taskq thread is active and blocked both while waiting for the IO
call and then while calling zio_execute() for the next stage. This is a
particular issue for FLUSH, as the z_flush_iss queue typically only has
one thread; multiple flushes arriving at once can cause long delays if
the underlying fsync() response is particularly slow.

To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq,
just as we do for reads and writes. Further, we return all results
through zio_interrupt(), so neither the issue nor the file taskqs are
blocked.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17064
2025-02-22 14:16:54 -05:00
Chunwei Chen
682c5f6a0a
Fix wrong free function in arc_hdr_decrypt
Need to use arc_free_data_abd to free abd type buffer.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #17079
2025-02-22 13:50:33 -05:00
Rob Norris
c43df8bbbf
vdev_file: unify FreeBSD and Linux implementations (#17046)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.

The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would be normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2025-02-20 10:42:42 -08:00