Commit Graph

5019 Commits

Author SHA1 Message Date
Alexander Motin
49fbdd4533
Introduce zfs rewrite subcommand (#17246)
This allows to rewrite content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values.  It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing.  Also since it is
protected by normal range range locks, it can be done under any
other load.  Also it does not affect file's modification time or
other properties.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-05-12 10:22:17 -07:00
Mariusz Zaborski
8b9c4e643b
spa: clear checkpoint information during retry
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mariusz Zaborski <oshogbo@FreeBSD.org>
Closes #17319
2025-05-11 12:49:38 -04:00
Rob Norris
2ee5b51a57
linux/uio: remove "skip" offset for UIO_ITER
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17298
2025-05-11 12:46:40 -04:00
Alan Somers
c17bdc4914
More aggressively assert that db_mtx protects db.db_data
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
db.db_mtx must be held any time that db.db_data is accessed.  All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.

See https://github.com/openzfs/zfs/discussions/17118

* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.

* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.

* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.

Sponsored by:	ConnectWise
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #17209
2025-05-09 09:02:26 -04:00
Fedor Uporov
1a8f5ad3b0
zvol: Enable zvol threading functionality on FreeBSD
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Make zvol I/O requests processing asynchronous on FreeBSD side in some
cases. Clone zvol threading logic and required module parameters from
Linux side. Make zvol threadpool creation/destruction logic shared for
both Linux and FreeBSD.
The IO requests are processed asynchronously in next cases:
- volmode=geom: if IO request thread is geom thread or cannot sleep.
- volmode=cdev: if IO request passed thru struct cdevsw .d_strategy
routine, mean is AIO request.
In all other cases the IO requests are processed synchronously. The
volthreading zvol property is ignored on FreeBSD side.

Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack@gmail.com>
Closes #17169
2025-05-08 15:25:40 -04:00
Alan Somers
f13d760aa8
Delete dead code: dbuf_loan_arcbuf
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It's been dead ever since 5fa356ea44

Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17119
2025-05-08 10:34:11 -04:00
Rob Norris
4800181b3b
ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat
These are only required to support these ioctls on Linux <4.5. Since
4.18 is our cutoff, we don't need this code anymore.

Also removing related test things that will never match again.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17308
2025-05-08 10:32:52 -04:00
Olivier Certner
78628a5c15
FreeBSD: Use new SYSCTL_SIZEOF()
SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9):
Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in
branch 'main'.  It will soon be backported to 'stable/14'.  We will thus
be able to remove the old, alternate version left in the '#else' branch
as soon as 'stable/13' goes out of support (April 30, 2026).

Sponsored-by:   The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olivier Certner <olce@FreeBSD.org>
Closes #17309
2025-05-08 10:31:43 -04:00
Alexander Motin
b1ccab1721
ARC: Avoid overflows in arc_evict_adj() (#17255)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
With certain combinations of target ARC states balance and ghost
hit rates it was possible to get the fractions outside of allowed
range.  This patch limits maximum balance adjustment speed, which
should make it impossible, and also asserts it.

Fixes #17210
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-05-06 09:31:38 -07:00
Rob Norris
c5a6b4417d
zfs_valstr: update zio_flag strings for ZIO_FLAG_PREALLOCATED
Missed in 246e5883bb.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #17306
2025-05-06 11:35:30 -04:00
Paul Dagnelie
246e5883bb
Implement allocation size ranges and use for gang leaves (#17111)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.

We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-05-02 15:32:18 -07:00
Rob Norris
a7de203c86
txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.

This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.

Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.

The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2025-05-02 15:29:50 -07:00
Alexander Motin
f86d9af16b Fix race between resilver wait and offline/detach
We should not clear scn_state and notify waiters until we call
vdev_dtl_reassess(), otherwise following offline/detach request
may fail with "no valid replicas".

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
2025-05-02 15:19:24 -07:00
Tony Hutter
f40ab9e399
Fix double spares for failed vdev
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk.  This commit checks for that condition and disallows it.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #16547
Closes: #17231
2025-05-02 09:03:11 -07:00
Rob Norris
c8fa39b46c
cred: properly pass and test creds on other threads (#17273)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
### Background

Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.

Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.

However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
instead, we capture the permissions by saving CRED() in the user task,
and then using it for the check through the secpolicy_* functions.

### Current implementation

The current code calls CRED() to get the credential, which gets a
pointer to the cred_t inside the current task and passes it to the
worker task. However, it doesn't take a reference to the cred_t, and so
expects that it won't change, and that the task continues to exist. In
practice that is always the case, because we don't let the calling task
return from the kernel until the work is done.

For Linux, we also take a reference to the current task, because the
Linux credential APIs for the most part do not check an arbitrary
credential, but rather, query what a task can do. See
secpolicy_zfs_proc(). Again, we don't take a reference on the task, just
a pointer to it.

### Changes

We change to calling crhold() on the task credential, and crfree() when
we're done with it. This ensures it stays alive and unchanged for the
duration of the call.

On the Linux side, we change the main policy checking function
priv_policy_ns() to use override_creds()/revert_creds() if necessary to
make the provided credential active in the current task, allowing the
standard task-permission APIs to do the needed check. Since the task
pointer is no longer required, this lets us entirely remove
secpolicy_zfs_proc() and the need to carry a task pointer around as
well.

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-29 16:27:48 -07:00
Brian Atkinson
7031a48c70
Export correct symbols for Lustre Direct I/O
Originally the Lustre ZFS OSD code was going to use zfs_uio_t structs
for supporting Direct I/O with ZFS. However, this has changed to using
abd_t structs instead. This exports the proper symbols that will be used
by the Lustre ZFS OSD code.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #17256
2025-04-24 13:55:21 -04:00
Paul Dagnelie
5f5321effa
Handle interaction between gang blocks, copies, and FDT.
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.

This change modifies updates of existing FDT; if there are already gang
DVAs in the FDT, we prevent the new write from extending the DDT
entry. We cannot safely mix different gang trees in one block
pointer. if there are non-gang DVAs in the FDT, then this allocation may
not be gangs. If it would gang, we have to redo the whole write as a
non-dedup write.

This change also fixes a refcount leak that could occur if the lead DDT
write failed.

Sponsored by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes: #17123
2025-04-21 11:26:30 -04:00
Tony Hutter
8d1489735b
nvlist: Add nvlist_snprintf() and zfs_dbgmsg_nvlist()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Add nvlist_snprintf() to print a nvlist to a buffer.  This is basically
the snprintf() version of dump_nvlist().  Along with that, add a
zfs_dbgmsg_nvlist() to print out an nvlist to dbgmsg.  This will aid in
debugging.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #17215
2025-04-18 09:22:16 -04:00
Tony Hutter
155847c72d
GCC 15: Fix unterminated-string-initialization (#17244)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Fix build errors on Fedora 42 like:

  module/zcommon/zfs_valstr.c:193:16: error: initializer-string for
  array of 'char' truncates NUL terminator but destination lacks
  'nonstring' attribute (3 chars into 2 available)

The arrays in zpool_vdev_os.c and zfs_valstr.c don't need to be
NULL terminated, but we do so to make GCC happy.

Closes: #17242

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-04-16 09:33:29 -07:00
Tony Hutter
ab9bb193f9
Linux 6.0 compat: Check for migratepage VFS (#17217)
The 6.0 kernel removes the 'migratepage' VFS op. Check for
migratepage.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org
2025-04-10 17:36:31 -07:00
Alexander Motin
a497c5fc8b
Improve L2 caching control for prefetched indirects
dbuf_prefetch_impl() should look on level of current indirect, not
the target prefetch level.  dbuf_prefetch_indirect_done() should
call dnode_level_is_l2cacheable() if we have dpa_dnode to pass it.
It should fix some both false positive and negative L2ARC caching.

While there, fix redacted feature activation assertions.  One was
always true, while another could give false positive if dpa_dnode
is NULL.

George Amanakis <gamanakis@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17204
2025-04-08 19:43:32 -04:00
Martin Matuška
88e3885cf4
freebsd: unbreak module/Makefile.bsd build on 15-CURRENT-arm64
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
- don't include foreign machine assembly files
- reduce diff to FreeBSD module Makefile

Discovered in FreeBSD port filesystems/openzfs-kmod

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #17219
2025-04-05 19:43:41 -04:00
Paul Dagnelie
b14b3e3985
Fix FDT rollback to not overwrite unnecessary fields (#17205)
When a dedup write fails, we try to roll the DDT entry back to a known
good state. However, this also rolls the refcounts and the last-update
time back to the state they were at when we started this write. This
doesn't appear to be able to cause any refcount leaks (after the fix in
17123). This PR prevents that from happening by only rolling back the
parts of the DDT entry that have been updated by the write so far.

Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-04 11:10:44 -07:00
Ameer Hamza
6f6c504700 Show default quotas in zfs userspace tools
Update zfs userspace, groupspace, and projectspace to display the
default quotas when no per-ID specific quota is configured. This
ensures tool outputs align with enforced limits.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:45 -07:00
Ameer Hamza
9cb9a59e1c Report default quotas via kernel interfaces
Ensure default user/group/project quotas are visible through quota
tools and filesystem stats when no per-ID quota is configured. This
maintains consistency between quota visibility and configured defaults.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:38 -07:00
Ameer Hamza
20705a8430 Enforce default quotas when no per-ID quota is set
Update zfs_id_overobjquota() and zfs_id_overblockquota() to enforce
default user/group/project quotas (block and object-based) when no
per-user, per-group, or per-project quota exists. If a specific quota
is not configured for an ID, the default quota value is applied.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:36:25 -07:00
Ameer Hamza
2a8d9d9607 Add default user/group/project quota properties
This adds default userquota, groupquota, and projectquota properties to
MASTER_NODE_OBJ to make them accessible during zfsvfs_init() (regular
DSL properties require dsl_config_lock, which cannot be safely acquired
in this context). The zfs_fill_zplprops_impl() logic is updated to read
these default properties directly from MASTER_NODE_OBJ.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-04-03 10:35:22 -07:00
Paul Dagnelie
7be9fa259e
Fix nonrot property being incorrectly unset (#17206)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
When opening a vdev and setting the nonrot property, we used to wait for
each child to be opened before examining its nonrot property. When the
change was made to open vdevs asynchronously, we didn't move the nonrot
check out of the main loop. As a result, the nonrot property is almost
always set to false, regardless of the actual type of the underlying
disks. The fix is simply to move the nonrot check to a separate loop
after the taskq has been waited for.

Sponsored-by: Klara, Inc.
Sponsored-by: Eshtek, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
2025-04-02 12:11:33 -07:00
Alexander Motin
301da593ad
Fix lock reversal on device removal cancel
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
FreeBSD kernel's WITNESS code detected lock ordering violation in
spa_vdev_remove_cancel_sync().  It took svr_lock while holding
ms_lock, which is opposite to other places.  I was thinking to
resolve it similar to #17145, but looking closer I don't think
we even need svr_lock at that point, since we already asserted
svr_allocd_segs is empty, and we don't need to add there segments
we are going to call free_mapped_segment_cb for.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17164
2025-04-01 09:31:24 -04:00
Paul Dagnelie
367d34b3aa
Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same.  Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17183
2025-04-01 09:23:43 -04:00
Rob Norris
75e921da6f
kstat: silence "maybe uninitialized" warnings
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Firmly in the "shouldn't happen" camp, but at least GCC 7.4 (Ubuntu
18.04) complained about them, and it's easy to shut up, so do so.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17189
2025-03-28 22:25:47 -04:00
Alexander Motin
5b29e70ae1
Remove mg_allocators (#17192)
Previous code allowed each metaslab group to have different number
of allocators.  But in practice it worked only for embedded SLOGs,
relying on a number of conditions and creating a significant mine
field if any of those change.  I just stepped on one myself.

This change makes all groups to have spa_alloc_count allocators.
It may cost us extra 192 bytes of memory per normal top-level vdev
on large systems, but I find it a small price for cleaner and more
reliable code.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Fixes #17188
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
2025-03-28 13:11:10 -07:00
Ameer Hamza
30cc2331f4
zed: Ensure spare activation after kernel-initiated device removal
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17187
2025-03-28 15:48:38 -04:00
Alexander Motin
4abc21b28c
Block remap for cloned blocks on device removal
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
When after device removal we handle block pointers remap, skip blocks
that might be cloned.  BRTs are indexed by vdev id and offset from
block pointer's DVA[0].  So if we start addressing the same block by
some different DVA, we won't get the proper reference counter.  As
result, we might either remap the block twice, that may result in
assertion during indirect mapping condense, or free it prematurely,
that may result in data overwrite, or free it twice, that may result
in assertion in spacemap code.

Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15604
Closes #17180
2025-03-26 16:45:34 -07:00
Pavel Snajdr
a0e62718cf
Linux: Fix zfs_prune panics v2 (#17121)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
It turns out that approach taken in the original version of the patch
was wrong. So now, we're taking approach in-line with how kernel
actually does it - when sb is being torn down, access to it
is serialized via sb->s_umount rwsem, only when that lock is taken
is it okay to work with s_flags - and the other mistake I was doing
was trying to make SB_ACTIVE work, but apparently the kernel checks
the negative variant - not SB_DYING and not SB_BORN.

Kernels pre-6.6 don't have SB_DYING, but check if sb is hashed
instead.

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-25 15:20:16 -07:00
Alexander Motin
94a3fabcb0
Unified allocation throttling (#17020)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Existing allocation throttling had a goal to improve write speed
by allocating more data to vdevs that are able to write it faster.
But in the process it completely broken the original mechanism,
designed to balance vdev space usage.  With severe vdev space use
imbalance it is possible that some with higher use start growing
fragmentation sooner than others and after getting full will stop
any writes at all.  Also after vdev addition it might take a very
long time for pool to restore the balance, since the new vdev does
not have any real preference, unless the old one is already much
slower due to fragmentation.  Also the old throttling was request-
based, which was unpredictable with block sizes varying from 512B
to 16MB, neither it made much sense in case of I/O aggregation,
when its 32-100 requests could be aggregated into few, leaving
device underutilized, submitting fewer and/or shorter requests,
or in opposite try to queue up to 1.6GB of writes per device.

This change presents a completely new throttling algorithm. Unlike
the request-based old one, this one measures allocation queue in
bytes.  It makes possible to integrate with the reworked allocation
quota (aliquot) mechanism, which is also byte-based.  Unlike the
original code, balancing the vdevs amounts of free space, this one
balances their free/used space fractions.  It should result in a
lower and more uniform fragmentation in a long run.

This algorithm still allows to improve write speed by allocating
more data to faster vdevs, but does it in more controllable way.
On top of space-based allocation quota, it also calculates minimum
queue depth that vdev is allowed to maintain, and respectively the
amount of extra allocations it can receive if it appear faster.
That amount is based on vdev's capacity and space usage, but also
applied only when the pool is busy.  This way the code can choose
between faster writes when needed and better vdev balance when not,
with the choice gradually reducing together with the free space.

This change also makes allocation queues per-class, allowing them
to throttle independently and in parallel.  Allocations that are
bounced between classes due to allocation errors will be able to
properly throttle in the new class.  Allocations that should not
be throttled (ZIL, gang, copies) are not, but may still follow
the rotor and allocation quota mechanism of the class without
disrupting it.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
2025-03-24 09:25:01 -07:00
Rob Norris
45e9b54e9e freebsd/kstat: allow multi-level module names
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
This extends the existing special-case for zfs/poolname to split and
create any number of intermediate sysctl names, so that multi-level
module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-20 16:25:32 -07:00
Rob Norris
d28d2e3007 linux/kstat: allow multi-level module names
Module names are mapped directly to directory names in procfs, but
nothing is done to create the intermediate directories, or remove them.
This makes it impossible to sensibly present kstats about sub-objects.

This commit loops through '/'-separated names in the full module name,
creates a separate module for each, and hooks them up with a parent
pointer and child counter, and then unrolls this on the other side when
deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-20 16:24:50 -07:00
Paul Dagnelie
9250403ba6
Make ganging redundancy respect redundant_metadata property (#17073)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.

We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-19 15:58:29 -07:00
Alexander Motin
676b7ef104
Fix deadlock on I/O errors during device removal
spa_vdev_remove_thread() should not hold svr_lock while loading a
metaslab.  It may block ZIO threads, required to handle metaslab
loading, at least in case of read errors causing recovery writes.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17145
2025-03-19 14:48:47 -04:00
aokblast
83fa051ceb
spl_vfs: fix vrele task runner signature mismatch
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17101
2025-03-19 11:26:45 -04:00
Alan Somers
d033f26765
Always perform bounds-checking in metaslab_free_concrete
The vd->vdev_ms access can overflow due to on-disk corruption, not just
due to programming bugs.  So it makes sense to check its boundaries even
in production builds.

Sponsored by:	ConnectWise
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17136
2025-03-19 11:24:43 -04:00
Alexander Motin
3cd9934a48
Some arc_release() cleanup
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
- Don't drop L2ARC header if we have more buffers in this header.
Since we leave them the header, leave them the L2ARC header also.
Honestly we are not required to drop it even if there are no other
buffers, but then we'd need to allocate it a separate header, which
we might drop soon if the old block is really deleted.  Multiple
buffers in a header likely mean active snapshots or dedup, so we
know that the block in L2ARC will remain valid.  It might be rare,
but why not?
 - Remove some impossible assertions and conditions.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17126
2025-03-18 21:25:50 -04:00
Rob Norris
f69631992d
dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143)
This helps to avoids confusion with the similarly-named
txg_wait_synced().

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-03-18 16:04:22 -07:00
Rob Norris
a8847a7e4f SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain
SPDX have repeatedly rejected the creation of a tag for a public domain
dedication, as not all dedications are clear and unambiguious in their
meaning and not all jurisdictions permit relinquishing a copyright
anyway.

A reasonably common workaround appears to be to create a local
(project-specific) identifier to convey whatever meaning the project
wishes it to. To cover OpenZFS' use of third-party code with a public
domain dedication, we use this custom tag.

Further reading:
- https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md
- https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/
- https://cr.yp.to/spdx.html

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:31 -07:00
Rob Norris
bf754d6010 SPDX: license tags: OpenSSL-standalone
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:27 -07:00
Rob Norris
44dc09614e SPDX: license tags: Brian-Gladman-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:24 -07:00
Rob Norris
9fd7401d75 SPDX: license tags: BSD-2-Clause OR GPL-2.0-only
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:21 -07:00
Rob Norris
ff28d9fc6f SPDX: license tags: BSD-3-Clause OR GPL-2.0-only
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:17 -07:00
Rob Norris
f83431b3bd SPDX: license tags: GPL-2.0-or-later
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:09 -07:00
Rob Norris
c404ffb446 SPDX: license tags: Apache-2.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:57:05 -07:00
Rob Norris
7d8dd8d9a5 SPDX: license tags: MIT
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:54 -07:00
Rob Norris
4eafa9e5e8 SPDX: license tags: BSD-3-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:50 -07:00
Rob Norris
137045be98 SPDX: license tags: BSD-2-Clause
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:46 -07:00
Rob Norris
eb9098ed47 SPDX: license tags: CDDL-1.0
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
2025-03-13 17:56:27 -07:00
shodanshok
201d262949
Add receive:append permission for limited receive
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Force receive (zfs receive -F) can rollback or destroy snapshots and
file systems that do not exist on the sending side (see zfs-receive man
page). This means an user having the receive permission can effectively
delete data on receiving side, even if such user does not have explicit
rollback or destroy permissions.

This patch adds the receive:append permission, which only permits
limited, non-forced receive. Behavior for users with full receive
permission is not changed in any way.

Fixes #16943
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #17015
2025-03-13 13:54:14 -04:00
Paul Dagnelie
1b495eeab3
FDT dedup log sync -- remove incremental
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Don Brady <don.brady@klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17038
2025-03-13 13:47:03 -04:00
Alexander Motin
0ea44e576b
Fix deduplication of overridden blocks
Implementation of DDT pruning introduced verification of DVAs in
a block pointer during ddt_lookup() to not by mistake free previous
pruned incarnation of the entry.  But when writing a new block in
zio_ddt_write() we might have the DVAs only from override pointer,
which may never have "D" flag to be confused with pruned DDT entry,
and we'll abandon those DVAs if we find a matching entry in DDT.

This fixes deduplication for blocks written via dmu_sync() for
purposes of indirect ZIL write records, that I have tested.  And
I suspect it might actually allow deduplication for Direct I/O,
even though in an odd way -- first write block directly and then
delete it later during TXG commit if found duplicate, which part
I haven't tested.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17120
2025-03-13 13:27:57 -04:00
Alexander Motin
09f4dd06c3
Prefer embedded blocks to dedup
Since embedded blocks introduction 11 years ago, their writing was
blocked if dedup is enabled.  After searching through the modern
code I see no reason for this restriction to exist.  Same time
embedded blocks are dramatically cheaper.  Even regular write of
so small blocks would likely be cheaper than deduplication, even
if the last is successful, not mentioning otherwise.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17113
2025-03-13 13:27:15 -04:00
Rob Norris
13ec35ce3b
Linux/vnops: implement STATX_DIOALIGN
This statx(2) mask returns the alignment restrictions for O_DIRECT
access on the given file.

We're expected to return both memory and IO alignment. For memory, it's
always PAGE_SIZE. For IO, we return the current block size for the file,
which is the required alignment for an arbitrary block, and for the
first block we'll fall back to the ARC when necessary, so it should
always work.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16972
2025-03-13 13:15:14 -04:00
Alan Somers
0433523ca2
Verify every block pointer is either embedded, hole, or has a valid DVA
Now instead of crashing when attempting to read the corrupt block
pointer, ZFS will return ECKSUM, in a stack that looks like this:

```
none:set-error
zfs.ko`arc_read+0x1d82
zfs.ko`dbuf_read+0xa8c
zfs.ko`dmu_buf_hold_array_by_dnode+0x292
zfs.ko`dmu_read_uio_dnode+0x47
zfs.ko`zfs_read+0x2d5
zfs.ko`zfs_freebsd_read+0x7b
kernel`VOP_READ_APV+0xd0
kernel`vn_read+0x20e
kernel`vn_io_fault_doio+0x45
kernel`vn_io_fault1+0x15e
kernel`vn_io_fault+0x150
kernel`dofileread+0x80
kernel`sys_read+0xb7
kernel`amd64_syscall+0x424
kernel`0xffffffff810633cb
```

This patch should hopefully also prevent such corrupt block pointers
from being written to disk in the first place.

And in zdb, don't crash when printing a block pointer with no valid
DVAs.  If a block pointer isn't embedded yet doesn't have any valid
DVAs, that's a data corruption bug.  zdb should be able to handle the
situation gracefully.

Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR.  This
check, which compares the asizes of two different DVAs within the same
BP, was added by illumos-gate commit b24ab67[^1], and I can't understand
why.  It doesn't appear to do anything useful, so remove it.

[^1]: b24ab67627

Fixes		#17077
Sponsored by:	ConnectWise
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Closes #17078
2025-03-13 13:07:48 -04:00
Alexander Motin
fe674998bb
Check portable objset MAC even if local is zeroed
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
PR #14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors
if local MAC can not be calculated at the time.  But it does not
mean we should also ignore portable MAC errors there.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17122
2025-03-08 21:15:11 -05:00
Rob Norris
7f05fface3 gcm_avx_init: zero the ghash state after hashing the IV
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
IVs != 96 bits get hashed with GHASH to bring them to 96 bits. Any call
to GHASH will mix the ghash state in gcm_ghash. This is expected to be
zero at first use in an encrypt or decrypt operation, so it needs to be
zeroed after using GHASH in setup.

gcm_init() does this, but gcm_avx_init() zeroed it before setup, not
after, resulting in incorrect encrypt/decrypt results when using AVX GCM
with an IV != 96 bits.

OpenZFS _always_ uses a 96 bit IV (ZIO_DATA_IV_LEN) so this will never
have been hit in any real-world use, which is extremely fortunate, as we
would have incorrectly-encrypted data on-disk. Still, as long as we have
this code here we should make sure it's correct.

Thanks-to: Joel Low <joel@joelsplace.sg>
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Attila Fülöp <attila@fueloep.org>
2025-02-25 17:31:08 -08:00
Ameer Hamza
ab3db6d15d
arc: avoid possible deadlock in arc_read
In l2arc_evict(), the config lock may be acquired in reverse order
(e.g., first the config lock (writer), then a hash lock) unlike in
arc_read() during scenarios like L2ARC device removal. To avoid
deadlocks, if the attempt to acquire the config lock (reader) fails
in arc_read(), release the hash lock, wait for the config lock, and
retry from the beginning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #17071
2025-02-25 14:32:12 -05:00
Paul Dagnelie
701093c44f
Don't try to get mg of hole vdev in removal
Don't try to get mg of hole vdev in removal

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17080
2025-02-25 14:30:51 -05:00
aokblast
a5fb5c55be
spa: fix signature mismatch for spa_boot_init as eventhandler required
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org>
Closes #17088
2025-02-25 14:28:57 -05:00
Alexander Motin
d7d2744711
Better fill empty metaslabs
Before this change zfs_metaslab_switch_threshold tunable switched
metaslabs each time ones index reduced by two (which means biggest
contiguous chunk reduced to 1/4).  It is a good idea to balance
metaslabs fragmentation.  But for empty metaslabs (having power-
of-2 sizes) this means switching when they get just below the half
of their capacity.  Inspection with zdb after filling new pool to
half capacity shown most of its metaslabs filled to half capacity.
I consider this sub-optimal for pool fragmentation in a long run.

This change blocks the metaslabs switching if most of the metaslab
free space (15/16) is represented by a single contiguous range.
Such metaslab should not be considered fragmented until it actually
fail some big allocation.  More contiguous filling should improve
data locality and increase time before previously filled and
partially freed metaslab is touched again, giving it more time to
free more contiguous chunks for lower fragmentation.  It should
also slightly reduce spacemap traffic.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #17081
2025-02-25 14:26:34 -05:00
Rob Norris
ee8803adc2
vdev_file: make FLUSH and TRIM asynchronous
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the
zio_taskq thread is active and blocked both while waiting for the IO
call and then while calling zio_execute() for the next stage. This is a
particular issue for FLUSH, as the z_flush_iss queue typically only has
one thread; multiple flushes arriving at once can cause long delays if
the underlying fsync() response is particularly slow.

To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq,
just as we do for reads and writes. Further, we return all results
through zio_interrupt(), so neither the issue nor the file taskqs are
blocked.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17064
2025-02-22 14:16:54 -05:00
Chunwei Chen
682c5f6a0a
Fix wrong free function in arc_hdr_decrypt
Need to use arc_free_data_abd to free abd type buffer.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #17079
2025-02-22 13:50:33 -05:00
Rob Norris
c43df8bbbf
vdev_file: unify FreeBSD and Linux implementations (#17046)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.

The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would be normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2025-02-20 10:42:42 -08:00
Alexander Motin
6a2f7b3844
Fix metaslab group fragmentation math (#17037)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Since we are calculating a free space fragmentation, we should
weight metaslabs by the amount of their free space, not a full
size.  Fragmentation of full metaslabs may not matter in presence
empty ones.  The old algorithm did not differentiate metaslabs
having only one free 4KB block from metaslabs having 50% of space
free in 4KB blocks, reporting higher fragmentation.

While there, move metaslab_group_alloc_update() call after setting
mg_fragmentation, otherwise the effect may be delayed by one TXG.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-02-18 10:45:42 -08:00
Rob Norris
68473c4fd8 range_tree: convert remaining range_* defs to zfs_range_*
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-14 15:37:56 -08:00
Ivan Volosyuk
d4a5a7e3aa Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_*
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.

Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
2025-02-14 15:37:48 -08:00
vandanarungta
db62886d98
Free memory in an error path in spl-kmem-cache.c
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
skc->skc_name also needs to be freed in an error path.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Vandana Rungta <vrungta@amazon.com>
Closes #17041
2025-02-11 20:37:17 -05:00
Rob Norris
b8c73ab780
zio: do no-op injections just before handing off to vdevs
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
The purpose of no-op is to simulate a failure between a device cache and
its permanent store. We still want it to go through the queue and
respond in the same way to everything else.

So, inject "success" as the very last thing, and then move on to
VDEV_IO_DONE to be dequeued and so any followup work can occur.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17029
2025-02-07 20:42:24 -05:00
Dr. Christian Kohlschütter
d2147de319
Fix "make install" with DESTDIR set (#16995)
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
"DESTDIR=/path/to/target/root/ make install" may fail when installing to
a root that contains an existing lib/modules structure. When run as root
we may even affect the wrong kernel (the build system's one, or, if
running a different version, some other directory in /lib/modules, but
not the desired one installed in DESTDIR).

Add a missing reference to the INSTALL_MOD_PATH root when calling
"depmod" during "make install"

Also add a switch "DONT_DELETE_MODULES_FILES=1" that skips the removal
of files named "modules.*" prior to running depmod.

Signed-off-by: Christian Kohlschütter <christian@kohlschutter.com>
Closes #16994

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-02-07 16:38:58 -08:00
Paul Dagnelie
88020b993c
Add kstats tracking gang allocations
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used.  This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #17003
2025-02-06 15:42:50 -05:00
Paul Dagnelie
40496514b8
Expand fragmentation table to reflect larger possibile allocation sizes
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.

We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com>
Closes #16986
2025-02-06 15:40:01 -05:00
Rob Norris
2ca91ba3cf Linux 6.14: BLK_MQ_F_SHOULD_MERGE was removed
According to the upstream change, all callers set it, and all block
devices either honoured it or ignored it, so removing it entirely allows
a bunch of handling for the "unset" case to be removed, and it becomes
effectively implied.

We follow suit, and keep setting it for older kernels.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-02-05 09:43:45 -08:00
Rob Norris
7ef6b70e96 Linux 6.14: dops->d_revalidate now takes four args
This is a convenience for filesystems that need the inode of their
parent or their own name, as its often complicated to get that
information. We don't need those things, so this is just detecting which
prototype is expected and adjusting our callback to match.

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2025-02-05 09:42:37 -08:00
Rob Norris
390f6c1190
zio: lock parent zios when updating wait counts on reexecute
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.

It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.

To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #17016
2025-02-04 08:47:50 -05:00
Jaydeep Kshirsagar
21205f6488
Avoid ARC buffer transfrom operations in prefetch
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
This change will prevent prefetch to perform unnecessary ARC buffer
fill when reading from disk.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #17013
2025-02-01 11:15:24 -05:00
Brian Atkinson
1e32c57893
Update pin_user_pages() calls for Direct I/O
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
Originally #16856 updated Linux Direct I/O requests to use the new
pin_user_pages API. However, it was an oversight that this PR only
handled iov_iter's of type ITER_IOVEC and ITER_UBUF. Other iov_iter
types may try and use the pin_user_pages API if it is available. This
can lead to panics as the iov_iter is not being iterated over correctly
in zfs_uio_pin_user_pages().

Unfortunately, generic iov_iter API's that call pin_user_page_fast() are
protected as GPL only. Rather than update zfs_uio_pin_user_pages() to
account for all iov_iter types, we can simply just call
zfs_uio_get_dio_page_iov_iter() if the iov_iter type is not ITER_IOVEC
or ITER_UBUF. zfs_uio_get_dio_page_iov_iter() calls the
iov_iter_get_pages() calls that can handle any iov_iter type.

In the future it might be worth using the exposed iov_iter iterator
functions that are included in the header iov_iter.h since v6.7. These
functions allow for any iov_iter type to be iterated over and advanced
while applying a step function during iteration. This could possibly be
leveraged in zfs_uio_pin_user_pages().

A new ZFS test case was added to test that a ITER_BVEC is handled
correctly using this new code path. This test case was provided though
issue #16956.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16956 
Closes #17006
2025-01-30 15:53:59 -08:00
Alan Somers
12f0baf348
Make the vfs.zfs.vdev.raidz_impl sysctl cross-platform
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by:	Alan Somers <asomers@gmail.com>
Sponsored by:	ConnectWise
Closes #16980
2025-01-29 09:18:09 -05:00
rmacklem
34205715e1
FreeBSD: Add setting of the VFCF_FILEREV flag
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
The flag VFCF_FILEREV was recently defined in FreeBSD
so that a file system could indicate that it increments
va_filerev by one for each change.

Since ZFS does do this, set the flag if defined for the
kernel being built.  This allows the NFSv4.2 server to
reply with the correct change_attr_type attribute value.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rick Macklem <rmacklem@uoguelph.ca>
Closed #16976
2025-01-22 19:33:43 -05:00
Rob Norris
26e38aec46 zinject: add "probe" device injection type
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which is explicitly excluded
from injection. Even if it were possible, it would be awkward to do,
because a probe is sequence of reads and writes.

This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the
injection record and recieve the wanted error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-01-22 16:13:21 -08:00
Rob Norris
dfdc5ea993 zinject: make iotype extendable
I'm about to add a new "type", and I need somewhere to put it!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16947
2025-01-22 16:12:31 -08:00
Rob Norris
fe44c5ae27
Makefile.in: pass ARCH for modules_install as well
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
To do a cross-build using only kbuild rather than a full source tree,
ARCH= needs to be passed for the kbuild Makefile to find the
archspecific Makefile.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16944
2025-01-13 16:51:37 -08:00
Rob Norris
2aa3fbe761
zinject: count matches and injections for each handler
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zloop / zloop (push) Waiting to run
When building tests with zinject, it can be quite difficult to work out
if you're producing the right kind of IO to match the rules you've set
up.

So, here we extend injection records to count the number of times a
handler matched the operation, and how often an error was actually
injected (ie after frequency and other exclusions are applied).

Then, display those counts in the `zinject` output.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16938
2025-01-13 08:33:31 -05:00
Alexander Motin
fae4c664a4
FreeBSD: Use ashift in vdev_check_boot_reserve()
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
We should not hardcode 512-byte read size when checking for loader
in the boot area before RAIDZ expansion.  Disk might be unable to
handle that I/O as is, and the code zio_vdev_io_start() handling
the padding asserts doing it only for top-level vdev.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16942
2025-01-11 04:26:42 -05:00
n0-1
18c67d2418
Support for cross-compiling kernel modules
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zloop / zloop (push) Has been cancelled
In order to correctly cross-compile, one has to pass ARCH and
CROSS_COMPILE make flags to kernel module build calls. Facilitate this
in the same way as for custom CC flag by recognizing KERNEL_-prefixed
configure environment variables of same name.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Closes #16924
2025-01-05 17:27:19 -08:00
Don Brady
939e0237c5
Too many vdev probe errors should suspend pool
Similar to what we saw in #16569, we need to consider that a
replacing vdev should not be considered as fully contributing
to the redundancy of a raidz vdev even though current IO has
enough redundancy.

When a failed vdev_probe() is faulting a disk, it now checks
if that disk is required, and if so it suspends the pool until
the admin can return the missing disks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16864
2025-01-04 10:28:33 -08:00
Rob Norris
c02e1cf055
vdev_open: clear async remove flag after reopen
It's possible for a vdev to be flagged for async remove after the pool
has suspended. If the removed device has been returned when the pool is
resumed, the ASYNC_REMOVE task will still run at the end of txg, and
remove the device from the pool again.

To fix, we clear the async remove flag at reopen, just as we did for the
async fault flag in 5de3ac223.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16921
2025-01-03 14:42:06 -08:00
pstef
478b09577a
zfs_vnops_os.c: fallocate is valid but not supported on FreeBSD
This works around
/usr/lib/go-1.18/pkg/tool/linux_amd64/link:
mapping output file failed: invalid argument

It's happened to me under a Linux jail, but it's also happened to other
people, see https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270247#c4

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: pstef <pstef@users.noreply.github.com>
Closes #16918
2025-01-03 09:03:14 -08:00
Andrew Walker
25238baad5
Add missing zfs_exit() when snapdir is disabled (#16912)
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
zfs_vget doesn't zfs_exit when erroring out due to snapdir
being disabled.

Signed-off-by: Andrew Walker <awalker@ixsystems.com>
Reviewed-by: @bmeagherix
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-12-30 17:06:48 -08:00
shodanshok
54126fdb5b
set zfs_arc_shrinker_limit to 0 by default
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
zfs_arc_shrinker_limit was introduced to avoid ARC collapse due to
aggressive kernel reclaim. While useful, the current default (10000) is
too prone to OOM especially when MGLRU-enabled kernels with default
min_ttl_ms are used. Even when no OOM happens, it often causes too much
swap usage.

This patch sets zfs_arc_shrinker_limit=0 to not ignore kernel reclaim
requests. ARC now plays better with both kernel shrinker and pagecache
but, should ARC collapse happen again, MGLRU behavior can be tuned or
even disabled.

Anyway, zfs should not cause OOM when ARC can be released.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #16909
2024-12-29 11:50:19 -08:00
Ameer Hamza
9dd5fe1095
zvol: implement platform-independent part of block cloning
In Linux, block devices currently lack support for `copy_file_range`
API because the kernel does not provide the necessary functionality.
However, there is an ongoing upstream effort to address this
limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/.
We have adopted this upstream kernel patch into the TrueNAS kernel and
made some additional modifications to enable block cloning specifically
for the zvol block device. This patch implements the platform-
independent portions of these changes for inclusion in OpenZFS.
This patch does not introduce any new functionality directly into
OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when
zvols are migrated to non-TrueNAS systems that support Clone Range
replay in the ZIL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16901
2024-12-29 11:41:30 -08:00
Rob Norris
03b7cfdef3 spa_sync_props: remove pool userprops by setting empty-string
People have noted there's no way to remove a pool userprop, only zero
it. Turns vdev userprops had a method, by setting empty-string. So this
makes pool userprops follow the same behaviour.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16887
2024-12-29 11:12:04 -08:00
Rob Norris
c37a2ddaaa
microzap: set hard upper limit of 1M
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
The count of chunks in a microzap block is stored as an uint16_t
(mze_chunkid). Each chunk is 64 bytes, and the first is used to store a
header, so there are 32767 usable chunks, which is just under 2M. 1M is
the largest power-2-rounded block size under 2M, so we must set the
limit there.

If it goes higher, the loop in mzap_addent can overflow and fall into
the PANIC case.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16888
2024-12-26 17:10:09 -05:00
Alexander Motin
1acd246964
Fix readonly check for vdev user properties
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
VDEV_PROP_USERPROP is equal do VDEV_PROP_INVAL and so is not a real
property.  That's why vdev_prop_readonly() does not work right for
it.  In particular it may declare all vdev user properties readonly
on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16890
2024-12-20 17:25:35 -05:00
Rob Norris
ab7cbbe789
zprop: fix value help for ZPOOL_PROP_CAPACITY
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
It's a percentage and documented as such, but we were showing it as
<size>.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16881
2024-12-18 15:25:12 -08:00
Brian Atkinson
882a809983 Use pin_user_pages API for Direct I/O requests
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
As of kernel v5.8, pin_user_pages* interfaced were introduced. These
interfaces use the FOLL_PIN flag. This is preferred interface now for
Direct I/O requests in the kernel. The reasoning for using this new
interface for Direct I/O requests is explained in the kernel
documenetation:
Documentation/core-api/pin_user_pages.rst

If pin_user_pages_unlocked is available, the all Direct I/O requests
will use this new API to stay uptodate with the kernel API requirements.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16856
2024-12-16 10:24:30 -08:00
Brian Atkinson
c6442bd3b6 Removing old code outside of 4.18 kernsls
There were checks still in place to verify we could completely use
iov_iter's on the Linux side. All interfaces are available as of kernel
4.18, so there is no reason to check whether we should use that
interface at this point. This PR completely removes the UIO_USERSPACE
type. It also removes the check for the direct_IO interface checks.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16856
2024-12-16 10:23:45 -08:00
Shengqi Chen
acda137d8c
simd_stat: fix undefined CONFIG_KERNEL_MODE_NEON error on armel
CONFIG_KERNEL_MODE_NEON depends on CONFIG_NEON. Neither is defined
on armel. Add a guard to avoid compilation errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16871
2024-12-16 09:40:41 -08:00
Alexander Motin
ff6266ee9b
Fix use-afer-free regression in RAIDZ expansion
We should not dereference rra after the last zio_nowait() is called.
It seems very unlikely, but ASAN in ztest managed to catch it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16868
2024-12-14 14:02:11 -08:00
Rob Norris
46e06feded flush: only detect lack of flush support in one place
It seems there's no good reason for vdev_disk & vdev_geom to explicitly
detect no support for flush and set vdev_nowritecache.  Instead, just
signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess()
take care of it in one place.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16855
2024-12-13 12:19:54 -08:00
Rob Norris
fbea92432a flush: don't report flush error when disabling flush support
The first time a device returns ENOTSUP in repsonse to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend the succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!

So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16855
2024-12-13 12:19:20 -08:00
Chunwei Chen
6c9b4f18d3
Fix DR_OVERRIDDEN use-after-free race in dbuf_sync_leaf
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db
except for overridden case. However, this exception causes a race where
dbuf_new_size could free the arc_buf after the last dereference of
*datap and causes use-after-free. We fix this by cloning the buf
regardless if it's overridden.

The race:
--
P0                                     P1

                                       dbuf_hold_impl()
                                         // dbuf_hold_copy passed
                                         // because db_data_pending NULL

dbuf_sync_leaf()
  // doesn't clone *datap
  // *datap derefed to db_buf
  dbuf_write(*datap)

                                       dbuf_new_size()
                                         dmu_buf_will_dirty()
                                           dbuf_fix_old_data()
                                             // alloc new buf for P0 dr
                                             // but can't change *datap

                                         arc_alloc_buf()
                                         arc_buf_destroy()
                                           // alloc new buf for db_buf
                                           // and destroy old buf

  dbuf_write() // continue
    abd_get_from_buf(data->b_data,
    arc_buf_size(data))
      // use-after-free
--

Here's an example when it happens:

BUG: kernel NULL pointer dereference, address: 000000000000002e
RIP: 0010:arc_buf_size+0x1c/0x30 [zfs]
Call Trace:
 dbuf_write+0x3ff/0x580 [zfs]
 dbuf_sync_leaf+0x13c/0x530 [zfs]
 dbuf_sync_list+0xbf/0x120 [zfs]
 dnode_sync+0x3ea/0x7a0 [zfs]
 sync_dnodes_task+0x71/0xa0 [zfs]
 taskq_thread+0x2b8/0x4e0 [spl]
 kthread+0x112/0x130
 ret_from_fork+0x1f/0x30

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #16854
2024-12-12 16:18:45 -08:00
Alexander Motin
19a04e5ad2
BRT: Check bv_mos_entries in brt_entry_lookup()
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
When vdev first sees some block cloning, there is a window when
brt_maybe_exists() might already return true since something was
cloned, but bv_mos_entries is still 0 since BRT ZAP was not yet
created.  In such case we should not try to look into the ZAP
and dereference NULL bv_mos_entries_dnode.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16851
2024-12-12 10:22:41 -08:00
Rob Norris
e0039c7057 Remove unnecessary CSTYLED escapes on top-level macro invocations
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
cstyle can handle these cases now, so we don't need to disable it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 08:53:57 -08:00
Alexander Motin
a44eaf1690
Optimize RAIDZ expansion
- Instead of copying one ashift-sized block per ZIO, copy as much
as we have contiguous data up to 16MB per old vdev.  To avoid data
moves use gang ABDs, so that read ZIOs can directly fill buffers
for write ZIOs.  ABDs have much smaller overhead than ZIOs in both
memory usage and processing time, plus big I/Os do not depend on
I/O aggregation and scheduling to reach decent performance on HDDs.
 - Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms.
 - Use 32bit range tree when possible (practically always now) to
slightly reduce memory usage.
 - Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as
for main ones.
 - Fix rate overflows in `zpool status` reporting.

With these changes expanding RAIDZ1 from 4 to 5 children I am able
to reach 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both are
limited by devices instead of CPU.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15680
Closes #16819
2024-12-06 08:50:16 -08:00
Alexander Motin
e8b333e4d3
Fix false assertion in dmu_tx_dirty_buf() on cloning
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
Same as writes block cloning can increase block size and number of
indirection levels.  That means it can dirty block 0 at level 0 or
at new top indirection level without explicitly holding them.

A block cloning test case for large offsets has been added.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16825
2024-12-05 11:48:08 -08:00
Don Brady
44446dccdb
During pool export flush the ARC asynchronously
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16215
2024-12-05 08:58:20 -08:00
Alexander Motin
a01504b35c
Improve speculative prefetcher for block cloning
- Issue prescient prefetches for demand indirect blocks after the
first one.  It should be quite rare for reads/writes, but much more
useful for cloning due to much bigger (up to 1022 blocks) accesses.
It covers the gap during the first couple accesses when we can not
speculate yet, but we know what is needed right now.  It reduces
dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode().
 - Increase maximum prefetch distance for indirect blocks from 64
to 128MB.  It should cover the maximum 1022 blocks of block cloning
access size in case of default 128KB recordsize used.  In case of
bigger recordsize the above prescient prefetch should also help.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16814
2024-12-04 15:19:05 -08:00
Alexander Motin
c33a55b0c2
Allow dsl_deadlist_open() return errors
In some cases like dsl_dataset_hold_obj() it is possible to handle
those errors, so failure to hold dataset should be better than
kernel panic.  Some other places where these errors are still not
handled but asserted should be less dangerous just as unreachable.

We have a user report about pool corruption leading to assertions
on these errors.  Hopefully this will make behavior a bit nicer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16836
2024-12-04 15:15:58 -08:00
Mark Johnston
0e020bf3e1
FreeBSD: Remove an incorrect assertion in zfs_getpages()
The pages in the array may become valid after this initial unbusying,
so the assertion only holds during the first iteration of the outer
loop.

Later in zfs_getpages(), the dmu_read_pages() loop handles already-valid
pages.  Just drop the assertion, it's not terribly useful.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: Klara, Inc.
Closes #16810
Closes #16834
2024-12-04 14:24:50 -05:00
Mariusz Zaborski
4b4e346b9f
Add ability to scrub from last scrubbed txg
Some users might want to scrub only new data because they would like
to know if the new write wasn't corrupted.  This PR adds possibility
scrub only newly written data.

This introduces new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run scrub only
from the last saved point. We use a scn_max_txg and scn_min_txg
which are already built into scrub, to accomplish that.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes #16301
2024-12-04 14:21:45 -05:00
Alexander Motin
654ade8ca2 FreeBSD: Remove some illumos compat from vnode.h
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
Should make no difference, just some dead code cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by:Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16808
2024-12-03 09:32:54 -08:00
Alexander Motin
6e3c109bc0
Fix regression in dmu_buf_will_fill()
Direct I/O implementation added condition to call dbuf_undirty()
only in case of block cloning.  But the condition is not right if
the block is no longer dirty in this TXG, but still in DB_NOFILL
state.  It resulted in block not reverting to DB_UNCACHED and
following NULL de-reference on attempt to access absent db_data.

While there, add assertions for db_data to make debugging easier.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16829
2024-12-02 17:08:40 -08:00
Pavel Snajdr
c8a326aab7
Linux: fix zfs_uio_dio_check_for_zero_page
The intent here is to replace the zero page pointer in the array of
pointers to pages in the struct.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #16812 
Closes #16689
Closes #16642
2024-12-02 16:25:35 -08:00
Pavel Snajdr
d2b0ca953f
Assert if we're logging after final txg was set
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
This allowed to debug #16714, fixed in #16782.  Without assertions
added here it is difficult to figure out what logs cause the problem,
since the assertion happens in sync thread context.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #16795
2024-11-25 18:37:56 -05:00
Alexander Motin
d0a91b9f88
FreeBSD: Reduce copy_file_range() source lock to shared
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
Linux locks copy_file_range() source as shared.  FreeBSD was doing
it also, but then was changed to exclusive, partially because KPI
of that time was doing so, and partially seems out of caution.
Considering zfs_clone_range() uses range locks on both source and
destination, neither should require exclusive vnode locks. But one
step at a time, just sync it with Linux for now.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16789
Closes #16797
2024-11-23 14:29:03 -08:00
Alexander Motin
b3b0ce64d5
FreeBSD: Lock vnode in zfs_ioctl()
Previously vnode was not locked there, unlike Linux.  It required
locking it in vn_flush_cached_data(), which recursed on the lock
if called from zfs_clone_range(), having the vnode locked.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16789
Closes #16796
2024-11-23 14:26:52 -08:00
Pavel Snajdr
38c0324c0f
Linux: Fix zfs_prune panics
Some checks failed
checkstyle / checkstyle (push) Has been cancelled
CodeQL / Analyze (cpp) (push) Has been cancelled
CodeQL / Analyze (python) (push) Has been cancelled
zfs-qemu / Setup (push) Has been cancelled
zloop / zloop (push) Has been cancelled
zfs-qemu / qemu-x86 (push) Has been cancelled
zfs-qemu / Cleanup (push) Has been cancelled
by protecting against sb->s_shrink eviction on umount with newer kernels

deactivate_locked_super calls shrinker_free and only then
sops->kill_sb cb, resulting in UAF on umount when trying
to reach for the shrinker functions in zpl_prune_sb of
in-umount dataset

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #16770
2024-11-21 15:30:43 -08:00
Alexander Motin
ae1d11882d
BRT: Clear bv_entcount_dirty on destroy
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
This fixes assertion in brt_sync_table() on debug builds when last
cloned block on the vdev is freed and bv_meta_dirty is cleared,
while bv_entcount_dirty is not.  Should not matter in production.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16791
2024-11-21 08:20:22 -08:00
Alexander Motin
9a81484e35
ZAP: Reduce leaf array and free chunks fragmentation
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
Previous implementation of zap_leaf_array_free() put chunks on the
free list in reverse order.  Also zap_leaf_transfer_entry() and
zap_entry_remove() were freeing name and value arrays in reverse
order.  Together this created a mess in the free list, making
following allocations much more fragmented than necessary.

This patch re-implements zap_leaf_array_free() to keep existing
chunks order, and implements non-destructive zap_leaf_array_copy()
to be used in zap_leaf_transfer_entry() to allow properly ordered
freeing name and value arrays there and in zap_entry_remove().

With this change test of some writes and deletes shows percent of
non-contiguous chunks in DDT reducing from 61% and 47% to 0% and
17% for arrays and frees respectively.  Sure some explicit sorting
could do even better, especially for ZAPs with variable-size arrays,
but it would also cost much more, while this should be very cheap.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16766
2024-11-20 13:37:52 -08:00
Mark Johnston
d76d79fd27
zio: Avoid sleeping in the I/O path
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
zio_delay_interrupt(), apparently used for fault injection, is executed
in the I/O pipeline.  It can cause the calling thread to go to sleep,
which is not allowed on FreeBSD.  This happens only for small delays,
though, and there's no apparent reason to avoid deferring to a taskqueue
in that case, as it already does otherwise.

Simply go to sleep unconditionally.  This fixes an occasional panic I
see when running the ZTS on FreeBSD.  Also remove an unhelpful comment
referencing the non-existent timeout_generic().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16785
2024-11-20 08:23:08 -08:00
Alexander Motin
457f8b76e7
BRT: More optimizations after per-vdev splitting
- With both pending and current AVL-trees being per-vdev and having
effectively identical comparison functions (pending tree compared
also birth time, but I don't believe it is possible for them to be
different for the same offset within one transaction group), it
makes no sense to move entries from one to another.  Instead inline
dramatically simplified brt_entry_addref() into brt_pending_apply().
It no longer requires bv_lock, since there is nothing concurrent
to it at the time.  And it does not need to search the tree for the
previous entries, since it is the same tree, we already have the
entry and we know it is unique.
 - Put brt_vdev_lookup() and brt_vdev_addref() into different tree
traversals to avoid false positives in the first due to the second
entcount modifications.  It saves dramatic amount of time when a
file cloned first time by not looking for non-existent ZAP entries.
 - Remove avl_is_empty(bv_tree) check from brt_maybe_exists().  I
don't think it is needed, since by the time all added entries are
already accounted in bv_entcount. The extra check must be producing
too many false positives for no reason.  Also we don't need bv_lock
there, since bv_entcount pointer must be table at this point, and
we don't care about false positive races here, while false negative
should be impossible, since all brt_vdev_addref() have already
completed by this point.  This dramatically reduces lock contention
on massive deletes of cloned blocks.  The only remaining one is
between multiple parallel free threads calling brt_entry_decref().
 - Do not update ZAP if net change for a block over the TXG was 0.
In combination with above it makes file move between datasets as
cheap operation as originally intended if it fits into one TXG.
 - Do not allocate vdevs on pool creation or import if it did not
have active block cloning. This allows to save a bit in few cases.
 - While here, add proper error handling in brt_load() on pool
import instead of assertions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16773
2024-11-20 06:20:14 -08:00
Alexander Motin
0ca82c5680
L2ARC: Stop rebuild before setting spa_final_txg
Without doing that there is a race window on export when history
log write by completed rebuild dirties transaction beyond final,
triggering assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16714
Closes #16782
2024-11-20 06:11:51 -08:00
Alexander Motin
534688948c
Remove hash_elements_max accounting from DBUF and ARC
Some checks are pending
checkstyle / checkstyle (push) Waiting to run
CodeQL / Analyze (cpp) (push) Waiting to run
CodeQL / Analyze (python) (push) Waiting to run
zfs-qemu / Setup (push) Waiting to run
zfs-qemu / qemu-x86 (push) Blocked by required conditions
zfs-qemu / Cleanup (push) Blocked by required conditions
zloop / zloop (push) Waiting to run
Those values require global atomics to get current hash_elements
values in few of the hottest code paths, while in all the years I
never cared about it.  If somebody wants, it should be easy to
get it by periodic sampling, since neither ARC header nor DBUF
counts change so fast that it would be difficult to catch.

For now I've left hash_elements_max kstat for ARC, since it was
used/reported by arc_summary and it would break older versions,
but now it just reports the current value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16759
2024-11-19 07:00:16 -08:00
Rob Norris
ffe2112795
Move "no name changes" from compression to checksum table
Compression names actually aren't used in dedup table names, but
checksum names are.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16776
2024-11-19 06:55:27 -08:00
Alexander Motin
fd6e8c1d2a BRT: Rework structures and locks to be per-vdev
While block cloning operation from the beginning was made per-vdev,
before this change most of its data were protected by two pool-
wide locks.  It created lots of lock contention in many workload.

This change makes most of block cloning data structures per-vdev,
which allows to lock them separately.  The only pool-wide lock now
it spa_brt_lock, protecting array of per-vdev pointers and in most
cases taken as reader.  Also this splits per-vdev locks into three
different ones: bv_pending_lock protects the AVL-tree of pending
operations in open context, bv_mos_entries_lock protects BRT ZAP
object from while being prefetched, and bv_lock protects the rest
of per-vdev context during TXG commit process.  There should be
no functional difference aside of some optimizations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:04:11 -08:00
Alexander Motin
309ce6303f ZAP: Add by_dnode variants to lookup/prefetch_uint64
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:04:02 -08:00
Alexander Motin
1ee251bdde BRT: Don't call brt_pending_remove() on holes/embedded
We are doing exactly the same checks around all brt_pending_add().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:03:57 -08:00
Ameer Hamza
3462f3bd50
zvol_os.c: Increase optimal IO size
Since zvol read and write can process up to (DMU_MAX_ACCESS / 2) bytes
in a single operation, the current optimal I/O size is too low. SCST
directly reports this value as the optimal transfer length for the
target SCSI device. Increasing it from the previous volblocksize results
in performance improvement for large block parallel I/O workloads.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16750
2024-11-14 14:14:33 -08:00
Mark Johnston
8dc452d907
Fix some nits in zfs_getpages()
- If we don't want dmu_read_pages() to perform extra readahead/behind,
  pass a pointer to 0 instead of a null pointer, as dum_read_pages()
  expects rahead and rbehind to be non-null.
- Avoid unneeded iterations in a loop.

Sponsored-by: Klara, Inc.
Reported-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16758
2024-11-14 14:12:57 -08:00
Rob Norris
46c4f2ce0b
dsl_dataset: put IO-inducing frees on the pool deadlist
dsl_free() calls zio_free() to free the block. For most blocks, this
simply calls metaslab_free() without doing any IO or putting anything on
the IO pipeline.

Some blocks however require additional IO to free. This at least
includes gang, dedup and cloned blocks. For those, zio_free() will issue
a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible
for dsl_dataset_block_kill() to be called millions of time on a single
transaction (eg a 2T object of 128K blocks is 16M blocks). If those are
all IO-inducing frees, that then becomes 16M FREE IOs placed on the
pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T
object that requires a 20G allocation of resident memory from the
zio_cache. If that can't be satisfied by the kernel, an out-of-memory
condition is raised.

This would be better handled by improving the cases that the
dmu_tx_assign() throttle will handle, or by reducing the overheads
required by the IO pipeline, or with a better central facility for
freeing blocks.

For now, we simply check for the cases that would cause zio_free() to
create a FREE IO, and instead put the block on the pool's freelist. This
is the same place that blocks from destroyed datasets go, and the async
destroy machinery will automatically see them and trickle them out as
normal.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #6783
Closes #16708
Closes #16722 
Closes #16697
2024-11-13 07:38:42 -08:00
Alexander Motin
a60ed3822b
L2ARC: Move different stats updates earlier
..., before we make the header or the log block visible to others.
It should fix assertion on allocated space going negative if the
header is freed once the lock is dropped, while the write is still
going.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16040
Closes #16743
2024-11-13 07:31:50 -08:00
Mark Johnston
178682506f Grab the rangelock unconditionally in zfs_getpages()
As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16643
2024-11-13 07:25:39 -08:00
Mark Johnston
25eb538778 Fix a potential page leak in mappedread_sf()
mappedread_sf() may allocate pages; if it fails to populate a page
can't free it, it needs to ensure that it's placed into a page queue,
otherwise it can't be reclaimed until the vnode is destroyed.

I think this is quite unlikely to happen in practice, it was noticed by
code inspection.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16643
2024-11-13 07:24:14 -08:00
Sam James
4a7a0a0290
Use <fcntl.h> instead of <sys/fcntl.h>
When building on musl, we get:

```
In file included from tests/zfs-tests/cmd/getversion.c:22:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect
 #include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
 1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>

In file included from module/os/linux/zfs/vdev_file.c:36:
/usr/include/sys/fcntl.h:1:2: error: #warning redirecting incorrect
 #include <sys/fcntl.h> to <fcntl.h> [-Werror=cpp]
 1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h>
```

Bug: https://bugs.gentoo.org/925235
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sam James <sam@gentoo.org>
Closes #15925
2024-11-07 11:20:37 -08:00
Brian Atkinson
187f931372
Update ABD stats for linear page Linux
a10e552 updated abd_free_linear_page() to no longer call
abd_update_scatter_stat(). This meant that linear pages that were not
attached to Direct I/O requests were not doing waste accounting for the
ARC. This led to performance issues due to incorrect ARC accounting that
resulted in 100% of CPU time being spent in arc_evict() during prolonged
I/O workloads with the ARC.

The call to abd_update_scatter_stats() is now conditionally called in
abd_free_linear_page() when the ABD is not from a Direct I/O request.

Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16729
2024-11-07 11:11:05 -08:00
Chunwei Chen
5945676bcc
ZFS send should use spill block prefetched from send_reader_thread
Currently, even though send_reader_thread prefetches spill block,
do_dump() will not use it and issues its own blocking arc_read. This
causes significant performance degradation when sending datasets with
lots of spill blocks.

For unmodified spill blocks, we also create send_range struct for them
in send_reader_thread and issue prefetches for them. We piggyback them
on the dnode send_range instead of enqueueing them so we don't break
send_range_after check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: david.chen <david.chen@nutanix.com>
Closes #16701
2024-11-06 11:52:01 -08:00
tstabrawa
7b6e9675da Use simple folio migration function
Avoids using fallback_migrate_folio, which starts unnecessary writeback
(leading to BUG in migrate_folio_extra).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: tstabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes #16568
Closes #16723
2024-11-06 11:44:10 -08:00
tstabrawa
f38e2d239f Revert "Avoid BUG in migrate_folio_extra"
This reverts commit b052035990.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: tstabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes #16568
Closes #16723
2024-11-06 11:43:05 -08:00
наб
60c202cca4 module: unicode: remove unused tolower transformations
With the previous patch this yields
  $ size -G ./module/zfs.ko ./module/zfs.new.ko
        text       data        bss      total filename
     2865126    1597982     755768    5218876 ./module/zfs.ko
     2864038    1429784     755768    5049590 ./module/zfs.new.ko
       -1088    -168198
         -1k      -164k

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #16704
2024-11-04 17:26:35 -08:00
Alexander Motin
b16e096198
Reduce dirty records memory usage
Small block workloads may use a very large number of dirty records.
During simple block cloning test due to BRT still using 4KB blocks
I can easily see up to 2.5M of those used.  Before this change
dbuf_dirty_record_t structures representing them were allocated via
kmem_zalloc(), that rounded their size up to 512 bytes.

Introduction of specialized kmem cache allows to reduce the size
from 512 to 408 bytes.  Additionally, since override and raw params
in dirty records are mutually exclusive, puting them into a union
allows to reduce structure size down to 368 bytes, increasing the
saving to 28%, that can be a 0.5GB or more of RAM.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16694
2024-11-04 16:42:06 -08:00
Rob Norris
91bd12dfeb
zfs(4): remove "experimental" from zfs_bclone_enabled
I think we've done enough experiments.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16189 
Closes #16712
2024-11-01 14:43:25 -07:00
наб
1c7d4b4c94
module: unicode: remove unused uconv.c
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #16702
2024-11-01 12:12:13 -07:00
Rob Norris
3c650bec15 Revert "Workaround issue of Linux vdev_disk.c, (#16678)"
Now that we can handle these different alignments, we don't this
workaround.

This reverts commit aefc2da8a5.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16687
2024-10-31 17:00:53 -07:00
Rob Norris
e7425ae624 vdev_disk: move abd return and free off the interrupt handler
Freeing an ABD can take sleeping locks to update various stats. We
aren't allowed to sleep on an interrupt handler. So, move the free off
to the io_done callback.

We should never have been freeing things in the interrupt handler, but
we got away with it because we were usually freeing a linear ABD, which
at most is returning two objects to a cache and never sleeping. Scatter
ABDs can be used now, and those have more complex locking.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16687
2024-10-31 17:00:53 -07:00
Rob Norris
63bafe60ec vdev_disk: try harder to ensure IO alignment rules
It seems out our notion of "properly" aligned IO was incomplete. In
particular, dm-crypt does its own splitting, and assumes that a logical
block will never cross an order-0 page boundary (ie, the physical page
size, not compound size). This effectively means that it needs to be
possible to split a BIO at any page or block size boundary and have it
work correctly.

This updates the alignment check function to enforce these rules (to the
extent possible).

Our response to misaligned data is to make some new allocation that is
properly aligned, and copy the data into it. It turns out that
linearising (via abd_borrow_buf()) is not enough, because we allocate eg
4K blocks from a general purpose slab, and so may receive (or already
have) a 4K block that crosses pages.

So instead, we allocate a new ABD, which is guaranteed to be aligned
properly to block sizes, and then copy everything into it, and back out
on the way back.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16687 #16631 #15646 #15533 #14533
2024-10-31 17:00:42 -07:00
Serapheim Dimitropoulos
ae93aeb849
Add warning for external consumers of dmu_tx_callback_register
While reading some code @grwilson came across the above function that
seemingly had no consumers besides a ztest callback that ensures that
the tx_callback infrastructure works correctly. It turns out that Lustre
is the main (and potentially the only) consumer of this. Refer to
`osd_trans_commit_cb` of `lustre/osd-zfs/osd_handler.c` in the Lustre
repo for more info. Let's add a comment highlighting this before someone
removes it by mistake.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Closes #16698
2024-10-30 20:11:40 -04:00
Alexander Motin
6187b19434
On the first vdev open ignore impossible ashift hints
If on the first open device's logical ashift is bigger than set
by pool's ashift property, ignore the last as unusable instead of
creating vdev that will fail most of I/Os due to misalignment.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #16690
2024-10-29 15:23:24 -04:00
Dimitry Andric
2bf1520211
Fix gcc uninitialized warning in FreeBSD zio_crypt.c
In FreeBSD's `zio_do_crypt_data()`, ensure that two `struct uio`
variables are cleared before copying data out of them. This avoids
accessing garbage data, and fixes gcc `-Wuninitialized` warnings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Toomas Soome <tsoome@me.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #16688
2024-10-29 15:05:02 -04:00
Alexander Motin
aefc2da8a5
Workaround issue of Linux vdev_disk.c, (#16678)
in some cases not linearizing buffers with disk sector crossing a
page boundary. It is fine for hardware, but somehow required by LUKS.
It is not typical for ZFS to produce such buffers, but it may happen
if 6KB block is compressed to 4KB, while still having 2KB alignment.
Banning the 6KB buffers helps vdevs with ashifh=12.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
2024-10-23 10:19:46 -07:00
Rob Norris
21cba06bef
config: fix dequeue_signal check for kernels <4.20
Before 4.20, kernel_siginfo_t was just called siginfo_t. This was
causing the kthread_dequeue_signal_3arg_task check, which uses
kernel_siginfo_t, to fail on older kernels.

In d6b8c17f1, we started checking for the "new" three-arg
dequeue_signal() by testing for the "old" version. Because that test is
explicitly using kernel_siginfo_t, it would fail, leading to the build
trying to use the new three-arg version, which would then not compile.

This commit fixes that by avoiding checking for the old 3-arg
dequeue_signal entirely. Instead, we check for the new one, as well as
the 4-arg form, and we use the old form as a fallback. This way, we
never have to test for it explicitly, and once we're building
HAVE_SIGINFO will make sure we get the right kernel_siginfo_t for it, so
everything works out nice.

Original-patch-by: Finix <yancw@info2soft.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16666
2024-10-20 19:50:13 -07:00
Umer Saleem
27e8f56102
Fix inconsistent mount options for ZFS root
While mounting ZFS root during boot on Linux distributions from initrd,
mount from busybox is effectively used which executes mount system call
directly. This skips the ZFS helper mount.zfs, which checks and enables
the mount options as specified in dataset properties. As a result,
datasets mounted during boot from initrd do not have correct mount
options as specified in ZFS dataset properties.

There has been an attempt to use mount.zfs in zfs initrd script,
responsible for mounting the ZFS root filesystem (PR#13305). This was
later reverted (PR#14908) after discovering that using mount.zfs breaks
mounting of snapshots on root (/) and other child datasets of root have
the same issue (Issue#9461).

This happens because switching from busybox mount to mount.zfs correctly
parses the mount options but also adds 'mntpoint=/root' to the mount
options, which is then prepended to the snapshot mountpoint in
'.zfs/snapshot'. '/root' is the directory on Debian with initramfs-tools
where root filesystem is mounted before pivot_root. When Linux runtime
is reached, trying to access the snapshots on root results in
automounting the snapshot on '/root/.zfs/*', which fails.

This commit attempts to fix the automounting of snapshots on root, while
using mount.zfs in initrd script. Since the mountpoint of dataset is
stored in vfs_mntpoint field, we can check if current mountpoint of
dataset and vfs_mntpoint are same or not. If they are not same, reset
the vfs_mntpoint field with current mountpoint. This fixes the
mountpoints of root dataset and children in respective vfs_mntpoint
fields when we try to access the snapshots of root dataset or its
children. With correct mountpoint for root dataset and children stored
in vfs_mntpoint, all snapshots of root dataset are mounted correctly
and become accessible.

This fix will come into play only if current process, that is trying to
access the snapshots is not in chroot context. The Linux kernel API
that is used to convert struct path into char format (d_path), returns
the complete path for given struct path. It works in chroot environment
as well and returns the correct path from original filesystem root.

However d_path fails to return the complete path if any directory from
original root filesystem is mounted using --bind flag or --rbind flag
in chroot environment. In this case, if we try to access the snapshot
from outside the chroot environment, d_path returns the path correctly,
i.e. it returns the correct path to the directory that is mounted with
--bind flag. However inside the chroot environment, it only returns the
path inside chroot.

For now, there is not a better way in my understanding that gives the
complete path in char format and handles the case where directories from
root filesystem are mounted with --bind or --rbind on another path which
user will later chroot into. So this fix gets enabled if current
process trying to access the snapshot is not in chroot context.

With the snapshots issue fixed for root filesystem, using mount.zfs in
ZFS initrd script, mounts the datasets with correct mount options.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by: Umer Saleem <usaleem@ixsystems.com>
Closes #16646
2024-10-17 09:09:39 -04:00
Brian Behlendorf
c642e985e5
Revert "Temporarily disable Direct IO by default"
This partially reverts commit 41210597.  Now that b4e4cbeb2 has
been merged Direct IO can be enabled by default for Linux, but
for FreeBSD there still remains a potentially insufficient range
locking in zfs_getpages() which needs to be resolved.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16629
2024-10-12 13:51:35 -07:00
Brian Atkinson
b4e4cbeb20
Always validate checksums for Direct I/O reads
This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead checksum
verify failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued though the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported though zpool status -d in
the DIO column. Also the zevent has been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes
2. dio_verify_rd -> Checksum verification failure for reads.
This allows for determining what I/O operation was the culprit for the
checksum verification failure. All DIO errors are reported only on the
top-level VDEV.

Even though FreeBSD can write protect pages (stable pages) it still has
the same issue as Linux with Direct I/O reads.

This commit updates the following:
1. Propogates checksum failures for reads all the way up to the
   top-level VDEV.
2. Reports errors through zpool status -d as DIO.
3. Has two zevents for checksum verify errors with Direct I/O. One for
   read and one for write.
4. Updates FreeBSD ABD code to also check for ABD_FLAG_FROM_PAGES and
   handle ABD buffer contents validation the same as Linux.
5. Updated manipulate_user_buffer.c to also manipulate a buffer while a
   Direct I/O read is taking place.
6. Adds a new ZTS test case dio_read_verify that stress tests the new
   code.
7. Updated man pages.
8. Added an IMPLY statement to zio_checksum_verify() to make sure that
   Direct I/O reads are not issued as speculative.
9. Removed self healing through mirror, raidz, and dRAID VDEVs for
   Direct I/O reads.

This issue was first observed when installing a Windows 11 VM on a ZFS
dataset with the dataset property direct set to always. The zpool
devices would report checksum failures, but running a subsequent zpool
scrub would not repair any data and report no errors.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes #16598
2024-10-09 12:28:08 -07:00
JKDingwall
0b4dcbe5b4
Fix generation of kernel uevents for snapshot rename on linux
`zvol_rename_minors()` needs to be given the full path not just the
snapshot name.  Use code removed in a0bd735ad as a guide
to providing the necessary values.

Add ZTS check for /dev changes after snapshot rename.  After
renaming a snapshot with 'snapdev=visible' ensure that the /dev
entries are updated to reflect the rename.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Dingwall <james@dingwall.me.uk>
Closes #14223 
Closes #16600
2024-10-06 14:36:33 -07:00
Alexander Motin
4ebe674d91
ARC: Cache arc_c value during arc_evict()
Since arc_evict() run can take some time, arc_c change during it
may result in undesired shift in ARC states balance. Primarily in
case of arc_c reduction it may cause eviction from MFU data state
despite its being below the target already.  Instead we should
evict as originally planned and if needed do another round after.

Reviewed-by: Theera K. <tkittich@hotmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16576
Closes #16605
2024-10-04 10:56:43 -07:00
Pavel Snajdr
0d77e738e6
Defer resilver only when progress is above a threshold
Restart a resilver from scratch, if the current one in progress is
below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).

The original rationale for deferring additional resilvers, when there is
already one in progress, was to help achieving data redundancy sooner
for the data that gets scanned at the end of the resilver.

But in case the admin wants to attach multiple disks to a single vdev,
it wasn't immediately obvious the admin is supposed to run
`zpool resilver` afterwards to reset the deferred resilvers and start
a new one from scratch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #15810
2024-10-04 10:41:17 -07:00
Rob Norris
224393a321
feature: large_microzap
In a4b21eadec we added the zap_micro_max_size tuneable to raise the size
at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block)
ZAPs. Before this, a microZAP was limited to 128KiB, which was the old
largest block size. The side effect of raising the max size past 128KiB
is that it be stored in a large block, requiring the large_blocks
feature.

Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.

This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the
first time that occurs. This feature is later checked when generating
the stream and if active, the send operation will abort unless
--large-block has also been requested.

Changing the limit still requires zap_micro_max_size to be changed. The
state of this flag effectively sets the upper value for this tuneable,
that is, if the feature is disabled, the tuneable will be clamped to
128KiB.

A stream flag is also added to ensure that the receiver also activates
its own feature flag upon receiving the stream. This is not strictly
necessary to _use_ the received microZAP, since it doesn't care how
large its block is, but it is required to send the microZAP object on,
otherwise the original problem occurs again.

Because it's difficult to reliably distinguish a microZAP from a fatZAP
from outside the ZAP code, and because it seems unlikely that most
users are affected (a fairly niche tuneable combined with what should be
an uncommon use of send), and for the sake of expediency, this change
activates the feature the first time a microZAP grows to use a large
block, and is never deactivated after that. This can be improved in the
future.

This commit changes nothing for existing pools that already have large
microZAPs. The feature will not be retroactively applied, but will be
activated the next time a microZAP grows past the limit.

Don't use large_blocks feature for enable/disable tests.  The
large_microzap depends on large_blocks, so it gets enabled as a
dependency, breaking the test. Instead use feature "longname", which has
the exact same feature characteristics.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16593
2024-10-02 20:47:11 -07:00
Brian Behlendorf
412105977c
Temporarily disable Direct IO by default
While some remaining issues are resolved with the recently merged
Direct IO functionality disable it by default.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16597
2024-10-02 18:24:29 -07:00
Brian Behlendorf
d34d4f97a8
snapdir: add 'disabled' value to make .zfs inaccessible
In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents

Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.

The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.

I think that technically the zfs_dirlook change is enough to prevent
access, but preventing lookups and dir entries in an already opened .zfs
handle might also be a good idea to prevent races when modifying the
property at runtime.

Add zfs_snapshot_no_setuid parameter to control whether automatically
mounted snapshots have the setuid mount option set or not.

this could be considered a partial fix for one of the scenarios
mentioned in desired.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Co-authored-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #3963
Closes #16587
2024-10-02 09:12:02 -07:00
rilysh
86737c5927
Avoid computing strlen() inside loops
Compiling with -O0 (no proper optimizations), strlen() call
in loops for comparing the size, isn't being called/initialized
before the actual loop gets started, which causes n-numbers of
strlen() calls (as long as the string is). Keeping the length
before entering in the loop is a good idea.

On some places, even with -O2, both GCC and Clang can't
recognize this pattern, which seem to happen in an array
of char pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: rilysh <nightquick@proton.me>
Closes #16584
2024-10-02 09:10:06 -07:00
Rob Norris
0cf14bf4b5 Linux 6.12: PG_error flag was removed
torvalds/linux@09022bc196 removes the flag, and the corresponding
SetPageError() and ClearPageError() macros, with no replacement offered.

Going back through the upstream history, use of this flag has been
gradually removed over the last year as part of the long tail of
converting everything to folios. Interesting tidbit comments from
torvalds/linux@29e9412b25 and torvalds/linux@420e05d0de suggest that
this flag has not been used meaningfully since page writeback failures
started being recorded in errseq_t instead (the whole "fsyncgate" thing,
~2017, around torvalds/linux@8ed1e46aaf).

Given that, it's possible that since perhaps Linux 4.13 we haven't been
getting anything by setting the flag. I don't know if that's true and/or
if there's something we should be doing instead, but my gut feel is that
its probably fine we only use the page cache as a proxy to allow mmap()
to work, rather than backing IO with it.

As such, I'm expecting that removing this will do no harm, but I'm
leaving it in for older kernels to maintain status quo, and if there is
an overall better way, that is left for a future change.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16582
2024-10-01 13:54:05 -07:00
Rob Norris
d6b8c17f1d Linux 6.12: support 3arg dequeue_signal() without task param
See torvalds/linux@a2b80ce87a. It claims the task arg is always
`current`, and so it is with us, so this is a safe change to make. The
only spanner is that we also support the older pre-5.17 3-arg
dequeue_signal() which had different meaning, so we have to check the
types to get the right one.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16582
2024-10-01 13:53:50 -07:00
Sanjeev Bagewadi
20232ecfaa Support for longnames for files/directories (Linux part)
This patch adds the ability for zfs to support file/dir name up to 1023
bytes. This number is chosen so we can support up to 255 4-byte
characters. This new feature is represented by the new feature flag
feature@longname.

A new dataset property "longname" is also introduced to toggle longname
support for each dataset individually. This property can be disabled,
even if it contains longname files. In such case, new file cannot be
created with longname but existing longname files can still be looked
up.

Note that, to my knowledge native Linux filesystems don't support name
longer than 255 bytes. So there might be programs not able to work with
longname.

Note that NFS server may needs to use exportfs_get_name to reconnect
dentries, and the buffer being passed is limit to NAME_MAX+1 (256). So
NFS may not work when longname is enabled.

Note, FreeBSD vfs layer imposes a limit of 255 name lengh, so even
though we add code to support it here, it won't actually work.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:40:27 -07:00
Sanjeev Bagewadi
3cf2bfa570 Allocate zap_attribute_t from kmem instead of stack
This patch is preparatory work for long name feature. It changes all
users of zap_attribute_t to allocate it from kmem instead of stack. It
also make zap_attribute_t and zap_name_t structure variable length.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #15921
2024-10-01 13:39:08 -07:00
Don Brady
141368a4b6
Restrict raidz faulted vdev count
Specifically, a child in a replacing vdev won't count when assessing
the dtl during a vdev_fault()

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16569
2024-10-01 09:12:11 -07:00
Rob Norris
c84a37ae93
lua: add flex array field to TString type
Linux 6.10+ with CONFIG_FORTIFY_SOURCE notices memcpy() accessing past
the end of TString, because it has no indication that there there may be
an additional allocation there.

There's no appropriate upstream change for this (ancient) version of
Lua, so this is the narrowest change I could come up with to add a flex
array field to the end of TString to satisfy the check. It's loosely
based on changes from lua/lua@ca41b43f and lua/lua@9514abc2.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16541
Closes #16583
2024-09-30 10:30:03 -07:00
Rob Norris
6f50f8e16b
zfs_log: add flex array fields to log record structs
ZIL log record structs (lr_XX_t) are frequently allocated with extra
space after the struct to carry variable-sized "payload" items.

Linux 6.10+ compiled with CONFIG_FORTIFY_SOURCE has been doing runtime
bounds checking on memcpy() calls. Because these types had no indicator
that they might use more space than their simple definition,
__fortify_memcpy_chk will frequently complain about overruns eg:

    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)
    memcpy: detected field-spanning write (size 7) of single field
        "lr + 1" at zfs_log.c:425 (size 0)
    memcpy: detected field-spanning write (size 9) of single field
        "(char *)(lr + 1)" at zfs_log.c:593 (size 0)
    memcpy: detected field-spanning write (size 4) of single field
        "(char *)(lr + 1) + snamesize" at zfs_log.c:594 (size 0)

To fix this, this commit adds flex array fields to all lr_XX_t structs
that require them, and then uses those fields to access that
end-of-struct area rather than more complicated casts and pointer
addition.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16501
Closes #16539
2024-09-27 09:18:11 -07:00
tstabrawa
b052035990
Avoid BUG in migrate_folio_extra
Linux page migration code won't wait for writeback to complete unless
it needs to call release_folio.  Call SetPagePrivate wherever
PageUptodate is set and define .release_folio, to cause
fallback_migrate_folio to wait for us.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tstabrawa <59430211+tstabrawa@users.noreply.github.com>
Closes #15140
Closes #16568
2024-09-26 08:57:09 -07:00
Alexander Motin
48d1be254f
Properly release key in spa_keystore_dsl_key_hold_dd()
Since dsl_crypto_key_open() references the key, 0d23f5e2e4 should
have called dsl_crypto_key_rele() to drop it first instead of
calling dsl_crypto_key_free() directly.  The final result should
actually be the same, but without triggering dck_holds assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16567
2024-09-25 07:40:17 -07:00
Alexander Motin
832f66b218
FreeBSD: Sync taskq_cancel_id() returns with Linux
Couple places in the code depend on 0 returned only if the task was
actually cancelled.  Doing otherwise could lead to extra references
being dropped.  The race could be small, but I believe CI hit it
from time to time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16565
2024-09-24 16:29:18 -07:00
w0xel
ccc420acd5
Add missing guard defines for simd_stat
This adds the HAVE_KERNEL_NEON and HAVE_KERNEL_FPU_INTERNAL
guards to simd_stat.c defaulted to 0 to make it build again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Shengqi Chen <harry-chen@outlook.com>
Signed-off-by: Sebastian Wuerl <s.wuerl@mailbox.org>
Closes #16558
2024-09-24 09:07:26 -07:00
Theera K.
d40d40913d
Evicting too many bytes from MFU metadata
Without updating 'm' we evict from MFU metadata all that we wanted
to evict from all metadata, including already evicted MRU metadata
('m' is the total amount of metadata we had at the beginning,
and 'w' is the total amount of metadata we want to have). 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Theera K. <tkittich@hotmail.com>
Closes #16521
Closes #16546
2024-09-23 22:12:56 -07:00
Rob Norris
78e9e987e1 linux: log a scary warning when used with an experimental kernel
Since the person using the kernel may not be the person who built it,
show a warning at module load too, in case they aren't aware that it
might be weird.

Reviewed-by: Robert Evans <evansr@google.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #15986
2024-09-23 10:44:54 -07:00
George Melikov
e419a63bf4
xattr dataset prop: change defaults to sa
It's the main recommendation to set xattr=sa
even in man pages, so let's set it by default.

xattr=sa don't use feature flag, so in the worst
case we'll have non-readable xattrs by other
non-openzfs platforms.

Non-overridden default `xattr` prop of existing pools
will automatically use `sa` after this commit too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #15147
2024-09-23 09:50:48 -07:00
Rich Ercolani
1d84c9eb66
Fix /proc/spl/kstat/simd on x86
Evidently while reworking it on aarch64, I broke it on x86 and
didn't notice.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #16556
2024-09-22 13:11:19 -07:00
Rob Norris
80645d6582
FreeBSD: restore zfs_znode_update_vfs()
I accidentally removed this in c22d56e3e, and didn't notice because it
doesn't fail the build, but does fail to load into the kernel because it
can't link it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16554
2024-09-21 10:03:54 -07:00
Brian Behlendorf
f9d4f1b480
Add SIMD metadata in /proc on Linux follow up
This change accidentally broke the FreeBSD build due to
a conflict between the simd_stat_init()/simd_stat_fini()
macros on FreeBSD and the extern function prototype.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16552
2024-09-20 15:48:12 -07:00
Rich Ercolani
5d01243964
Add SIMD metadata in /proc on Linux
Too many times, people's performance problems have amounted to
"somehow your SIMD support isn't working", and determining that
at runtime is difficult to describe to people.

This adds a /proc/spl/kstat/zfs/simd node, which exposes
metadata about which instructions ZFS thinks it can use,
on AArch64 and x86_64 Linux, to make investigating things
like this much easier.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #16530
2024-09-20 08:16:44 -07:00
George Melikov
01852ffbf8 arc_hdr_authenticate: make explicit error
On compression we could be more explicit here for cases
where we can not recompress the data.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:25:02 -07:00
George Melikov
b32d48a625 ZLE compression: don't use BPE_PAYLOAD_SIZE
ZLE compressor needs additional bytes to process
d_len argument efficiently.
Don't use BPE_PAYLOAD_SIZE as d_len with it
before we rework zle compressor somehow.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:24:51 -07:00
George Melikov
522f2629c8 zio_compress: introduce max size threshold
Now default compression is lz4, which can stop
compression process by itself on incompressible data.
If there are additional size checks -
we will only make our compressratio worse.

New usable compression thresholds are:
- less than BPE_PAYLOAD_SIZE (embedded_data feature);
- at least one saved sector.

Old 12.5% threshold is left to minimize affect
on existing user expectations of CPU utilization.

If data wasn't compressed - it will be saved as
ZIO_COMPRESS_OFF, so if we really need to recompress
data without ashift info and check anything -
we can just compress it with zero threshold.
So, we don't need a new feature flag here!

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #9416
2024-09-19 17:23:58 -07:00
Rob Norris
e8ede2ba78 zfs_debug: specific variant for userspace
Just nice and simple, with room to grow.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:49:50 -07:00
Rob Norris
c22d56e3ed zfs_znode: lift common code to a single shared file
For now, userspace has no znode implementation. Some of the property and
path handling code is used there though and is the same on all
platforms, so we only need a single copy of it.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:49:45 -07:00
Rob Norris
8fc0beb66b arc_os: split userspace and Linux kernel code
The Linux arc_os.c carries userspace and kernel code, with very little
overlap between the two. This lifts the userspace parts out into a
separate arc_os.c for libzpool and removes it from the Linux side.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16492
2024-09-19 15:48:54 -07:00
Rob Norris
b7e43d6e7f linux/abd_os: remove kernel version check for compound page support
All kernels we support have compound pages that work the way we would
like. However, this code is new and this knowledge was hard won, so I'd
like to leave the description and option there for a little while, even
if it can only be disabled with a recompile.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16545
2024-09-19 15:45:05 -07:00
Rob Norris
a83762b3f4 linux: remove kernel version checks for unsupported kernels
Following 2b069768a (#16479), anything gated on a kernel version before
4.18 can be always included/excluded.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16545
2024-09-19 15:43:44 -07:00
Shengqi Chen
a877b39624 cityhash: replace invocations with specialized versions when possible
So that we can get actual benefit from last commit.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:19:17 -07:00
Shengqi Chen
0ae4460c61 zcommon: add specialized versions of cityhash4
Specializing cityhash4 on 32-bit architectures can reduce the size
of stack frames as well as instruction count. This is a tiny but
useful optimization, since some callers invoke it frequently.

When specializing into 1/2/3/4-arg versions, the stack usage
(in bytes) on some 32-bit arches are listed as follows:

- x86: 32, 32, 32, 40
- arm-v7a: 20, 20, 28, 36
- riscv: 0, 0, 0, 16
- power: 16, 16, 16, 32
- mipsel: 8, 8, 8, 24

And each actual argument (even if passing 0) contributes evenly
to the number of multiplication instructions generated:

- x86: 9, 12, 15 ,18
- arm-v7a: 6, 8, 10, 12
- riscv / power: 12, 18, 20, 24
- mipsel: 9, 12, 15, 19

On 64-bit architectures, the tendencies are similar. But both stack
sizes and instruction counts are significantly smaller thus negligible.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:18:59 -07:00
Shengqi Chen
1c35206124 dmu_objset: replace dnode_hash impl with cityhash4
As mentioned in PR #16131, replacing CRC-based hash with cityhash4
could slightly improve the performance by eliminating memory access.
Replacing algorightm is safe since the hash result is not persisted.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Closes #16131
Closes #16483
2024-09-19 15:18:12 -07:00
Rob Norris
f245541e24 zfs_file: implement zfs_file_deallocate for FreeBSD 14
FreeBSD 14 gained a `VOP_DEALLOCATE` VFS operation and a `fspacectl`
syscall to use it. At minimum, these zero the given region, and if the
underlying filesystem supports it, can make the region sparse. We can
use this to get TRIM-like behaviour for file vdevs.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16496
2024-09-18 11:35:48 -07:00
Rob Norris
fa330646b9 zfs_file: rename zfs_file_fallocate to zfs_file_deallocate
We only use it on a specific way: to punch a hole in (make sparse) a
region of a file, in order to implement TRIM-like behaviour.

So, call the op "deallocate", and move the Linux-style mode flags down
into the Linux implementation, since they're an implementation detail.

FreeBSD gets a no-op stub (for the moment).

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16496
2024-09-18 11:35:04 -07:00
Rob Norris
5df65ca9c1 config: remove HAVE_GET_USER_PAGES_*
get_user_pages_unlocked() had stabilised by 4.9.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00
Rob Norris
c57d268a78 config: remove HAVE_HAS_CAPABILITY
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16479
2024-09-18 11:23:51 -07:00