Commit Graph

9682 Commits

Author SHA1 Message Date
Alexander Motin
ff6266ee9b
Fix use-afer-free regression in RAIDZ expansion
We should not dereference rra after the last zio_nowait() is called.
It seems very unlikely, but ASAN in ztest managed to catch it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16868
2024-12-14 14:02:11 -08:00
kotauskas
586304ac44
Remount datasets on soft-reboot
The one-shot zfs-mount.service is incorrectly deemed active by 
Systemd after a systemctl soft-reboot. As such, soft-rebooting
prevents zfs mount -a from being ran automatically.

This commit makes it so that zfs-mount.service is marked as being 
undone by the time umount.target is reached, so that zfs.target then 
pulls it in again and gets it restarted after a soft reboot.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: kotauskas <v.toncharov@gmail.com>
Closes #16845
2024-12-13 13:50:50 -08:00
Rob Norris
46e06feded flush: only detect lack of flush support in one place
It seems there's no good reason for vdev_disk & vdev_geom to explicitly
detect no support for flush and set vdev_nowritecache.  Instead, just
signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess()
take care of it in one place.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16855
2024-12-13 12:19:54 -08:00
Rob Norris
fbea92432a flush: don't report flush error when disabling flush support
The first time a device returns ENOTSUP in repsonse to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend the succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!

So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16855
2024-12-13 12:19:20 -08:00
Poscat
76f57ab9f7
build: use correct bashcompletiondir on arch
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: poscat <poscat@poscat.moe>
Closes #16861
2024-12-13 12:06:40 -08:00
Rob Norris
ecc0970e3e
backtrace: fix off-by-one on string output
sizeof("foo") includes the trailing null byte, so all the output had
nulls through it. Most terminals quietly ignore it, but it makes some
tools misdetect file types and other annoyances.

Easy fix: subtract 1.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16862
2024-12-13 10:12:14 -08:00
Chunwei Chen
6c9b4f18d3
Fix DR_OVERRIDDEN use-after-free race in dbuf_sync_leaf
In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db
except for overridden case. However, this exception causes a race where
dbuf_new_size could free the arc_buf after the last dereference of
*datap and causes use-after-free. We fix this by cloning the buf
regardless if it's overridden.

The race:
--
P0                                     P1

                                       dbuf_hold_impl()
                                         // dbuf_hold_copy passed
                                         // because db_data_pending NULL

dbuf_sync_leaf()
  // doesn't clone *datap
  // *datap derefed to db_buf
  dbuf_write(*datap)

                                       dbuf_new_size()
                                         dmu_buf_will_dirty()
                                           dbuf_fix_old_data()
                                             // alloc new buf for P0 dr
                                             // but can't change *datap

                                         arc_alloc_buf()
                                         arc_buf_destroy()
                                           // alloc new buf for db_buf
                                           // and destroy old buf

  dbuf_write() // continue
    abd_get_from_buf(data->b_data,
    arc_buf_size(data))
      // use-after-free
--

Here's an example when it happens:

BUG: kernel NULL pointer dereference, address: 000000000000002e
RIP: 0010:arc_buf_size+0x1c/0x30 [zfs]
Call Trace:
 dbuf_write+0x3ff/0x580 [zfs]
 dbuf_sync_leaf+0x13c/0x530 [zfs]
 dbuf_sync_list+0xbf/0x120 [zfs]
 dnode_sync+0x3ea/0x7a0 [zfs]
 sync_dnodes_task+0x71/0xa0 [zfs]
 taskq_thread+0x2b8/0x4e0 [spl]
 kthread+0x112/0x130
 ret_from_fork+0x1f/0x30

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Co-authored-by: Chunwei Chen <david.chen@nutanix.com>
Closes #16854
2024-12-12 16:18:45 -08:00
Alexander Motin
19a04e5ad2
BRT: Check bv_mos_entries in brt_entry_lookup()
When vdev first sees some block cloning, there is a window when
brt_maybe_exists() might already return true since something was
cloned, but bv_mos_entries is still 0 since BRT ZAP was not yet
created.  In such case we should not try to look into the ZAP
and dereference NULL bv_mos_entries_dnode.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16851
2024-12-12 10:22:41 -08:00
Rob Norris
e0039c7057 Remove unnecessary CSTYLED escapes on top-level macro invocations
cstyle can handle these cases now, so we don't need to disable it.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 08:53:57 -08:00
Rob Norris
0de8ae56f7 cstyle: ignore old non-POSIX types in macro invocations
In code generation macros, we often use names like `uint` when
constructing handler functions. These are not being used as types, so
exclude them from the admonishment to use POSIX type names.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 08:53:53 -08:00
Rob Norris
ba00a6f9a3 cstyle: understand macro params can be empty
It's not uncommon to have empty parameters in code generator macros,
usually when multiple parameters are concatenated or stringified into a
single token or literal. So, exclude the space-before-comma check, which
will allow construction like `MACRO_CALL(foo, , baz)`.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 08:53:36 -08:00
Rob Norris
903895ea5f cstyle: understand basic top-level macro invocations
We quite often invoke macros outside of functions, usually to generate
functions or data. cstyle inteprets these as function headers, which at
least have opinions for indenting.

This introduces a separate state for top-level macro invocations, and
excludes it from matching functions. For the moment, most of the
existing rules will continue to apply, but this gives us a way to add or
removes rules targeting macros specifically.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #16840
2024-12-06 08:53:12 -08:00
Alexander Motin
a44eaf1690
Optimize RAIDZ expansion
- Instead of copying one ashift-sized block per ZIO, copy as much
as we have contiguous data up to 16MB per old vdev.  To avoid data
moves use gang ABDs, so that read ZIOs can directly fill buffers
for write ZIOs.  ABDs have much smaller overhead than ZIOs in both
memory usage and processing time, plus big I/Os do not depend on
I/O aggregation and scheduling to reach decent performance on HDDs.
 - Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms.
 - Use 32bit range tree when possible (practically always now) to
slightly reduce memory usage.
 - Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as
for main ones.
 - Fix rate overflows in `zpool status` reporting.

With these changes expanding RAIDZ1 from 4 to 5 children I am able
to reach 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both are
limited by devices instead of CPU.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15680
Closes #16819
2024-12-06 08:50:16 -08:00
Alexander Motin
e8b333e4d3
Fix false assertion in dmu_tx_dirty_buf() on cloning
Same as writes block cloning can increase block size and number of
indirection levels.  That means it can dirty block 0 at level 0 or
at new top indirection level without explicitly holding them.

A block cloning test case for large offsets has been added.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16825
2024-12-05 11:48:08 -08:00
Don Brady
44446dccdb
During pool export flush the ARC asynchronously
This also includes removing L2 vdevs asynchronously.

This commit also guarantees that spa_load_guid is unique.

The zpool reguid feature introduced the spa_load_guid, which is a
transient value used for runtime identification purposes in the ARC.
This value is not the same as the spa's persistent pool guid.

However, the value is seeded from spa_generate_load_guid() which
does not check for uniqueness against the spa_load_guid from other
pools.  Although extremely rare, you can end up with two different
pools sharing the same spa_load_guid value! So we guarantee that
the value is always unique and additionally not still in use by an
async arc flush task.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16215
2024-12-05 08:58:20 -08:00
Rob Norris
2507db612d
zdb_il: use flex array member to access ZIL records
In 6f50f8e16 we added flex arrays to lr_XX_t structs to silence kernel
bounds check warnings. Userspace code was mostly not updated to use them
though.

It seems that in the right circumstances, compilers can get confused
about sizes in the same way, and throw warnings. This commits switch
those uses over to use the flex array fields also.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #16832
2024-12-04 19:03:20 -05:00
Alexander Motin
a01504b35c
Improve speculative prefetcher for block cloning
- Issue prescient prefetches for demand indirect blocks after the
first one.  It should be quite rare for reads/writes, but much more
useful for cloning due to much bigger (up to 1022 blocks) accesses.
It covers the gap during the first couple accesses when we can not
speculate yet, but we know what is needed right now.  It reduces
dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode().
 - Increase maximum prefetch distance for indirect blocks from 64
to 128MB.  It should cover the maximum 1022 blocks of block cloning
access size in case of default 128KB recordsize used.  In case of
bigger recordsize the above prescient prefetch should also help.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16814
2024-12-04 15:19:05 -08:00
Alexander Motin
c33a55b0c2
Allow dsl_deadlist_open() return errors
In some cases like dsl_dataset_hold_obj() it is possible to handle
those errors, so failure to hold dataset should be better than
kernel panic.  Some other places where these errors are still not
handled but asserted should be less dangerous just as unreachable.

We have a user report about pool corruption leading to assertions
on these errors.  Hopefully this will make behavior a bit nicer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16836
2024-12-04 15:15:58 -08:00
Mark Johnston
0e020bf3e1
FreeBSD: Remove an incorrect assertion in zfs_getpages()
The pages in the array may become valid after this initial unbusying,
so the assertion only holds during the first iteration of the outer
loop.

Later in zfs_getpages(), the dmu_read_pages() loop handles already-valid
pages.  Just drop the assertion, it's not terribly useful.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reported-by: Peter Holm <pho@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Sponsored-by: Klara, Inc.
Closes #16810
Closes #16834
2024-12-04 14:24:50 -05:00
Mariusz Zaborski
4b4e346b9f
Add ability to scrub from last scrubbed txg
Some users might want to scrub only new data because they would like
to know if the new write wasn't corrupted.  This PR adds possibility
scrub only newly written data.

This introduces new `last_scrubbed_txg` property, indicating the
transaction group (TXG) up to which the most recent scrub operation
has checked and repaired the dataset, so users can run scrub only
from the last saved point. We use a scn_max_txg and scn_min_txg
which are already built into scrub, to accomplish that.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
Sponsored-By: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Closes #16301
2024-12-04 14:21:45 -05:00
shodanshok
1cd2419ece
Fix race in libzfs_run_process_impl
When replacing a disk, a child process is forked to run a script called
zfs_prepare_disk (which can be useful for disk firmware update or health
check). The parent than calls waitpid and checks the child error/status
code.

However, the _reap_children thread (created from zed_exec_process to
manage zedlets) also waits for all children with the same PGID and can
stole the signal, causing the replace operation to be aborted.

As waitpid returns -1, the parent incorrectly assume that the child
process had an error or was killed. This, in turn, leaves the newly
added disk in REMOVED or UNAVAIL status rather than completing the
replace process.

This patch changes the PGID of the child process execuing the
prepare script, shielding it from the _reap_children thread.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #16801
2024-12-04 05:36:10 -05:00
Alexander Motin
654ade8ca2 FreeBSD: Remove some illumos compat from vnode.h
Should make no difference, just some dead code cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by:Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16808
2024-12-03 09:32:54 -08:00
Alexander Motin
ae00c807dc FreeBSD: Return ifndef IN_BASE back to fix the build
FreeBSD's libprocstat seems to build kernel code in user space,
which does not work here due to undefined vnode_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by:Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16808
2024-12-03 09:32:07 -08:00
Rob Norris
027b3e06ed
zinject(8): rename "ioctl" to "flush"
Doc bug missed in d7605ae77.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16827
2024-12-02 17:21:55 -08:00
Alexander Motin
6e3c109bc0
Fix regression in dmu_buf_will_fill()
Direct I/O implementation added condition to call dbuf_undirty()
only in case of block cloning.  But the condition is not right if
the block is no longer dirty in this TXG, but still in DB_NOFILL
state.  It resulted in block not reverting to DB_UNCACHED and
following NULL de-reference on attempt to access absent db_data.

While there, add assertions for db_data to make debugging easier.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16829
2024-12-02 17:08:40 -08:00
Alexander Motin
3e9ba0f223
Add missing parenthesis in VERIFYF()
Without them the order of operations might get unexpected.

Reviewed-by: Adam Moss <c@yotes.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16826
2024-12-02 16:55:51 -08:00
Pavel Snajdr
c8a326aab7
Linux: fix zfs_uio_dio_check_for_zero_page
The intent here is to replace the zero page pointer in the array of
pointers to pages in the struct.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #16812 
Closes #16689
Closes #16642
2024-12-02 16:25:35 -08:00
Ivan Volosyuk
f29dcc25c7
Linux: Fix detection of register_sysctl_sz
Adjust the m4 function to mimic sentinel we use in spl-proc.c
This fixes the detection on kernels compiled with CONFIG_RANDSTRUCT=y

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com>
Closes: #16620
Closes: #16805
2024-11-29 22:50:37 -05:00
Rob Norris
0ffa6f3464
zdb: show dedup table and log attributes
There's interesting info in there that is going to help with
understanding dedup behavior at any given moment.

Since this is a format change, tests that rely on that output have been
modified to match.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #16755
2024-11-26 19:28:19 -05:00
Pavel Snajdr
d2b0ca953f
Assert if we're logging after final txg was set
This allowed to debug #16714, fixed in #16782.  Without assertions
added here it is difficult to figure out what logs cause the problem,
since the assertion happens in sync thread context.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Closes #16795
2024-11-25 18:37:56 -05:00
Alexander Motin
d0a91b9f88
FreeBSD: Reduce copy_file_range() source lock to shared
Linux locks copy_file_range() source as shared.  FreeBSD was doing
it also, but then was changed to exclusive, partially because KPI
of that time was doing so, and partially seems out of caution.
Considering zfs_clone_range() uses range locks on both source and
destination, neither should require exclusive vnode locks. But one
step at a time, just sync it with Linux for now.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16789
Closes #16797
2024-11-23 14:29:03 -08:00
Alexander Motin
b3b0ce64d5
FreeBSD: Lock vnode in zfs_ioctl()
Previously vnode was not locked there, unlike Linux.  It required
locking it in vn_flush_cached_data(), which recursed on the lock
if called from zfs_clone_range(), having the vnode locked.

Reviewed-by: Alan Somers <asomers@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16789
Closes #16796
2024-11-23 14:26:52 -08:00
Pavel Snajdr
38c0324c0f
Linux: Fix zfs_prune panics
by protecting against sb->s_shrink eviction on umount with newer kernels

deactivate_locked_super calls shrinker_free and only then
sops->kill_sb cb, resulting in UAF on umount when trying
to reach for the shrinker functions in zpl_prune_sb of
in-umount dataset

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Closes #16770
2024-11-21 15:30:43 -08:00
Alexander Motin
ae1d11882d
BRT: Clear bv_entcount_dirty on destroy
This fixes assertion in brt_sync_table() on debug builds when last
cloned block on the vdev is freed and bv_meta_dirty is cleared,
while bv_entcount_dirty is not.  Should not matter in production.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16791
2024-11-21 08:20:22 -08:00
Brian Behlendorf
225e76cd7d
Linux 6.12 compat: META
Update the META file to reflect compatibility with the 6.12 kernel.

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16793
2024-11-21 07:56:30 -08:00
Alexander Motin
9a81484e35
ZAP: Reduce leaf array and free chunks fragmentation
Previous implementation of zap_leaf_array_free() put chunks on the
free list in reverse order.  Also zap_leaf_transfer_entry() and
zap_entry_remove() were freeing name and value arrays in reverse
order.  Together this created a mess in the free list, making
following allocations much more fragmented than necessary.

This patch re-implements zap_leaf_array_free() to keep existing
chunks order, and implements non-destructive zap_leaf_array_copy()
to be used in zap_leaf_transfer_entry() to allow properly ordered
freeing name and value arrays there and in zap_entry_remove().

With this change test of some writes and deletes shows percent of
non-contiguous chunks in DDT reducing from 61% and 47% to 0% and
17% for arrays and frees respectively.  Sure some explicit sorting
could do even better, especially for ZAPs with variable-size arrays,
but it would also cost much more, while this should be very cheap.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16766
2024-11-20 13:37:52 -08:00
tleydxdy
d02257c280
fix: block incompatible kernel from being installed
The current "Requires" lines only ensure the old kernel is
available on the system but it does not prevent fedora from
updating to an incompatible and breaking user's system.

Set Conflicts to block incompatible kernels from being installed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: tleydxdy <shironeko.github@tesaguri.club>
Closes #16139
2024-11-20 13:19:07 -08:00
Mark Johnston
d76d79fd27
zio: Avoid sleeping in the I/O path
zio_delay_interrupt(), apparently used for fault injection, is executed
in the I/O pipeline.  It can cause the calling thread to go to sleep,
which is not allowed on FreeBSD.  This happens only for small delays,
though, and there's no apparent reason to avoid deferring to a taskqueue
in that case, as it already does otherwise.

Simply go to sleep unconditionally.  This fixes an occasional panic I
see when running the ZTS on FreeBSD.  Also remove an unhelpful comment
referencing the non-existent timeout_generic().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16785
2024-11-20 08:23:08 -08:00
Alexander Motin
457f8b76e7
BRT: More optimizations after per-vdev splitting
- With both pending and current AVL-trees being per-vdev and having
effectively identical comparison functions (pending tree compared
also birth time, but I don't believe it is possible for them to be
different for the same offset within one transaction group), it
makes no sense to move entries from one to another.  Instead inline
dramatically simplified brt_entry_addref() into brt_pending_apply().
It no longer requires bv_lock, since there is nothing concurrent
to it at the time.  And it does not need to search the tree for the
previous entries, since it is the same tree, we already have the
entry and we know it is unique.
 - Put brt_vdev_lookup() and brt_vdev_addref() into different tree
traversals to avoid false positives in the first due to the second
entcount modifications.  It saves dramatic amount of time when a
file cloned first time by not looking for non-existent ZAP entries.
 - Remove avl_is_empty(bv_tree) check from brt_maybe_exists().  I
don't think it is needed, since by the time all added entries are
already accounted in bv_entcount. The extra check must be producing
too many false positives for no reason.  Also we don't need bv_lock
there, since bv_entcount pointer must be table at this point, and
we don't care about false positive races here, while false negative
should be impossible, since all brt_vdev_addref() have already
completed by this point.  This dramatically reduces lock contention
on massive deletes of cloned blocks.  The only remaining one is
between multiple parallel free threads calling brt_entry_decref().
 - Do not update ZAP if net change for a block over the TXG was 0.
In combination with above it makes file move between datasets as
cheap operation as originally intended if it fits into one TXG.
 - Do not allocate vdevs on pool creation or import if it did not
have active block cloning. This allows to save a bit in few cases.
 - While here, add proper error handling in brt_load() on pool
import instead of assertions.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16773
2024-11-20 06:20:14 -08:00
Brian Behlendorf
49a377aa30
ZTS: Fix zpool_status_008_pos false positive
Increase the injected delay to 1000ms and the ZIO_SLOW_IO_MS threshold
to 750ms to avoid false positives due to unrelated slow IOs which may
occur in the CI environment.  Additionally, clear the fault injection as
soon as it is no longer required for the test case.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #16769
2024-11-20 06:13:32 -08:00
Alexander Motin
0ca82c5680
L2ARC: Stop rebuild before setting spa_final_txg
Without doing that there is a race window on export when history
log write by completed rebuild dirties transaction beyond final,
triggering assertion.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16714
Closes #16782
2024-11-20 06:11:51 -08:00
Alexander Motin
534688948c
Remove hash_elements_max accounting from DBUF and ARC
Those values require global atomics to get current hash_elements
values in few of the hottest code paths, while in all the years I
never cared about it.  If somebody wants, it should be easy to
get it by periodic sampling, since neither ARC header nor DBUF
counts change so fast that it would be difficult to catch.

For now I've left hash_elements_max kstat for ARC, since it was
used/reported by arc_summary and it would break older versions,
but now it just reports the current value.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #16759
2024-11-19 07:00:16 -08:00
Rob Norris
ffe2112795
Move "no name changes" from compression to checksum table
Compression names actually aren't used in dedup table names, but
checksum names are.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16776
2024-11-19 06:55:27 -08:00
Steve Mokris
e08e832b10
Expand zpool-remove.8 manpage with example results
Also fix comment cross-referencing to zpool.8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Steve Mokris <smokris@softpixel.com>
Closes #16777
2024-11-19 06:52:04 -08:00
Alexander Motin
0d6306be8c
Fix few __VA_ARGS typos in assertions
It should be __VA_ARGS__, not __VA_ARGS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16780
2024-11-19 06:48:06 -08:00
Ameer Hamza
ff3df1211c
zed: prevent automatic replacement of offline vdevs
When an OFFLINE device is physically removed, a spare is automatically
activated. However, this behavior differs in FreeBSD, where we do not
transition from OFFLINE state to REMOVED.
Our support team has encountered cases where customers experienced
unexpected behavior during drive replacements, with multiple spares
activating for the same VDEV due to a single disk replacement. This
patch ensures that a drive in an OFFLINE state remains in that state,
preventing it from transitioning to REMOVED and being automatically
replaced by a spare.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16751
2024-11-15 15:08:16 -08:00
Alexander Motin
fd6e8c1d2a BRT: Rework structures and locks to be per-vdev
While block cloning operation from the beginning was made per-vdev,
before this change most of its data were protected by two pool-
wide locks.  It created lots of lock contention in many workload.

This change makes most of block cloning data structures per-vdev,
which allows to lock them separately.  The only pool-wide lock now
it spa_brt_lock, protecting array of per-vdev pointers and in most
cases taken as reader.  Also this splits per-vdev locks into three
different ones: bv_pending_lock protects the AVL-tree of pending
operations in open context, bv_mos_entries_lock protects BRT ZAP
object from while being prefetched, and bv_lock protects the rest
of per-vdev context during TXG commit process.  There should be
no functional difference aside of some optimizations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:04:11 -08:00
Alexander Motin
309ce6303f ZAP: Add by_dnode variants to lookup/prefetch_uint64
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:04:02 -08:00
Alexander Motin
1ee251bdde BRT: Don't call brt_pending_remove() on holes/embedded
We are doing exactly the same checks around all brt_pending_add().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:03:57 -08:00
Alexander Motin
483087b06f ZTS: Avoid embedded blocks in bclone/bclone_prop_sync
If we write less than 113 bytes with enabled compression we get
embeded block, which then fails check for number of cloned blocks
in bclone_test.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
2024-11-15 15:02:44 -08:00