mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs synced 2025-04-29 04:09:43 +00:00

Author	SHA1	Message	Date
Alexander Motin	6a2f7b3844	Fix metaslab group fragmentation math (#17037 ) Since we are calculating a free space fragmentation, we should weight metaslabs by the amount of their free space, not a full size. Fragmentation of full metaslabs may not matter in presence empty ones. The old algorithm did not differentiate metaslabs having only one free 4KB block from metaslabs having 50% of space free in 4KB blocks, reporting higher fragmentation. While there, move metaslab_group_alloc_update() call after setting mg_fragmentation, otherwise the effect may be delayed by one TXG. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-02-18 10:45:42 -08:00
Rob Norris	68473c4fd8	range_tree: convert remaining range_* defs to zfs_range_* Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-02-14 15:37:56 -08:00
Ivan Volosyuk	d4a5a7e3aa	Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_* Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols. Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-02-14 15:37:48 -08:00
Rob Norris	b8c73ab780	zio: do no-op injections just before handing off to vdevs The purpose of no-op is to simulate a failure between a device cache and its permanent store. We still want it to go through the queue and respond in the same way to everything else. So, inject "success" as the very last thing, and then move on to VDEV_IO_DONE to be dequeued and so any followup work can occur. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17029	2025-02-07 20:42:24 -05:00
Paul Dagnelie	88020b993c	Add kstats tracking gang allocations Gang blocks have a significant impact on the long and short term performance of a zpool, but there is not a lot of observability into whether they're being used. This change adds gang-specific kstats to ZFS, to better allow users to see whether ganging is happening. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17003	2025-02-06 15:42:50 -05:00
Paul Dagnelie	40496514b8	Expand fragmentation table to reflect larger possibile allocation sizes When you are using large recordsizes in conjunction with raidz, with incompressible data, you can pretty reliably be making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslabs with 16 MB free chunks completely unfragmented, so you can have a metaslab report 0% fragmented and be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance. We expand the fragmentation table to extend up to 512MB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #16986	2025-02-06 15:40:01 -05:00
Rob Norris	390f6c1190	zio: lock parent zios when updating wait counts on reexecute As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17016	2025-02-04 08:47:50 -05:00
Jaydeep Kshirsagar	21205f6488	Avoid ARC buffer transfrom operations in prefetch This change will prevent prefetch to perform unnecessary ARC buffer fill when reading from disk. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Closes #17013	2025-02-01 11:15:24 -05:00
Alan Somers	12f0baf348	Make the vfs.zfs.vdev.raidz_impl sysctl cross-platform Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Sponsored by: ConnectWise Closes #16980	2025-01-29 09:18:09 -05:00
Rob Norris	26e38aec46	zinject: add "probe" device injection type Injecting a device probe failure is not possible by matching IO types, because probe IO goes to the label regions, which is explicitly excluded from injection. Even if it were possible, it would be awkward to do, because a probe is sequence of reads and writes. This commit adds a new IO "type" to match for injection, which looks for the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the injection record and recieve the wanted error. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16947	2025-01-22 16:13:21 -08:00
Rob Norris	dfdc5ea993	zinject: make iotype extendable I'm about to add a new "type", and I need somewhere to put it! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16947	2025-01-22 16:12:31 -08:00
Rob Norris	2aa3fbe761	zinject: count matches and injections for each handler When building tests with zinject, it can be quite difficult to work out if you're producing the right kind of IO to match the rules you've set up. So, here we extend injection records to count the number of times a handler matched the operation, and how often an error was actually injected (ie after frequency and other exclusions are applied). Then, display those counts in the `zinject` output. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #16938	2025-01-13 08:33:31 -05:00
Don Brady	939e0237c5	Too many vdev probe errors should suspend pool Similar to what we saw in #16569, we need to consider that a replacing vdev should not be considered as fully contributing to the redundancy of a raidz vdev even though current IO has enough redundancy. When a failed vdev_probe() is faulting a disk, it now checks if that disk is required, and if so it suspends the pool until the admin can return the missing disks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16864	2025-01-04 10:28:33 -08:00
Rob Norris	c02e1cf055	vdev_open: clear async remove flag after reopen It's possible for a vdev to be flagged for async remove after the pool has suspended. If the removed device has been returned when the pool is resumed, the ASYNC_REMOVE task will still run at the end of txg, and remove the device from the pool again. To fix, we clear the async remove flag at reopen, just as we did for the async fault flag in `5de3ac223`. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16921	2025-01-03 14:42:06 -08:00
Ameer Hamza	9dd5fe1095	zvol: implement platform-independent part of block cloning In Linux, block devices currently lack support for `copy_file_range` API because the kernel does not provide the necessary functionality. However, there is an ongoing upstream effort to address this limitation: https://patchwork.kernel.org/project/dm-devel/cover/20240520102033.9361-1-nj.shetty@samsung.com/. We have adopted this upstream kernel patch into the TrueNAS kernel and made some additional modifications to enable block cloning specifically for the zvol block device. This patch implements the platform- independent portions of these changes for inclusion in OpenZFS. This patch does not introduce any new functionality directly into OpenZFS. The `TX_CLONE_RANGE` replay capability is only relevant when zvols are migrated to non-TrueNAS systems that support Clone Range replay in the ZIL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #16901	2024-12-29 11:41:30 -08:00
Rob Norris	03b7cfdef3	spa_sync_props: remove pool userprops by setting empty-string People have noted there's no way to remove a pool userprop, only zero it. Turns vdev userprops had a method, by setting empty-string. So this makes pool userprops follow the same behaviour. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16887	2024-12-29 11:12:04 -08:00
Rob Norris	c37a2ddaaa	microzap: set hard upper limit of 1M The count of chunks in a microzap block is stored as an uint16_t (mze_chunkid). Each chunk is 64 bytes, and the first is used to store a header, so there are 32767 usable chunks, which is just under 2M. 1M is the largest power-2-rounded block size under 2M, so we must set the limit there. If it goes higher, the loop in mzap_addent can overflow and fall into the PANIC case. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16888	2024-12-26 17:10:09 -05:00
Alexander Motin	1acd246964	Fix readonly check for vdev user properties VDEV_PROP_USERPROP is equal do VDEV_PROP_INVAL and so is not a real property. That's why vdev_prop_readonly() does not work right for it. In particular it may declare all vdev user properties readonly on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16890	2024-12-20 17:25:35 -05:00
Alexander Motin	ff6266ee9b	Fix use-afer-free regression in RAIDZ expansion We should not dereference rra after the last zio_nowait() is called. It seems very unlikely, but ASAN in ztest managed to catch it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16868	2024-12-14 14:02:11 -08:00
Rob Norris	46e06feded	flush: only detect lack of flush support in one place It seems there's no good reason for vdev_disk & vdev_geom to explicitly detect no support for flush and set vdev_nowritecache. Instead, just signal it by setting the error to ENOTSUP, and let zio_vdev_io_assess() take care of it in one place. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16855	2024-12-13 12:19:54 -08:00
Rob Norris	fbea92432a	flush: don't report flush error when disabling flush support The first time a device returns ENOTSUP in repsonse to a flush request, we set vdev_nowritecache so we don't issue flushes in the future and instead just pretend the succeeded. However, we still return an error for the initial flush, even though we just decided such errors are meaningless! So, when setting vdev_nowritecache in response to a flush error, also reset the error code to assume success. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16855	2024-12-13 12:19:20 -08:00
Chunwei Chen	6c9b4f18d3	Fix DR_OVERRIDDEN use-after-free race in dbuf_sync_leaf In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db except for overridden case. However, this exception causes a race where dbuf_new_size could free the arc_buf after the last dereference of datap and causes use-after-free. We fix this by cloning the buf regardless if it's overridden. The race: -- P0 P1 dbuf_hold_impl() // dbuf_hold_copy passed // because db_data_pending NULL dbuf_sync_leaf() // doesn't clone datap // datap derefed to db_buf dbuf_write(datap) dbuf_new_size() dmu_buf_will_dirty() dbuf_fix_old_data() // alloc new buf for P0 dr // but can't change *datap arc_alloc_buf() arc_buf_destroy() // alloc new buf for db_buf // and destroy old buf dbuf_write() // continue abd_get_from_buf(data->b_data, arc_buf_size(data)) // use-after-free -- Here's an example when it happens: BUG: kernel NULL pointer dereference, address: 000000000000002e RIP: 0010:arc_buf_size+0x1c/0x30 [zfs] Call Trace: dbuf_write+0x3ff/0x580 [zfs] dbuf_sync_leaf+0x13c/0x530 [zfs] dbuf_sync_list+0xbf/0x120 [zfs] dnode_sync+0x3ea/0x7a0 [zfs] sync_dnodes_task+0x71/0xa0 [zfs] taskq_thread+0x2b8/0x4e0 [spl] kthread+0x112/0x130 ret_from_fork+0x1f/0x30 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Closes #16854	2024-12-12 16:18:45 -08:00
Alexander Motin	19a04e5ad2	BRT: Check bv_mos_entries in brt_entry_lookup() When vdev first sees some block cloning, there is a window when brt_maybe_exists() might already return true since something was cloned, but bv_mos_entries is still 0 since BRT ZAP was not yet created. In such case we should not try to look into the ZAP and dereference NULL bv_mos_entries_dnode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16851	2024-12-12 10:22:41 -08:00
Rob Norris	e0039c7057	Remove unnecessary CSTYLED escapes on top-level macro invocations cstyle can handle these cases now, so we don't need to disable it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16840	2024-12-06 08:53:57 -08:00
Alexander Motin	a44eaf1690	Optimize RAIDZ expansion - Instead of copying one ashift-sized block per ZIO, copy as much as we have contiguous data up to 16MB per old vdev. To avoid data moves use gang ABDs, so that read ZIOs can directly fill buffers for write ZIOs. ABDs have much smaller overhead than ZIOs in both memory usage and processing time, plus big I/Os do not depend on I/O aggregation and scheduling to reach decent performance on HDDs. - Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms. - Use 32bit range tree when possible (practically always now) to slightly reduce memory usage. - Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as for main ones. - Fix rate overflows in `zpool status` reporting. With these changes expanding RAIDZ1 from 4 to 5 children I am able to reach 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both are limited by devices instead of CPU. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15680 Closes #16819	2024-12-06 08:50:16 -08:00
Alexander Motin	e8b333e4d3	Fix false assertion in dmu_tx_dirty_buf() on cloning Same as writes block cloning can increase block size and number of indirection levels. That means it can dirty block 0 at level 0 or at new top indirection level without explicitly holding them. A block cloning test case for large offsets has been added. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16825	2024-12-05 11:48:08 -08:00
Don Brady	44446dccdb	During pool export flush the ARC asynchronously This also includes removing L2 vdevs asynchronously. This commit also guarantees that spa_load_guid is unique. The zpool reguid feature introduced the spa_load_guid, which is a transient value used for runtime identification purposes in the ARC. This value is not the same as the spa's persistent pool guid. However, the value is seeded from spa_generate_load_guid() which does not check for uniqueness against the spa_load_guid from other pools. Although extremely rare, you can end up with two different pools sharing the same spa_load_guid value! So we guarantee that the value is always unique and additionally not still in use by an async arc flush task. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Don Brady <don.brady@klarasystems.com> Closes #16215	2024-12-05 08:58:20 -08:00
Alexander Motin	a01504b35c	Improve speculative prefetcher for block cloning - Issue prescient prefetches for demand indirect blocks after the first one. It should be quite rare for reads/writes, but much more useful for cloning due to much bigger (up to 1022 blocks) accesses. It covers the gap during the first couple accesses when we can not speculate yet, but we know what is needed right now. It reduces dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode(). - Increase maximum prefetch distance for indirect blocks from 64 to 128MB. It should cover the maximum 1022 blocks of block cloning access size in case of default 128KB recordsize used. In case of bigger recordsize the above prescient prefetch should also help. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16814	2024-12-04 15:19:05 -08:00
Alexander Motin	c33a55b0c2	Allow dsl_deadlist_open() return errors In some cases like dsl_dataset_hold_obj() it is possible to handle those errors, so failure to hold dataset should be better than kernel panic. Some other places where these errors are still not handled but asserted should be less dangerous just as unreachable. We have a user report about pool corruption leading to assertions on these errors. Hopefully this will make behavior a bit nicer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16836	2024-12-04 15:15:58 -08:00
Mariusz Zaborski	4b4e346b9f	Add ability to scrub from last scrubbed txg Some users might want to scrub only new data because they would like to know if the new write wasn't corrupted. This PR adds possibility scrub only newly written data. This introduces new `last_scrubbed_txg` property, indicating the transaction group (TXG) up to which the most recent scrub operation has checked and repaired the dataset, so users can run scrub only from the last saved point. We use a scn_max_txg and scn_min_txg which are already built into scrub, to accomplish that. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Sponsored-By: Klara Inc. Closes #16301	2024-12-04 14:21:45 -05:00
Alexander Motin	6e3c109bc0	Fix regression in dmu_buf_will_fill() Direct I/O implementation added condition to call dbuf_undirty() only in case of block cloning. But the condition is not right if the block is no longer dirty in this TXG, but still in DB_NOFILL state. It resulted in block not reverting to DB_UNCACHED and following NULL de-reference on attempt to access absent db_data. While there, add assertions for db_data to make debugging easier. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16829	2024-12-02 17:08:40 -08:00
Pavel Snajdr	d2b0ca953f	Assert if we're logging after final txg was set This allowed to debug #16714, fixed in #16782. Without assertions added here it is difficult to figure out what logs cause the problem, since the assertion happens in sync thread context. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Closes #16795	2024-11-25 18:37:56 -05:00
Alexander Motin	ae1d11882d	BRT: Clear bv_entcount_dirty on destroy This fixes assertion in brt_sync_table() on debug builds when last cloned block on the vdev is freed and bv_meta_dirty is cleared, while bv_entcount_dirty is not. Should not matter in production. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16791	2024-11-21 08:20:22 -08:00
Alexander Motin	9a81484e35	ZAP: Reduce leaf array and free chunks fragmentation Previous implementation of zap_leaf_array_free() put chunks on the free list in reverse order. Also zap_leaf_transfer_entry() and zap_entry_remove() were freeing name and value arrays in reverse order. Together this created a mess in the free list, making following allocations much more fragmented than necessary. This patch re-implements zap_leaf_array_free() to keep existing chunks order, and implements non-destructive zap_leaf_array_copy() to be used in zap_leaf_transfer_entry() to allow properly ordered freeing name and value arrays there and in zap_entry_remove(). With this change test of some writes and deletes shows percent of non-contiguous chunks in DDT reducing from 61% and 47% to 0% and 17% for arrays and frees respectively. Sure some explicit sorting could do even better, especially for ZAPs with variable-size arrays, but it would also cost much more, while this should be very cheap. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16766	2024-11-20 13:37:52 -08:00
Mark Johnston	d76d79fd27	zio: Avoid sleeping in the I/O path zio_delay_interrupt(), apparently used for fault injection, is executed in the I/O pipeline. It can cause the calling thread to go to sleep, which is not allowed on FreeBSD. This happens only for small delays, though, and there's no apparent reason to avoid deferring to a taskqueue in that case, as it already does otherwise. Simply go to sleep unconditionally. This fixes an occasional panic I see when running the ZTS on FreeBSD. Also remove an unhelpful comment referencing the non-existent timeout_generic(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes #16785	2024-11-20 08:23:08 -08:00
Alexander Motin	457f8b76e7	BRT: More optimizations after per-vdev splitting - With both pending and current AVL-trees being per-vdev and having effectively identical comparison functions (pending tree compared also birth time, but I don't believe it is possible for them to be different for the same offset within one transaction group), it makes no sense to move entries from one to another. Instead inline dramatically simplified brt_entry_addref() into brt_pending_apply(). It no longer requires bv_lock, since there is nothing concurrent to it at the time. And it does not need to search the tree for the previous entries, since it is the same tree, we already have the entry and we know it is unique. - Put brt_vdev_lookup() and brt_vdev_addref() into different tree traversals to avoid false positives in the first due to the second entcount modifications. It saves dramatic amount of time when a file cloned first time by not looking for non-existent ZAP entries. - Remove avl_is_empty(bv_tree) check from brt_maybe_exists(). I don't think it is needed, since by the time all added entries are already accounted in bv_entcount. The extra check must be producing too many false positives for no reason. Also we don't need bv_lock there, since bv_entcount pointer must be table at this point, and we don't care about false positive races here, while false negative should be impossible, since all brt_vdev_addref() have already completed by this point. This dramatically reduces lock contention on massive deletes of cloned blocks. The only remaining one is between multiple parallel free threads calling brt_entry_decref(). - Do not update ZAP if net change for a block over the TXG was 0. In combination with above it makes file move between datasets as cheap operation as originally intended if it fits into one TXG. - Do not allocate vdevs on pool creation or import if it did not have active block cloning. This allows to save a bit in few cases. - While here, add proper error handling in brt_load() on pool import instead of assertions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16773	2024-11-20 06:20:14 -08:00
Alexander Motin	0ca82c5680	L2ARC: Stop rebuild before setting spa_final_txg Without doing that there is a race window on export when history log write by completed rebuild dirties transaction beyond final, triggering assertion. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16714 Closes #16782	2024-11-20 06:11:51 -08:00
Alexander Motin	534688948c	Remove hash_elements_max accounting from DBUF and ARC Those values require global atomics to get current hash_elements values in few of the hottest code paths, while in all the years I never cared about it. If somebody wants, it should be easy to get it by periodic sampling, since neither ARC header nor DBUF counts change so fast that it would be difficult to catch. For now I've left hash_elements_max kstat for ARC, since it was used/reported by arc_summary and it would break older versions, but now it just reports the current value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16759	2024-11-19 07:00:16 -08:00
Rob Norris	ffe2112795	Move "no name changes" from compression to checksum table Compression names actually aren't used in dedup table names, but checksum names are. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16776	2024-11-19 06:55:27 -08:00
Alexander Motin	fd6e8c1d2a	BRT: Rework structures and locks to be per-vdev While block cloning operation from the beginning was made per-vdev, before this change most of its data were protected by two pool- wide locks. It created lots of lock contention in many workload. This change makes most of block cloning data structures per-vdev, which allows to lock them separately. The only pool-wide lock now it spa_brt_lock, protecting array of per-vdev pointers and in most cases taken as reader. Also this splits per-vdev locks into three different ones: bv_pending_lock protects the AVL-tree of pending operations in open context, bv_mos_entries_lock protects BRT ZAP object from while being prefetched, and bv_lock protects the rest of per-vdev context during TXG commit process. There should be no functional difference aside of some optimizations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16740	2024-11-15 15:04:11 -08:00
Alexander Motin	309ce6303f	ZAP: Add by_dnode variants to lookup/prefetch_uint64 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16740	2024-11-15 15:04:02 -08:00
Alexander Motin	1ee251bdde	BRT: Don't call brt_pending_remove() on holes/embedded We are doing exactly the same checks around all brt_pending_add(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16740	2024-11-15 15:03:57 -08:00
Rob Norris	46c4f2ce0b	dsl_dataset: put IO-inducing frees on the pool deadlist dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline. Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return. If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of time on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised. This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks. For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #6783 Closes #16708 Closes #16722 Closes #16697	2024-11-13 07:38:42 -08:00
Alexander Motin	a60ed3822b	L2ARC: Move different stats updates earlier ..., before we make the header or the log block visible to others. It should fix assertion on allocated space going negative if the header is freed once the lock is dropped, while the write is still going. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16040 Closes #16743	2024-11-13 07:31:50 -08:00
Chunwei Chen	5945676bcc	ZFS send should use spill block prefetched from send_reader_thread Currently, even though send_reader_thread prefetches spill block, do_dump() will not use it and issues its own blocking arc_read. This causes significant performance degradation when sending datasets with lots of spill blocks. For unmodified spill blocks, we also create send_range struct for them in send_reader_thread and issue prefetches for them. We piggyback them on the dnode send_range instead of enqueueing them so we don't break send_range_after check. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: david.chen <david.chen@nutanix.com> Closes #16701	2024-11-06 11:52:01 -08:00
Alexander Motin	b16e096198	Reduce dirty records memory usage Small block workloads may use a very large number of dirty records. During simple block cloning test due to BRT still using 4KB blocks I can easily see up to 2.5M of those used. Before this change dbuf_dirty_record_t structures representing them were allocated via kmem_zalloc(), that rounded their size up to 512 bytes. Introduction of specialized kmem cache allows to reduce the size from 512 to 408 bytes. Additionally, since override and raw params in dirty records are mutually exclusive, puting them into a union allows to reduce structure size down to 368 bytes, increasing the saving to 28%, that can be a 0.5GB or more of RAM. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16694	2024-11-04 16:42:06 -08:00
Rob Norris	91bd12dfeb	zfs(4): remove "experimental" from zfs_bclone_enabled I think we've done enough experiments. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16189 Closes #16712	2024-11-01 14:43:25 -07:00
Rob Norris	3c650bec15	Revert "Workaround issue of Linux vdev_disk.c, (#16678 )" Now that we can handle these different alignments, we don't this workaround. This reverts commit `aefc2da8a5`. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #16687	2024-10-31 17:00:53 -07:00
Serapheim Dimitropoulos	ae93aeb849	Add warning for external consumers of dmu_tx_callback_register While reading some code @grwilson came across the above function that seemingly had no consumers besides a ztest callback that ensures that the tx_callback infrastructure works correctly. It turns out that Lustre is the main (and potentially the only) consumer of this. Refer to `osd_trans_commit_cb` of `lustre/osd-zfs/osd_handler.c` in the Lustre repo for more info. Let's add a comment highlighting this before someone removes it by mistake. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Serapheim Dimitropoulos <serapheimd@gmail.com> Closes #16698	2024-10-30 20:11:40 -04:00
Alexander Motin	6187b19434	On the first vdev open ignore impossible ashift hints If on the first open device's logical ashift is bigger than set by pool's ashift property, ignore the last as unusable instead of creating vdev that will fail most of I/Os due to misalignment. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #16690	2024-10-29 15:23:24 -04:00

1 2 3 4 5 ...

3302 Commits