mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs synced 2025-10-31 04:00:50 +00:00

Author	SHA1	Message	Date
Ameer Hamza	30cc2331f4	zed: Ensure spare activation after kernel-initiated device removal In addition to hotplug events, the kernel may also mark a failing vdev as REMOVED. This was observed in a customer report and reproduced by forcing the NVMe host driver to disable the device after a failed reset due to command timeout. In such cases, the spare was not activated because the device had already transitioned to a REMOVED state before zed processed the event. To address this, explicitly attempt hot spare activation when the kernel marks a device as REMOVED. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17187	2025-03-28 15:48:38 -04:00
Alexander Motin	4abc21b28c	Block remap for cloned blocks on device removal When after device removal we handle block pointers remap, skip blocks that might be cloned. BRTs are indexed by vdev id and offset from block pointer's DVA[0]. So if we start addressing the same block by some different DVA, we won't get the proper reference counter. As result, we might either remap the block twice, that may result in assertion during indirect mapping condense, or free it prematurely, that may result in data overwrite, or free it twice, that may result in assertion in spacemap code. Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #15604 Closes #17180	2025-03-26 16:45:34 -07:00
Pavel Snajdr	a0e62718cf	Linux: Fix zfs_prune panics v2 (#17121 ) It turns out that approach taken in the original version of the patch was wrong. So now, we're taking approach in-line with how kernel actually does it - when sb is being torn down, access to it is serialized via sb->s_umount rwsem, only when that lock is taken is it okay to work with s_flags - and the other mistake I was doing was trying to make SB_ACTIVE work, but apparently the kernel checks the negative variant - not SB_DYING and not SB_BORN. Kernels pre-6.6 don't have SB_DYING, but check if sb is hashed instead. Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-25 15:20:16 -07:00
Alexander Motin	94a3fabcb0	Unified allocation throttling (#17020 ) Existing allocation throttling had a goal to improve write speed by allocating more data to vdevs that are able to write it faster. But in the process it completely broken the original mechanism, designed to balance vdev space usage. With severe vdev space use imbalance it is possible that some with higher use start growing fragmentation sooner than others and after getting full will stop any writes at all. Also after vdev addition it might take a very long time for pool to restore the balance, since the new vdev does not have any real preference, unless the old one is already much slower due to fragmentation. Also the old throttling was request- based, which was unpredictable with block sizes varying from 512B to 16MB, neither it made much sense in case of I/O aggregation, when its 32-100 requests could be aggregated into few, leaving device underutilized, submitting fewer and/or shorter requests, or in opposite try to queue up to 1.6GB of writes per device. This change presents a completely new throttling algorithm. Unlike the request-based old one, this one measures allocation queue in bytes. It makes possible to integrate with the reworked allocation quota (aliquot) mechanism, which is also byte-based. Unlike the original code, balancing the vdevs amounts of free space, this one balances their free/used space fractions. It should result in a lower and more uniform fragmentation in a long run. This algorithm still allows to improve write speed by allocating more data to faster vdevs, but does it in more controllable way. On top of space-based allocation quota, it also calculates minimum queue depth that vdev is allowed to maintain, and respectively the amount of extra allocations it can receive if it appear faster. That amount is based on vdev's capacity and space usage, but also applied only when the pool is busy. This way the code can choose between faster writes when needed and better vdev balance when not, with the choice gradually reducing together with the free space. This change also makes allocation queues per-class, allowing them to throttle independently and in parallel. Allocations that are bounced between classes due to allocation errors will be able to properly throttle in the new class. Allocations that should not be throttled (ZIL, gang, copies) are not, but may still follow the rotor and allocation quota mechanism of the class without disrupting it. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com>	2025-03-24 09:25:01 -07:00
Rob Norris	45e9b54e9e	freebsd/kstat: allow multi-level module names This extends the existing special-case for zfs/poolname to split and create any number of intermediate sysctl names, so that multi-level module names are possible. Sponsored-by: Klara, Inc. Sponsored-by: Syneto Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-20 16:25:32 -07:00
Rob Norris	d28d2e3007	linux/kstat: allow multi-level module names Module names are mapped directly to directory names in procfs, but nothing is done to create the intermediate directories, or remove them. This makes it impossible to sensibly present kstats about sub-objects. This commit loops through '/'-separated names in the full module name, creates a separate module for each, and hooks them up with a parent pointer and child counter, and then unrolls this on the other side when deleting a module. Sponsored-by: Klara, Inc. Sponsored-by: Syneto Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-20 16:24:50 -07:00
Paul Dagnelie	9250403ba6	Make ganging redundancy respect redundant_metadata property (#17073 ) The redundant_metadata setting in ZFS allows users to trade resilience for performance and space savings. This applies to all data and metadata blocks in zfs, with one exception: gang blocks. Gang blocks currently just take the copies property of the IO being ganged and, if it's 1, sets it to 2. This means that we always make at least two copies of a gang header, which is good for resilience. However, if the users care more about performance than resilience, their gang blocks will be even more of a penalty than usual. We add logic to calculate the number of gang headers copies directly, and store it as a separate IO property. This is stored in the IO properties and not calculated when we decide to gang because by that point we may not have easy access to the relevant information about what kind of block is being stored. We also check the redundant_metadata property when doing so, and use that to decide whether to store an extra copy of the gang headers, compared to the underlying blocks. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Co-authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-19 15:58:29 -07:00
Alexander Motin	676b7ef104	Fix deadlock on I/O errors during device removal spa_vdev_remove_thread() should not hold svr_lock while loading a metaslab. It may block ZIO threads, required to handle metaslab loading, at least in case of read errors causing recovery writes. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17145	2025-03-19 14:48:47 -04:00
aokblast	83fa051ceb	spl_vfs: fix vrele task runner signature mismatch Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org> Closes #17101	2025-03-19 11:26:45 -04:00
Alan Somers	d033f26765	Always perform bounds-checking in metaslab_free_concrete The vd->vdev_ms access can overflow due to on-disk corruption, not just due to programming bugs. So it makes sense to check its boundaries even in production builds. Sponsored by: ConnectWise Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17136	2025-03-19 11:24:43 -04:00
Alexander Motin	3cd9934a48	Some arc_release() cleanup - Don't drop L2ARC header if we have more buffers in this header. Since we leave them the header, leave them the L2ARC header also. Honestly we are not required to drop it even if there are no other buffers, but then we'd need to allocate it a separate header, which we might drop soon if the old block is really deleted. Multiple buffers in a header likely mean active snapshots or dedup, so we know that the block in L2ARC will remain valid. It might be rare, but why not? - Remove some impossible assertions and conditions. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17126	2025-03-18 21:25:50 -04:00
Rob Norris	f69631992d	dmu_tx: rename dmu_tx_assign() flags from TXG_* to DMU_TX_* (#17143 ) This helps to avoids confusion with the similarly-named txg_wait_synced(). Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-03-18 16:04:22 -07:00
Rob Norris	a8847a7e4f	SPDX: license tags: LicenseRef-OpenZFS-ThirdParty-PublicDomain SPDX have repeatedly rejected the creation of a tag for a public domain dedication, as not all dedications are clear and unambiguious in their meaning and not all jurisdictions permit relinquishing a copyright anyway. A reasonably common workaround appears to be to create a local (project-specific) identifier to convey whatever meaning the project wishes it to. To cover OpenZFS' use of third-party code with a public domain dedication, we use this custom tag. Further reading: - https://github.com/spdx/old-wiki/blob/main/Pages/Legal%20Team/Decisions/Dealing%20with%20Public%20Domain%20within%20SPDX%20Files.md - https://spdx.github.io/spdx-spec/v2.3/other-licensing-information-detected/ - https://cr.yp.to/spdx.html Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:31 -07:00
Rob Norris	bf754d6010	SPDX: license tags: OpenSSL-standalone Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:27 -07:00
Rob Norris	44dc09614e	SPDX: license tags: Brian-Gladman-3-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:24 -07:00
Rob Norris	9fd7401d75	SPDX: license tags: BSD-2-Clause OR GPL-2.0-only Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:21 -07:00
Rob Norris	ff28d9fc6f	SPDX: license tags: BSD-3-Clause OR GPL-2.0-only Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:17 -07:00
Rob Norris	f83431b3bd	SPDX: license tags: GPL-2.0-or-later Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:09 -07:00
Rob Norris	c404ffb446	SPDX: license tags: Apache-2.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:57:05 -07:00
Rob Norris	7d8dd8d9a5	SPDX: license tags: MIT Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:54 -07:00
Rob Norris	4eafa9e5e8	SPDX: license tags: BSD-3-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:50 -07:00
Rob Norris	137045be98	SPDX: license tags: BSD-2-Clause Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:46 -07:00
Rob Norris	eb9098ed47	SPDX: license tags: CDDL-1.0 Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>	2025-03-13 17:56:27 -07:00
shodanshok	201d262949	Add receive:append permission for limited receive Force receive (zfs receive -F) can rollback or destroy snapshots and file systems that do not exist on the sending side (see zfs-receive man page). This means an user having the receive permission can effectively delete data on receiving side, even if such user does not have explicit rollback or destroy permissions. This patch adds the receive:append permission, which only permits limited, non-forced receive. Behavior for users with full receive permission is not changed in any way. Fixes #16943 Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes #17015	2025-03-13 13:54:14 -04:00
Paul Dagnelie	1b495eeab3	FDT dedup log sync -- remove incremental This PR condenses the FDT dedup log syncing into a single sync pass. This reduces the overhead of modifying indirect blocks for the dedup table multiple times per txg. In addition, changes were made to the formula for how much to sync per txg. We now also consider the backlog we have to clear, to prevent it from growing too large, or remaining large on an idle system. Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Authored-by: Don Brady <don.brady@klarasystems.com> Authored-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17038	2025-03-13 13:47:03 -04:00
Alexander Motin	0ea44e576b	Fix deduplication of overridden blocks Implementation of DDT pruning introduced verification of DVAs in a block pointer during ddt_lookup() to not by mistake free previous pruned incarnation of the entry. But when writing a new block in zio_ddt_write() we might have the DVAs only from override pointer, which may never have "D" flag to be confused with pruned DDT entry, and we'll abandon those DVAs if we find a matching entry in DDT. This fixes deduplication for blocks written via dmu_sync() for purposes of indirect ZIL write records, that I have tested. And I suspect it might actually allow deduplication for Direct I/O, even though in an odd way -- first write block directly and then delete it later during TXG commit if found duplicate, which part I haven't tested. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17120	2025-03-13 13:27:57 -04:00
Alexander Motin	09f4dd06c3	Prefer embedded blocks to dedup Since embedded blocks introduction 11 years ago, their writing was blocked if dedup is enabled. After searching through the modern code I see no reason for this restriction to exist. Same time embedded blocks are dramatically cheaper. Even regular write of so small blocks would likely be cheaper than deduplication, even if the last is successful, not mentioning otherwise. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17113	2025-03-13 13:27:15 -04:00
Rob Norris	13ec35ce3b	Linux/vnops: implement STATX_DIOALIGN This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both memory and IO alignment. For memory, it's always PAGE_SIZE. For IO, we return the current block size for the file, which is the required alignment for an arbitrary block, and for the first block we'll fall back to the ARC when necessary, so it should always work. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #16972	2025-03-13 13:15:14 -04:00
Alan Somers	0433523ca2	Verify every block pointer is either embedded, hole, or has a valid DVA Now instead of crashing when attempting to read the corrupt block pointer, ZFS will return ECKSUM, in a stack that looks like this: ``` none:set-error zfs.ko`arc_read+0x1d82 zfs.ko`dbuf_read+0xa8c zfs.ko`dmu_buf_hold_array_by_dnode+0x292 zfs.ko`dmu_read_uio_dnode+0x47 zfs.ko`zfs_read+0x2d5 zfs.ko`zfs_freebsd_read+0x7b kernel`VOP_READ_APV+0xd0 kernel`vn_read+0x20e kernel`vn_io_fault_doio+0x45 kernel`vn_io_fault1+0x15e kernel`vn_io_fault+0x150 kernel`dofileread+0x80 kernel`sys_read+0xb7 kernel`amd64_syscall+0x424 kernel`0xffffffff810633cb ``` This patch should hopefully also prevent such corrupt block pointers from being written to disk in the first place. And in zdb, don't crash when printing a block pointer with no valid DVAs. If a block pointer isn't embedded yet doesn't have any valid DVAs, that's a data corruption bug. zdb should be able to handle the situation gracefully. Finally, remove an extra check for gang blocks in SNPRINTF_BLKPTR. This check, which compares the asizes of two different DVAs within the same BP, was added by illumos-gate commit b24ab67[^1], and I can't understand why. It doesn't appear to do anything useful, so remove it. [^1]: `b24ab67627` Fixes #17077 Sponsored by: ConnectWise Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Signed-off-by: Alan Somers <asomers@gmail.com> Closes #17078	2025-03-13 13:07:48 -04:00
Alexander Motin	fe674998bb	Check portable objset MAC even if local is zeroed PR #14161 made spa_do_crypt_objset_mac_abd() to ignore MAC errors if local MAC can not be calculated at the time. But it does not mean we should also ignore portable MAC errors there. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17122	2025-03-08 21:15:11 -05:00
Rob Norris	7f05fface3	gcm_avx_init: zero the ghash state after hashing the IV IVs != 96 bits get hashed with GHASH to bring them to 96 bits. Any call to GHASH will mix the ghash state in gcm_ghash. This is expected to be zero at first use in an encrypt or decrypt operation, so it needs to be zeroed after using GHASH in setup. gcm_init() does this, but gcm_avx_init() zeroed it before setup, not after, resulting in incorrect encrypt/decrypt results when using AVX GCM with an IV != 96 bits. OpenZFS _always_ uses a 96 bit IV (ZIO_DATA_IV_LEN) so this will never have been hit in any real-world use, which is extremely fortunate, as we would have incorrectly-encrypted data on-disk. Still, as long as we have this code here we should make sure it's correct. Thanks-to: Joel Low <joel@joelsplace.sg> Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Attila Fülöp <attila@fueloep.org>	2025-02-25 17:31:08 -08:00
Ameer Hamza	ab3db6d15d	arc: avoid possible deadlock in arc_read In l2arc_evict(), the config lock may be acquired in reverse order (e.g., first the config lock (writer), then a hash lock) unlike in arc_read() during scenarios like L2ARC device removal. To avoid deadlocks, if the attempt to acquire the config lock (reader) fails in arc_read(), release the hash lock, wait for the config lock, and retry from the beginning. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #17071	2025-02-25 14:32:12 -05:00
Paul Dagnelie	701093c44f	Don't try to get mg of hole vdev in removal Don't try to get mg of hole vdev in removal Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17080	2025-02-25 14:30:51 -05:00
aokblast	a5fb5c55be	spa: fix signature mismatch for spa_boot_init as eventhandler required Reviewed-by: Ameer Hamza <ahamza@ixsystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: SHENGYI HONG <aokblast@FreeBSD.org> Closes #17088	2025-02-25 14:28:57 -05:00
Alexander Motin	d7d2744711	Better fill empty metaslabs Before this change zfs_metaslab_switch_threshold tunable switched metaslabs each time ones index reduced by two (which means biggest contiguous chunk reduced to 1/4). It is a good idea to balance metaslabs fragmentation. But for empty metaslabs (having power- of-2 sizes) this means switching when they get just below the half of their capacity. Inspection with zdb after filling new pool to half capacity shown most of its metaslabs filled to half capacity. I consider this sub-optimal for pool fragmentation in a long run. This change blocks the metaslabs switching if most of the metaslab free space (15/16) is represented by a single contiguous range. Such metaslab should not be considered fragmented until it actually fail some big allocation. More contiguous filling should improve data locality and increase time before previously filled and partially freed metaslab is touched again, giving it more time to free more contiguous chunks for lower fragmentation. It should also slightly reduce spacemap traffic. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #17081	2025-02-25 14:26:34 -05:00
Rob Norris	ee8803adc2	vdev_file: make FLUSH and TRIM asynchronous zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the zio_taskq thread is active and blocked both while waiting for the IO call and then while calling zio_execute() for the next stage. This is a particular issue for FLUSH, as the z_flush_iss queue typically only has one thread; multiple flushes arriving at once can cause long delays if the underlying fsync() response is particularly slow. To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq, just as we do for reads and writes. Further, we return all results through zio_interrupt(), so neither the issue nor the file taskqs are blocked. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17064	2025-02-22 14:16:54 -05:00
Chunwei Chen	682c5f6a0a	Fix wrong free function in arc_hdr_decrypt Need to use arc_free_data_abd to free abd type buffer. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #17079	2025-02-22 13:50:33 -05:00
Rob Norris	c43df8bbbf	vdev_file: unify FreeBSD and Linux implementations (#17046 ) Kernel & userspace specifics are in zfs_file_os.c, so there's no particular reason these have to be separate. The one platform-specific part is in the Linux kernel part, to offload flushes to a taskq if we're already inside a filesystem transaction. This would be normally be an unsatisfying wart, but I'm intending to remove this shortly, so I'm content to leave it gated for the moment. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <rob.norris@klarasystems.com>	2025-02-20 10:42:42 -08:00
Alexander Motin	6a2f7b3844	Fix metaslab group fragmentation math (#17037 ) Since we are calculating a free space fragmentation, we should weight metaslabs by the amount of their free space, not a full size. Fragmentation of full metaslabs may not matter in presence empty ones. The old algorithm did not differentiate metaslabs having only one free 4KB block from metaslabs having 50% of space free in 4KB blocks, reporting higher fragmentation. While there, move metaslab_group_alloc_update() call after setting mg_fragmentation, otherwise the effect may be delayed by one TXG. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-02-18 10:45:42 -08:00
Rob Norris	68473c4fd8	range_tree: convert remaining range_* defs to zfs_range_* Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-02-14 15:37:56 -08:00
Ivan Volosyuk	d4a5a7e3aa	Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_* Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols. Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com>	2025-02-14 15:37:48 -08:00
vandanarungta	db62886d98	Free memory in an error path in spl-kmem-cache.c skc->skc_name also needs to be freed in an error path. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Vandana Rungta <vrungta@amazon.com> Closes #17041	2025-02-11 20:37:17 -05:00
Rob Norris	b8c73ab780	zio: do no-op injections just before handing off to vdevs The purpose of no-op is to simulate a failure between a device cache and its permanent store. We still want it to go through the queue and respond in the same way to everything else. So, inject "success" as the very last thing, and then move on to VDEV_IO_DONE to be dequeued and so any followup work can occur. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17029	2025-02-07 20:42:24 -05:00
Dr. Christian Kohlschütter	d2147de319	Fix "make install" with DESTDIR set (#16995 ) "DESTDIR=/path/to/target/root/ make install" may fail when installing to a root that contains an existing lib/modules structure. When run as root we may even affect the wrong kernel (the build system's one, or, if running a different version, some other directory in /lib/modules, but not the desired one installed in DESTDIR). Add a missing reference to the INSTALL_MOD_PATH root when calling "depmod" during "make install" Also add a switch "DONT_DELETE_MODULES_FILES=1" that skips the removal of files named "modules.*" prior to running depmod. Signed-off-by: Christian Kohlschütter <christian@kohlschutter.com> Closes #16994 Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-02-07 16:38:58 -08:00
Paul Dagnelie	88020b993c	Add kstats tracking gang allocations Gang blocks have a significant impact on the long and short term performance of a zpool, but there is not a lot of observability into whether they're being used. This change adds gang-specific kstats to ZFS, to better allow users to see whether ganging is happening. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #17003	2025-02-06 15:42:50 -05:00
Paul Dagnelie	40496514b8	Expand fragmentation table to reflect larger possibile allocation sizes When you are using large recordsizes in conjunction with raidz, with incompressible data, you can pretty reliably be making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslabs with 16 MB free chunks completely unfragmented, so you can have a metaslab report 0% fragmented and be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance. We expand the fragmentation table to extend up to 512MB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Paul Dagnelie <paul.dagnelie@klarasystems.com> Closes #16986	2025-02-06 15:40:01 -05:00
Rob Norris	2ca91ba3cf	Linux 6.14: BLK_MQ_F_SHOULD_MERGE was removed According to the upstream change, all callers set it, and all block devices either honoured it or ignored it, so removing it entirely allows a bunch of handling for the "unset" case to be removed, and it becomes effectively implied. We follow suit, and keep setting it for older kernels. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-02-05 09:43:45 -08:00
Rob Norris	7ef6b70e96	Linux 6.14: dops->d_revalidate now takes four args This is a convenience for filesystems that need the inode of their parent or their own name, as its often complicated to get that information. We don't need those things, so this is just detecting which prototype is expected and adjusting our callback to match. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov>	2025-02-05 09:42:37 -08:00
Rob Norris	390f6c1190	zio: lock parent zios when updating wait counts on reexecute As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes #17016	2025-02-04 08:47:50 -05:00
Jaydeep Kshirsagar	21205f6488	Avoid ARC buffer transfrom operations in prefetch This change will prevent prefetch to perform unnecessary ARC buffer fill when reading from disk. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Jaydeep Kshirsagar <jkshirsagar@maxlinear.com> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Closes #17013	2025-02-01 11:15:24 -05:00

1 2 3 4 5 ...

4837 Commits