Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if your using 512b
sectors devices because none of the logic changes.
Closes#256
The default buffer size when requesting multiple quota entries
is 100 times the zfs_useracct_t size. In practice this works out
to exactly 27200 bytes. Since this will be a short lived buffer
in a non-performance critical path it is preferable to vmem_alloc()
the needed memory.
We will never bring over the pyzfs.py helper script from Solaris
to Linux. Instead the missing functionality will be directly
integrated in to the zfs commands and libraries. To avoid
confusion remove the warning about the missing pyzfs.py utility
and simply use the default internal support.
The Illumous developers are of the same mind and have proposed an
initial patch to do this which has been integrated in to the 'allow'
development branch. After some additional testing this code
can be merged in to master as the right long term solution.
Initially when zfsdev_ioctl() was ported to Linux we didn't have
any credential support implemented. So at the time we simply
passed NULL which wasn't much of a problem since most of the
secpolicy code was disabled.
However, one exception is quota handling which does require the
credential. Now that proper credentials are supported we can
safely start passing the callers credential. This is also an
initial step towards fully implemented the zfs secpolicy.
Normally when the arc_shrinker_func() function is called the return
value should be:
>=0 - To indicate the number of freeable objects in the cache, or
-1 - To indicate this cache should be skipped
However, when the shrinker callback is called with 'nr_to_scan' equal
to zero. The caller simply wants the number of freeable objects in
the cache and we must never return -1. This patch reorders the
first two conditionals in arc_shrinker_func() to ensure this behavior.
This patch also now explictly casts arc_size and arc_c_min to signed
int64_t types so MAX(x, 0) works as expected. As unsigned types
we would never see an negative value which defeated the purpose of
the MAX() lower bound and broke the shrinker logic.
Finally, when nr_to_scan is non-zero we explictly prevent all reclaim
below arc_c_min. This is done to prevent the Linux page cache from
completely crowding out the ARC. This limit is tunable and some
experimentation is likely going to be required to set it exactly right.
For now we're sticking with the OpenSolaris defaults.
Closes#218Closes#243
The comment in zfs_close() pertaining to decrementing the synchronous
open count needs to be updated for Linux. The code was already
updated to be correct, but the comment was missed and is now misleading.
Under Linux the zfs_close() hook is only called once when the final
reference is dropped. This differs from Solaris where zfs_close()
is called for each close.
Closes#237
This workaround was introduced to workaround issue #164. This
issue was fixed by commit 5f35b19 so the workaround can be safely
dropped from both the zfs.fedora and zfs.gentoo init scripts.
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute. While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets. Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed. Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.
Closes#216
The direct reclaim path in the z_wr_* threads must be disabled
to ensure forward progress is always maintained for txg processing.
This ensures that a txg will never get stuck waiting on itself
because it entered the following memory reclaim callpath.
->prune_icache()->dispose_list()->zpl_clear_inode()->zfs_inactive()
->dmu_tx_assign()->dmu_tx_wait()->tgx_wait_open()
It would be preferable to target this exact code path but the
kernel offers no way to do this without custom patches. To avoid
this we are forced to disable all reclaim for these threads. It
should not be necessary to do this for other other z_* threads
because they will not hold a txg open.
Closes#232
How nfsd handles .fsync() has been changed a couple of times in the
recent kernels. But basically there are three cases we need to
consider.
Linux 2.6.12 - 2.6.33
* The .fsync() hook takes 3 arguments
* The nfsd will call .fsync() with a NULL file struct pointer.
Linux 2.6.34
* The .fsync() hook takes 3 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
Linux 2.6.35 - 2.6.x
* The .fsync() hook takes 2 arguments
* The nfsd no longer calls .fsync() but instead used sync_inode()
For once it looks like we've gotten lucky. The first two cases can
actually be collased in to one if we stop using the file struct
pointer entirely. Since the dentry is still passed in both cases
this is possible. The last case can then be safely handled by
unconditionally using the dentry in the file struct pointer now
that we know the nfsd caller has been removed.
Closes#230
The zpool_id and zpool_layout helper scripts have been updated to
use the more common /usr/bin/awk symlink. On Fedora/Redhat systems
there are both /bin/awk and /usr/bin/awk symlinks to your installed
version of awk. On Debian/Ubuntu systems only the /usr/bin/awk
symlink exists.
Additionally, add the '\<' token to the beginning of the regex
pattern to prevent partial matches. This pattern only appears to
work with gawk despite the mawk man page claiming to support this
extended regex. Thus you will need to have gawk installed to use
these optional helper scripts. A comment has been added to the
script to reflect this reality.
The default buffer size when requesting history is 128k. This
is far to large for a kmem_alloc() so instead use the slower
vmem_alloc(). This path has no performance concerns and the
buffer is immediately free'd after its contents are copied to
the user space buffer.
Resolve this lintian warning:
W: libzfs-dev: wrong-section-according-to-package-name libzfs-dev => libdevel
N:
N: This package has a name suggesting that it belongs to a section other
N: than the one it is currently categorized in.
N:
N: Severity: normal, Certainty: possible
N:
Resolve this lintian warning:
W: zfs-dkms: maintainer-script-empty postrm
N:
N: The maintainer script doesn't seem to contain any code other than
N: comments and boilerplate (set -e, exit statements, and the case
N: statement to parse options). While this is harmless in most cases, it is
N: probably not what you wanted, may mean the package will leave
N: unnecessary files behind until purged, and may even lead to problems in
N: rare situations where dpkg would fail if no maintainer script was
N: present.
N:
N: If the package currently doesn't need to do anything in this maintainer
N: script, it shouldn't be included in the package.
N:
N: Severity: minor, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-dkms: maintainer-script-ignores-errors prerm
N:
N: The maintainer script doesn't seem to set the -e flag which ensures that
N: the script's execution is aborted when any executed command fails.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-dkms: maintainer-script-empty preinst
N:
N: The maintainer script doesn't seem to contain any code other than
N: comments and boilerplate (set -e, exit statements, and the case
N: statement to parse options). While this is harmless in most cases, it is
N: probably not what you wanted, may mean the package will leave
N: unnecessary files behind until purged, and may even lead to problems in
N: rare situations where dpkg would fail if no maintainer script was
N: present.
N:
N: If the package currently doesn't need to do anything in this maintainer
N: script, it shouldn't be included in the package.
N:
N: Severity: minor, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-dkms: executable-not-elf-or-script usr/src/zfs-0.6.0/config/ltmain.sh
N:
N: This executable file is not an ELF format binary, and does not start
N: with the #! sequence that marks interpreted scripts. It might be a sh
N: script that fails to name /bin/sh as its shell, or it may be incorrectly
N: marked as executable. Sometimes upstream files developed on Windows are
N: marked unnecessarily as executable on other systems.
N:
N: If you are using debhelper to build your package, running dh_fixperms
N: will often correct this problem for you.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-dkms: maintainer-script-ignores-errors postinst
N:
N: The maintainer script doesn't seem to set the -e flag which ensures that
N: the script's execution is aborted when any executed command fails.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-dkms: script-not-executable usr/src/zfs-0.6.0/autogen.sh
N:
N: This file starts with the #! sequence that marks interpreted scripts,
N: but it is not executable.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-initramfs: maintainer-script-ignores-errors preinst
N:
N: The maintainer script doesn't seem to set the -e flag which ensures that
N: the script's execution is aborted when any executed command fails.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning:
W: zfsutils: extra-license-file usr/share/doc/zfsutils/COPYING
N:
N: All license information should be collected in the debian/copyright
N: file. This usually makes it unnecessary for the package to install this
N: information in other places as well.
N:
N: Refer to Debian Policy Manual section 12.5 (Copyright information) for
N: details.
N:
N: Severity: normal, Certainty: possible
N:
Resolve this lintian warning:
W: zfsutils: maintainer-script-ignores-errors preinst
N:
N: The maintainer script doesn't seem to set the -e flag which ensures that
N: the script's execution is aborted when any executed command fails.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian warning by disabling the zpool-layout example:
W: zfsutils: executable-not-elf-or-script usr/libexec/zfs/zpool-layout/dragon.llnl.conf
W: zfsutils: executable-not-elf-or-script usr/libexec/zfs/zpool-layout/dragon.ddn.conf
N:
N: This executable file is not an ELF format binary, and does not start
N: with the #! sequence that marks interpreted scripts. It might be a sh
N: script that fails to name /bin/sh as its shell, or it may be incorrectly
N: marked as executable. Sometimes upstream files developed on Windows are
N: marked unnecessarily as executable on other systems.
N:
N: If you are using debhelper to build your package, running dh_fixperms
N: will often correct this problem for you.
N:
N: Refer to Debian Policy Manual section 10.4 (Scripts) for details.
N:
N: Severity: normal, Certainty: certain
N:
Resolve this lintian error:
E: zfsutils: extended-description-is-empty
N:
N: The extended description (the lines after the first line of the
N: "Description:" field) is empty.
N:
N: Severity: serious, Certainty: certain
N:
Resolve this lintian warning:
W: zfs-linux source: maintainer-script-lacks-debhelper-token debian/zfsutils.preinst
N:
N: This package is built using debhelper commands that may modify
N: maintainer scripts, but the maintainer scripts do not contain the
N: "#DEBHELPER#" token debhelper uses to modify them.
N:
N: Adding the token to the scripts is recommended.
N:
N: Severity: normal, Certainty: possible
N:
Resolve this lintian warning:
W: zfs-linux source: maintainer-script-lacks-debhelper-token debian/zfs-initramfs.preinst
N:
N: This package is built using debhelper commands that may modify
N: maintainer scripts, but the maintainer scripts do not contain the
N: "#DEBHELPER#" token debhelper uses to modify them.
N:
N: Adding the token to the scripts is recommended.
N:
N: Severity: normal, Certainty: possible
N:
With the addition of the mount helper we accidentally regressed
the ability to manually mount snapshots. This commit updates
the mount helper to expect the possibility of a ZFS_TYPE_SNAPSHOT.
All snapshot will be automatically treated as 'legacy' type mounts
so they can be mounted manually.
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visable
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)