The zvol_major and zvol_threads module options were being created
with 0 permission bits. This prevented them from being listed in
the /sys/module/zfs/parameters/ directory, although they were
visible in `modinfo zfs`. This patch fixes the issue by updating
the permission bits to 0444. For the moment these options must
be read-only because they are used during module initialization.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #392
In the upstream OpenSolaris ZFS code the maximum ARC usage is
limited to 3/4 of memory or all but 1GB, whichever is larger.
Because of how Linux's VM subsystem is organized these defaults
have proven to be too large which can lead to stability issues.
To avoid making everyone manually tune the ARC the defaults are
being changed to 1/2 of memory or all but 4GB. The rational for
this is as follows:
* Desktop Systems (less than 8GB of memory)
Limiting the ARC to 1/2 of memory is desirable for desktop
systems which have highly dynamic memory requirements. For
example, launching your web browser can suddenly result in a
demand for several gigabytes of memory. This memory must be
reclaimed from the ARC cache which can take some time. The
user will experience this reclaim time as a sluggish system
with poor interactive performance. Thus in this case it is
preferable to leave the memory as free and available for
immediate use.
* Server Systems (more than 8GB of memory)
Using all but 4GB of memory for the ARC is preferable for
server systems. These systems often run with minimal user
interaction and have long running daemons with relatively
stable memory demands. These systems will benefit most by
having as much data cached in memory as possible.
These values should work well for most configurations. However,
if you have a desktop system with more than 8GB of memory you may
wish to further restrict the ARC. This can still be accomplished
by setting the 'zfs_arc_max' module option.
Additionally, keep in mind these aren't currently hard limits.
The ARC is based on a slab implementation which can suffer from
memory fragmentation. Because this fragmentation is not visible
from the ARC it may believe it is within the specified limits while
actually consuming slightly more memory. How much more memory get's
consumed will be determined by how badly fragmented the slabs are.
In the long term this can be mitigated by slab defragmentation code
which was OpenSolaris solution. Or preferably, using the page cache
to back the ARC under Linux would be even better. See issue #75
for the benefits of more tightly integrating with the page cache.
This change also fixes a issue where the default ARC max was being
set incorrectly for machines with less than 2GB of memory. The
constant in the arc_c_max comparison must be explicitly cast to
a uint64_t type to prevent overflow and the wrong conditional
branch being taken. This failure was typically observed in VMs
which are commonly created with less than 2GB of memory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #75
For consistency and safety, quote all variables in the zfs.lsb script.
This protects in the unlikely case that any of the file names contain
whitespace.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
Let the administrator override all script variables by sourcing the
/etc/default/zfs file after the default values are set.
The spelling mistake in the old path name makes it unlikely that this
bug affected any users.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #371
The zpool_id script is posixly correct and does not use bash
features, so change its whackbang from /bin/bash to /bin/sh.
Debian policy also stipulates that system scripts be dash compatible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Direct invocation of GNU egrep is deprecated by its man page, and the
its argument in the zpool_id script is not an extended expression.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency and safety, quote all variables in the zpool_id
script. This accomodates a `-c CONFIG` parameter value with
whitespace in the path name.
Also fix a typo in the usage synopsis for `-h`.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
The /lib/udev/path_id helper became a builtin command in the udev 174
release, so test whether path_id is external in the zpool_id script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #429
Some of the functions' purpose wasn't immediately obvious without
additional explanations. This commit adds these missing comments.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of GCC 4.6, specific kernel 2.6.32 header files do not compile
cleanly without warnings. One specific example of this is the
arch/x86/include/asm/percpu.h file. Thus, a few of the configure tests
were getting hung up on this and the '-Wno-unsued-but-set-variables'
compile option had to be introduced.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#459
The Solaris version of ZFS does not allow xattrs to be set on
symlinks due to the way they implemented the attropen() system
call. Linux however implements xattrs through the lgetxattr()
and lsetxattr() system calls which do not have this limitation.
The only reason this hasn't always worked under ZFS on Linux
is that the xattr handlers were not registered for symlink type
inodes. This was done simply to be consistent with the Solaris
behavior.
Upon futher reflection I believe this should be allowed under
Linux. The only ill effect would be that the xattrs on symlinks
will not be visible when the pool is imported on a Solaris
system. This also has the benefit that it allows for SELinux
style security xattr labeling which expects to be able to set
xattrs on all inode types.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#272
The current ZFS implementation stores xattrs on disk using a hidden
directory. In this directory a file name represents the xattr name
and the file contexts are the xattr binary data. This approach is
very flexible and allows for arbitrarily large xattrs. However,
it also suffers from a significant performance penalty. Accessing
a single xattr can requires up to three disk seeks.
1) Lookup the dnode object.
2) Lookup the dnodes's xattr directory object.
3) Lookup the xattr object in the directory.
To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk. When
the xattr is to large to store in the inode then a single external
block is allocated for them. In practice most xattrs are small
and this approach works well.
The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization. When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block. If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.
This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below). When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.
However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations. If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.
----------------------------------------------------------------------
Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
| setxattr | getxattr
bytes | ext4 xfs zfs-dir zfs-sa | ext4 xfs zfs-dir zfs-sa
------+--------------------------------+------------------------------
1 | 2.33 31.88 21.50 4.57 | 2.35 2.64 6.29 2.43
32 | 2.79 30.68 21.98 4.60 | 2.44 2.59 6.78 2.48
256 | 3.25 31.99 21.36 5.92 | 2.32 2.71 6.22 3.14
1024 | 3.30 32.61 22.83 8.45 | 2.40 2.79 6.24 3.27
4096 | 3.57 317.46 22.52 10.73 | 2.78 28.62 6.90 3.94
16384 | n/a 2342.39 34.30 19.20 | n/a 45.44 145.90 7.55
65536 | n/a 2941.39 128.15 131.32* | n/a 141.92 256.85 262.12*
Legend:
* ext4 - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir - Directory based xattrs only.
* zfs-sa - Prefer SAs but spill in to directories as needed, a
trailing * indicates overflow in to directories occured.
NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
This change updates the AC_LANG_PROGRAM autoconf macro invocations to be
wrapped in quotes. As of autoconf version 2.68, the quotes are necessary
to prevent warnings from appearing. Specifically, the autoconf v2.68
Forward Porting Notes specifies:
It is important to note that you need to ensure that the call to
AC_LANG_SOURCE is quoted and not expanded, otherwise that will
cause the warning to appear nonetheless.
Finally, because of the additional quoting we can drop the extra
quotas used by the ZFS_AC_CONFIG_USER_STACK_GUARD autoconf check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#464
While setting/getting userquota and groupquota properties, the input
was not treated as a possible username or groupname if it had a
leading digit. While useradd in linux recommends the regexp
[a-z_][a-z0-9_-]*[$]? , it is not enforced. This causes problem for
usernames with leading digits in them. We need to be able to support
getting and setting properties for this unconventional but possible
input category
I've updated the code to validate the username or groupname directly
via the API. Also, note that I moved this validation to the beginning
before the check for SID names with @. This also supports usernames
with @ character in them which are valid. Only when input with @ is
not a valid username, it is interpreted as a potential SID name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#428
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.
Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblock in the vdev label
ring buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#425
The depmod utility from module-init-tools 3.12-pre3 generates a
warning when the -e option is used without -E or -F. This was
observed under OpenSuse 11.4. To resolve the issue when the
exact System.map-* for your kernel cannot be found fallback to
a generic safe '/sbin/depmod -a'.
WARNING: -e needs -E or -F
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The Linux 3.1 kernel updated the fops->fsync() callback yet again.
They now pass the requested range and delegate the responsibility
for calling filemap_write_and_wait_range() to the callback. In
addition imutex is no longer held by the caller and the callback
is responsible for taking the lock if required.
This commit updates the code to provide a zpl_fsync() function
for the updated API. Implementations for the previous two APIs
are also maintained for compatibility.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#445
Only under Ubuntu Lucid the rpm packaging step mistakenly adds
the following files twice to the package because of the /lib
naming convention. This is harmless but results in a warning
which the buildot flags as a failure. Suppress this warning.
warning: File listed twice: /lib/udev/rules.d
warning: File listed twice: /lib/udev/rules.d/60-zpool.rules
warning: File listed twice: /lib/udev/rules.d/60-zvol.rules
warning: File listed twice: /lib/udev/rules.d/90-zfs.rules
warning: File listed twice: /lib/udev/sas_switch_id
warning: File listed twice: /lib/udev/zpool_id
warning: File listed twice: /lib/udev/zvol_id
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
Fix an unlikely failure cause in zfs_sb_create() which could
leave the dataset owned on error and thus unavailable until
after a reboot. Disown the dataset if SA are expected but
are in fact missing.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Profiling the system during meta data intensive workloads such
as creating/removing millions of files, revealed that the system
was cpu bound. A large fraction of that cpu time was being spent
waiting on the virtual address space spin lock.
It turns out this was caused by certain heavily used kmem_caches
being backed by virtual memory. By default a kmem_cache will
dynamically determine the type of memory used based on the object
size. For large objects virtual memory is usually preferable
and for small object physical memory is a better choice. See
the spl_slab_alloc() function for a longer discussion on this.
However, there is a certain amount of gray area when defining a
'large' object. For the following caches it turns out they were
just over the line:
* dnode_cache
* zio_cache
* zio_link_cache
* zio_buf_512_cache
* zfs_data_buf_512_cache
Now because we know there will be a lot of churn in these caches,
and because we know the slabs will still be reasonably sized.
We can safely request with the KMC_KMEM flag that the caches be
backed with physical memory addresses. This entirely avoids the
need to serialize on the virtual address space lock.
As a bonus this also reduces our vmalloc usage which will be good
for 32-bit kernels which have a very small virtual address space.
It will also probably be good for interactive performance since
unrelated processes could also block of this same global lock.
Finally, we may see less cpu time being burned in the arc_reclaim
and txg_sync_threads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #258
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure. It may have already been set when entering
zpl_putpage() in which case it must remain set on exit. In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim. By clearing
it we allow the following NULL deref to potentially occur.
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #287
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#431
Per issue zfsonlinux/zfs#147 by @jvolkman:
The `zpool_id` script invokes `/lib/udev/path_id`, which was removed
in udev-174, so depend the zfsutils package on earlier releases.
When calculating space needed for SA_BONUS buffers, hdrsize is
always rounded up to next 8-aligned boundary. However, in two places
the round up was done against sum of 'total' plus hdrsize. On the
other hand, hdrsize increments by 4 each time, which means in certain
conditions, we would end up returning with will_spill == 0 and
(total + hdrsize) larger than full_space, leading to a failed
assertion because it's invalid for dmu_set_bonus.
Reviewed by: Matthew Ahrens <matt@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/1661
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#426
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Increment the META version. The previous build was bad because the
volatile-version.patch was not updated, which caused this error:
Setting up zfs-dkms (0.6.0.35-0ubuntu1~oneiric1) ...
First Installation: checking all kernels...
Building only for 3.0.0-12-generic
This package appears to be a binaries-only package
you will not be able to build against kernel 3.0.0-12-generic
since the package source was not provided
Register the setattr/getattr callbacks for symlinks. Without these
the generic inode_setattr() and generic_fillattr() functions will
be used. In the setattr case this will only result in the inode being
updated in memory, the dirty_inode callback would also normally run
but none is registered for zfs.
The straight forward fix is to set the setattr/getattr callbacks
for symlinks so they are handled just like files and directories.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#412
An incomplete guid_to_ds_map would cause restore_write_byref() to fail
while receiving a de-duplicated backup stream.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Garrett D`Amore <garrett@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/755
- https://github.com/illumos/illumos-gate/commit/ec5cf9d53a
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#372
zfs.spec.in and zfs-modules.spec.in had the License field incorrectly
set to @LICENSE@, causing generated rpm packages to report an invalid
license string. Fix this by using @ZFS_META_LICENSE@.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#422
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reflect upstream commit 571837e130 in
the debian packaging.
The Lustre packages satify their backend fs requirement by
checking that lustre-backend-fs is provided. Update the zfs
packaging accordingly.
The system shell on most Debian and Ubuntu systems is dash, so change
the whack-bang in the Dracut scripts from /bin/sh to /bin/bash.
The `printf "\x$DD\x$CC\x$BB\x$AA" >$TMP` line is problematic because
dash builtin does not implement the hex format.
The dracut/ tree needs more testing and tweaking for the older
dracut-005 package that is in Debian Squeeze and Ubuntu Oneiric.