and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
+165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
DrPKlXwIgva7zq5/S0Vr
=4s1Z
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
in the jbd2 layer and in xattr handling when the extended attributes
spill over into an external block.
Other than that, the usual clean ups and minor bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
ext4: fix premature freeing of partial clusters split across leaf blocks
ext4: remove unneeded test of ret variable
ext4: fix comment typo
ext4: make ext4_block_zero_page_range static
ext4: atomically set inode->i_flags in ext4_set_inode_flags()
ext4: optimize Hurd tests when reading/writing inodes
ext4: kill i_version support for Hurd-castrated file systems
ext4: each filesystem creates and uses its own mb_cache
fs/mbcache.c: doucple the locking of local from global data
fs/mbcache.c: change block and index hash chain to hlist_bl_node
ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
ext4: refactor ext4_fallocate code
ext4: Update inode i_size after the preallocation
ext4: fix partial cluster handling for bigalloc file systems
ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
ext4: only call sync_filesystm() when remounting read-only
fs: push sync_filesystem() down to the file system's remount_fs()
jbd2: improve error messages for inconsistent journal heads
jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
jbd2: minimize region locked by j_list_lock in journal_get_create_access()
...
The following case may lead to the same system inode ref in confusion.
A thread B thread
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
because of *arr == NULL,
ocfs2_get_system_file_inode
->get_local_system_inode
->_ocfs2_get_system_file_inode
gets first ref thru
_ocfs2_get_system_file_inode,
gets second ref thru igrab and
set *arr = inode
at the moment, B thread also gets
two refs, so lead to one more
inode ref.
So add mutex lock to avoid multi thread set two inode ref once at the
same time.
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following patches are reverted in this patch because these patches
caused performance regression in the remote unlink() calls.
ea455f8ab6 - ocfs2: Push out dropping of dentry lock to ocfs2_wq
f7b1aa69be - ocfs2: Fix deadlock on umount
5fd1318937 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
Previous patches in this series removed the possible deadlocks from
downconvert thread so the above patches shouldn't be needed anymore.
The regression is caused because these patches delay the iput() in case
of dentry unlocks. This also delays the unlocking of the open lockres.
The open lockresource is required to test if the inode can be wiped from
disk or not. When the deleting node does not get the open lock, it
marks it as orphan (even though it is not in use by another
node/process) and causes a journal checkpoint. This delays operations
following the inode eviction. This also moves the inode to the orphaned
inode which further causes more I/O and a lot of unneccessary orphans.
The following script can be used to generate the load causing issues:
declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do mkdir -p test/dir\${n}
eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"
function fcreate ()
{
exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}
function fremove ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}
function fcp ()
{
exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}
echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++)) do
create[$x]="`fcreate ${iterations[$x]}`"
copy[$x]="`fcp ${iterations[$x]}`"
remove[$x]="`fremove`"
printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We cannot drop last dquot reference from downconvert thread as that
creates the following deadlock:
NODE 1 NODE2
holds dentry lock for 'foo'
holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
dquot_initialize(bar)
ocfs2_dquot_acquire()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
...
downconvert thread (triggered from another
node or a different process from NODE2)
ocfs2_dentry_post_unlock()
...
iput(foo)
ocfs2_evict_inode(foo)
ocfs2_clear_inode(foo)
dquot_drop(inode)
...
ocfs2_dquot_release()
ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
- blocks
finds we need more space in
quota file
...
ocfs2_extend_no_holes()
ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
- deadlocks waiting for
downconvert thread
We solve the problem by postponing dropping of the last dquot reference to
a workqueue if it happens from the downconvert thread.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, ocfs2_sync_file grabs i_mutex and forces the current journal
transaction to complete. This isn't terribly efficient, since sync_file
really only needs to wait for the last transaction involving that inode
to complete, and this doesn't require i_mutex.
Therefore, implement the necessary bits to track the newest tid
associated with an inode, and teach sync_file to wait for that instead
of waiting for everything in the journal to commit. Furthermore, only
issue the flush request to the drive if jbd2 hasn't already done so.
This also eliminates the deadlock between ocfs2_file_aio_write() and
ocfs2_sync_file(). aio_write takes i_mutex then calls
ocfs2_aiodio_wait() to wait for unaligned dio writes to finish.
However, if that dio completion involves calling fsync, then we can get
into trouble when some ocfs2_sync_file tries to take i_mutex.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it
up.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a problem that waitqueue_active() may check stale data thus miss
a wakeup of threads waiting on ip_unaligned_aio.
The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to
be of type mutex thus the above prolem is avoid. Another benifit is that
mutex which works as FIFO is fairer than wake_up_all().
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2 nodes cluster, say Node A and Node B, mount the same ocfs2 volume, and
create a file 1.
Node A Node B
open 1, get open lock
rm 1, and then add 1 to orphan_dir
storage link down,
o2hb_write_timeout
->o2quo_disk_timeout
->emergency_restart
at the moment, Node B dismount and do
ocfs2rec simultaneously
1) ocfs2_dismount_volume
->ocfs2_recovery_exit
->wait_event(osb->recovery_event)
->flush_workqueue(ocfs2_wq)
2) ocfs2rec
->queue_work(&journal->j_recovery_work)
->ocfs2_recover_orphans
->ocfs2_commit_truncate
->queue_delayed_work(&osb->osb_truncate_log_wq)
In ocfs2_recovery_exit, it flushes workqueue and then releases system
inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log
which will try to get sys_root_inode, and NULL pointer dereference
occurs.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce <xuejiufei@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code. This was tested by using the kernel with changes on older
unmodified tools. The kernel used ocfs2_controld as expected, and
displayed the appropriate warning message.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.
This patch (of 6):
Add clustername to cluster connection.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So return ENOMEM instead when it fails.
[joseph.qi@huawei.com: ocfs2_symlink_get_block() and ocfs2_read_blocks_sync() and ocfs2_read_blocks() need the same change]
Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While printing 32-bit node numbers, an 8-byte string is not enough.
Increase the size of the string to 12 chars.
This got left out in commit 49fa8140e4 ("fs/ocfs2/super.c: Use bigger
nodestr to accomodate 32-bit node numbers").
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While using pacemaker/corosync, the node numbers are generated using IP
address as opposed to serial node number generation. This may not fit
in a 8-byte string. Use a bigger string to print the complete node
number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Code cleanup: needs_checkpoint is assigned to but never used. Delete
the variable.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add missing MODULE_ALIAS_FS("ocfs2") how did I miss that?
Remove unnecessary MODULE_ALIAS_FS("devpts") devpts can not be modular.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
There's no reason to call rcu_barrier() on every
deactivate_locked_super(). We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.
Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths. E.g. on my machine exit_group() of a last process in IPC
namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ocfs2: Fix cleancache initialization call to correctly pass uuid
As reported by Steven Whitehouse in https://lkml.org/lkml/2011/5/27/221
the ocfs2 volume UUID is incorrectly passed to cleancache.
As a result, shared-ephemeral tmem pools will not actually
be created; instead they will be private (unshared) which
misses out on a major benefit of tmem.
Reported-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf4) and xfs recently had to fix
similar issues.
In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.
If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.
I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ocfs2 cannot currently mount a device that is readonly at the media
("hard readonly"). Fix the broken places.
see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322
[ Description edited -- Joel ]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Reviewed-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (28 commits)
Ocfs2: Teach local-mounted ocfs2 to handle unwritten_extents correctly.
ocfs2/dlm: Do not migrate resource to a node that is leaving the domain
ocfs2/dlm: Add new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG
Ocfs2/move_extents: Set several trivial constraints for threshold.
Ocfs2/move_extents: Let defrag handle partial extent moving.
Ocfs2/move_extents: move/defrag extents within a certain range.
Ocfs2/move_extents: helper to calculate the defraging length in one run.
Ocfs2/move_extents: move entire/partial extent.
Ocfs2/move_extents: helpers to update the group descriptor and global bitmap inode.
Ocfs2/move_extents: helper to probe a proper region to move in an alloc group.
Ocfs2/move_extents: helper to validate and adjust moving goal.
Ocfs2/move_extents: find the victim alloc group, where the given #blk fits.
Ocfs2/move_extents: defrag a range of extent.
Ocfs2/move_extents: move a range of extent.
Ocfs2/move_extents: lock allocators and reserve metadata blocks and data clusters for extents moving.
Ocfs2/move_extents: Add basic framework and source files for extent moving.
Ocfs2/move_extents: Adding new ioctl code 'OCFS2_IOC_MOVE_EXT' to ocfs2.
Ocfs2/refcounttree: Publicize couple of funcs from refcounttree.c
Ocfs2: Add a new code 'OCFS2_INFO_FREEFRAG' for o2info ioctl.
Ocfs2: Add a new code 'OCFS2_INFO_FREEINODE' for o2info ioctl.
...
This eighth patch of eight in this cleancache series "opts-in"
cleancache for ocfs2. Clustered filesystems must explicitly enable
cleancache by calling cleancache_init_shared_fs anytime an instance
of the filesystem is mounted. Ocfs2 is currently the only user of
the clustered filesystem interface but nevertheless, the cleancache
hooks in the VFS layer are sufficient for ocfs2 including the matching
cleancache_flush_fs hook which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v8: trivial merge conflict update]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Nitin Gupta <ngupta@vflare.org>
As ocfs2 supports relatime and strictatime, we need update the
relative document. Atime_quantum need work with strictatime, so only
show it in procfs when mount with strictatime.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix build failure introduced by s/freezeable/freezable/
workqueue: add system_freezeable_wq
rds/ib: use system_wq instead of rds_ib_fmr_wq
net/9p: replace p9_poll_task with a work
net/9p: use system_wq instead of p9_mux_wq
xfs: convert to alloc_workqueue()
reiserfs: make commit_wq use the default concurrency level
ocfs2: use system_wq instead of ocfs2_quota_wq
ext4: convert to alloc_workqueue()
scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
misc/iwmc3200top: use system_wq instead of dedicated workqueues
i2o: use alloc_workqueue() instead of create_workqueue()
acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
fs/aio: aio_wq isn't used in memory reclaim path
input/tps6507x-ts: use system_wq instead of dedicated workqueue
cpufreq: use system_wq instead of dedicated workqueues
wireless/ipw2x00: use system_wq instead of dedicated workqueues
arm/omap: use system_wq in mailbox
workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
About one year ago, Wengang Wang tried some first steps
to add tracepoints to ocfs2. Hiss original patch is here:
http://oss.oracle.com/pipermail/ocfs2-devel/2009-November/005512.html
But as Steven Rostedt indicated in his article
http://lwn.net/Articles/383362/, we'd better have our trace
files resides in fs/ocfs2, so I rewrited the patch using the
method Steven mentioned in that article.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
mlog_exit is used to record the exit status of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
This patch just try to remove it or change it. So:
1. if all the error paths already use mlog_errno, it is just removed.
Otherwise, it will be replaced by mlog_errno.
2. if it is used to print some return value, it is replaced with
mlog(0,...).
mlog_exit_ptr is changed to mlog(0.
All those mlog(0,...) will be replaced with trace events later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
ENTRY is used to record the entry of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
So for mlog_entry_void, we just remove it.
for mlog_entry(...), we replace it with mlog(0,...), and they
will be replace by trace event later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Commit 2c442719e9 added some checks for proper
heartbeat mode when the o2cb stack is running. Unfortunately, it didn't
take into account that a userpsace stack could be running. Fix this by only
doing the check if o2cb is in use. This patch allows userspace stacks to
mount the fs again.
Cc: stable@kernel.org
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
ocfs2_quota_wq is not depended upon during memory reclaim and, with
cmwq, there's no reason to use a dedicated workqueue. Drop
ocfs2_quota_wq and use system_wq instead. dqi_sync_work is already
sync canceled on quota disable and no further synchronization is
necessary.
This change makes ocfs2_quota_setup/shutdown() noops. Both functions
removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.
Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
The big kernel lock has been removed from all these files at some point,
leaving only the #include.
Remove this too as a cleanup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
BKL: remove BKL from freevxfs
BKL: remove BKL from qnx4
autofs4: Only declare function when CONFIG_COMPAT is defined
autofs: Only declare function when CONFIG_COMPAT is defined
ncpfs: Lock socket in ncpfs while setting its callbacks
fs/locks.c: prepare for BKL removal
BKL: Remove BKL from ncpfs
BKL: Remove BKL from OCFS2
BKL: Remove BKL from squashfs
BKL: Remove BKL from jffs2
BKL: Remove BKL from ecryptfs
BKL: Remove BKL from afs
BKL: Remove BKL from USB gadgetfs
BKL: Remove BKL from autofs4
BKL: Remove BKL from isofs
BKL: Remove BKL from fat
BKL: Remove BKL from ext2 filesystem
BKL: Remove BKL from do_new_mount()
BKL: Remove BKL from cgroup
BKL: Remove BKL from NTFS
...
Currently, the default behavior of O_DIRECT writes was allowing
concurrent writing among nodes to the same file, with no cluster
coherency guaranteed (no EX lock held). This can leave stale data in
the cache for buffered reads on other nodes.
The new mount option introduce a chance to choose two different
behaviors for O_DIRECT writes:
* coherency=full, as the default value, will disallow
concurrent O_DIRECT writes by taking
EX locks.
* coherency=buffered, allow concurrent O_DIRECT writes
without EX lock among nodes, which
gains high performance at risk of
getting stale data on other nodes.
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Functions such as ocfs2_recovery_init() make use of osb->max_slots.
Initialize osb->max_slots early so the functions may use the correct
value.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Adds support for heartbeat=global mount option. It ensures that the heartbeat
mode passed matches the one enabled on disk.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
OCFS2_FEATURE_INCOMPAT_CLUSTERINFO allows us to use sb->s_cluster_info for
both userspace and o2cb cluster stacks. It also allows us to extend cluster
info to include stack flags.
This patch also adds stackflags to sb->s_clusterinfo. It also introduces a
clusterinfo flag OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT to denote the enabled
global heartbeat mode.
This incompat flag can be set/cleared using tunefs.ocfs2 --fs-features. The
clusterinfo flag is set/cleared using tunefs.ocfs2 --update-cluster-stack.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The BKL in ocfs2/dlmfs is used in put_super, fill_super and remount_fs
that are all three protected by the superblocks s_umount rw_semaphore.
The use in ocfs2_control_open is evidently unrelated and the function
is protected by ocfs2_control_lock.
Therefore it is safe to remove the BKL entirely.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.
I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.
do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.
Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.
[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]
Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Durring orphan scan, if we are slot 0, and we are replaying
orphan_dir:0001, the general process is that for every file
in this dir:
1. we will iget orphan_dir:0001, since there is no inode for it.
we will have to create an inode and read it from the disk.
2. do the normal work, such as delete_inode and remove it from
the dir if it is allowed.
3. call iput orphan_dir:0001 when we are done. In this case,
since we have no dcache for this inode, i_count will
reach 0, and VFS will have to call clear_inode and in
ocfs2_clear_inode we will checkpoint the inode which will let
ocfs2_cmt and journald begin to work.
4. We loop back to 1 for the next file.
So you see, actually for every deleted file, we have to read the
orphan dir from the disk and checkpoint the journal. It is very
time consuming and cause a lot of journal checkpoint I/O.
A better solution is that we can have another reference for these
inodes in ocfs2_super. So if there is no other race among
nodes(which will let dlmglue to checkpoint the inode), for step 3,
clear_inode won't be called and for step 1, we may only need to
read the inode for the 1st time. This is a big win for us.
So this patch will try to cache system inodes of other slots so
that we will have one more reference for these inodes and avoid
the extra inode read and journal checkpoint.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The OCFS2 developers have already done all of the hard work to allow
volumes larger than 16 TiB. But there is still a "sanity check" in
fs/ocfs2/super.c that prevents the mounting of such volumes, even when
the cluster size and journal options would allow it.
This patch replaces that sanity check with a more sophisticated one to
mount a huge volume provided that (a) it is addressable by the raw
word/address size of the system (borrowing a test from ext4); (b) the
volume is using JBD2; and (c) the JBD2_FEATURE_INCOMPAT_64BIT flag is
set on the journal.
I factored out the sanity check into its own function. I also moved it
from ocfs2_initialize_super() down to ocfs2_check_volume(); any earlier,
and the journal will not have been initialized yet.
This patch is one of a pair, and it depends on the other ("JBD2: Allow
feature checks before journal recovery").
I have tested this patch on small volumes, huge volumes, and huge
volumes without 64-bit block support in the journal. All of them appear
to work or to fail gracefully, as appropriate.
Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
no need for list_for_each_entry_safe()/resetting with superblock list
Fix sget() race with failing mount
vfs: don't hold s_umount over close_bdev_exclusive() call
sysv: do not mark superblock dirty on remount
sysv: do not mark superblock dirty on mount
btrfs: remove junk sb_dirt change
BFS: clean up the superblock usage
AFFS: wait for sb synchronization when needed
AFFS: clean up dirty flag usage
cifs: truncate fallout
mbcache: fix shrinker function return value
mbcache: Remove unused features
add f_flags to struct statfs(64)
pass a struct path to vfs_statfs
update VFS documentation for method changes.
All filesystems that need invalidate_inode_buffers() are doing that explicitly
convert remaining ->clear_inode() to ->evict_inode()
Make ->drop_inode() just return whether inode needs to be dropped
fs/inode.c:clear_inode() is gone
fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
...
Fix up trivial conflicts in fs/nilfs2/super.c
Follow the dquot_* style used elsewhere in dquot.c.
[Jan Kara: Fixed up missing conversion of ext2]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Remount handling has fully moved into the filesystem, so all this is
superflous now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly. Also rename vfs_quota_disable to
dquot_disable while we're at it.
[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
We also have to cancel quota syncing thread on remount read only because
at that moment quota is being turned off. Otherwise quota syncing thread
will try to access already freed quota structures.
Signed-off-by: Jan Kara <jack@suse.cz>
We cannot cancel delayed work from ocfs2_local_free_info because that is called
with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to
finish. Cancel the work before acquiring dqonoff_mutex.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
ocfs2 sometimes needs to block signals around dlm operations, but it
currently does it with sigprocmask(). Even worse, it's checking the
error code of sigprocmask(). The in-kernel sigprocmask() can only error
if you get the SIG_* argument wrong. We don't.
Wrap the sigprocmask() calls with ocfs2_[un]block_signals(). These
functions are void, but they will BUG() if somehow sigprocmask() returns
an error.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
OCFS2 has never really supported intr. This patch acknowledges this reality
and makes nointr the default mount option. In a later patch, we intend to
support intr.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The default behavior for directory reservations stays the same, but we add a
mount option so people can tweak the size of directory reservations
according to their workloads.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
I have observed that the current size of 8M gives us pretty poor
fragmentation on multi-threaded workloads which do lots of writes.
Generally, I can increase the size of local alloc windows and observe a
marked decrease in fragmentation, even up and beyond window sizes of 512
megabytes. This makes sense for a couple reasons - larger local alloc means
more room for reservation windows. On multi-node workloads the larger local
alloc helps as well because we don't have to do window slides as often.
Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
longer used and the comment above it was out of date.
To test fragmentation, I used a workload which launched 4 threads that did
4k writes into a series of about 140 alternating files.
With resv_level=2, and a 4k/4k file system I observed the following average
fragmentation for various localalloc= parameters:
localalloc= avg. fragmentation
8 48
32 16
64 10
120 7
On larger cluster sizes, the difference is more dramatic.
The new default size top out at 256M, which we'll only get for cluster
sizes of 32K and above.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch pulls the local alloc sizing code into localalloc.c and provides
a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
except that I correctly calculate the maximum local alloc size. The old code
in ocfs2_parse_options() calculated the max size as:
ocfs2_local_alloc_size(sb) * 8
which is correct, in bits. Unfortunately though the option passed in is in
megabytes. Ultimately, this bug made no real difference - the shrink code
would catch a too-large size and bring it down to something reasonable.
Still, it's less than efficient as-is.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch improves Ocfs2 allocation policy by allowing an inode to
reserve a portion of the local alloc bitmap for itself. The reserved
portion (allocation window) is advisory in that other allocation
windows might steal it if the local alloc bitmap becomes
full. Otherwise, the reservations are honored and guaranteed to be
free. When the local alloc window is moved to a different portion of
the bitmap, existing reservations are discarded.
Reservation windows are represented internally by a red-black
tree. Within that tree, each node represents the reservation window of
one inode. An LRU of active reservations is also maintained. When new
data is written, we allocate it from the inodes window. When all bits
in a window are exhausted, we allocate a new one as close to the
previous one as possible. Should we not find free space, an existing
reservation is pulled off the LRU and cannibalized.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_group_bitmap_size has to handle the case when the
volume don't have discontiguous block group support. So
pass the feature_incompat in and check it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Defines the OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG feature bit and modifies
struct ocfs2_group_desc for the feature.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c
ocfs2/trivial: Use proper mask for 2 places in hearbeat.c
Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink.
Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode?
ocfs2: Use FIEMAP_EXTENT_SHARED
fiemap: Add new extent flag FIEMAP_EXTENT_SHARED
ocfs2: replace u8 by __u8 in ocfs2_fs.h
ocfs2: explicit declare uninitialized var in user_cluster_connect()
ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock()
ocfs2: return -EAGAIN instead of EAGAIN in dlm
ocfs2/cluster: Make fence method configurable - v2
ocfs2: Set MS_POSIXACL on remount
ocfs2: Make acl use the default
ocfs2: Always include ACL support
Currently the f_fsid of struct kstatfs returned from ocfs2_statfs() is
undefined (vfs layer fills in 0 as default). Since in some conditions,
f_fsid value might be used in a (f_fsid, ino) pair to uniquely identify
a file, ocfs2 should return a unique defined f_fsid value from
ocfs2_statfs().
Because uuid_str is the same on big or litlle endian machine, it's
endian consistent to use osb->uuid_str to generate f_fsid value.
Signed-off-by: Coly Li <coly.li@suse.de>
Cc: Sunil Mushran <sunil.mushran@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We have to set MS_POSIXACL on remount as well. Otherwise VFS
would not know we started supporting ACLs after remount and
thus ACLs would not work.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Change acl mount options handling to match the one of XFS and BTRFS and
hopefully it is also easier to use now. When admin does not specify any
acl mount option, acls are enabled if and only if the filesystem has
xattr feature enabled. If admin specifies 'acl' mount option, we fail
the mount if the filesystem does not have xattr feature and thus acls
cannot be enabled.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
To become consistent with filesystems such as XFS or BTRFS, make posix
ACLs always available. This also reduces possibility of
misconfiguration on admin's side.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In case of non-modular kernels the root filesystem is mounted by trying
several filesystems. If ocfs2 was tried before the actual filesystem
type, the mount would fail because ocfs2_sb_probe() returns -EAGAIN
instead of -EINVAL. ocfs2 will now return -EINVAL properly.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reported-by: Laszlo Attila Toth <panther@balabit.hu>
* remove asm/atomic.h inclusion from linux/utsname.h --
not needed after kref conversion
* remove linux/utsname.h inclusion from files which do not need it
NOTE: it looks like fs/binfmt_elf.c do not need utsname.h, however
due to some personality stuff it _is_ needed -- cowardly leave ELF-related
headers and files alone.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (85 commits)
ocfs2: Use buffer IO if we are appending a file.
ocfs2: add spinlock protection when dealing with lockres->purge.
dlmglue.c: add missed mlog lines
ocfs2: __ocfs2_abort() should not enable panic for local mounts
ocfs2: Add ioctl for reflink.
ocfs2: Enable refcount tree support.
ocfs2: Implement ocfs2_reflink.
ocfs2: Add preserve to reflink.
ocfs2: Create reflinked file in orphan dir.
ocfs2: Use proper parameter for some inode operation.
ocfs2: Make transaction extend more efficient.
ocfs2: Don't merge in 1st refcount ops of reflink.
ocfs2: Modify removing xattr process for refcount.
ocfs2: Add reflink support for xattr.
ocfs2: Create an xattr indexed block if needed.
ocfs2: Call refcount tree remove process properly.
ocfs2: Attach xattr clusters to refcount tree.
ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
ocfs2: Abstract the creation of xattr block.
ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
...
In a clustered setup, we have to panic the box on journal abort. This is
because we don't have the facility to go hard readonly. With hard ro, another
node would detect node failure and initiate recovery.
Having said that, we shouldn't force panic if the volume is mounted locally.
This patch defers the handling to the mount option, errors.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Implement locking around struct ocfs2_refcount_tree. This protects
all read/write operations on refcount trees. ocfs2_refcount_tree
has its own lock and its own caching_info, protecting buffers among
multiple nodes.
User must call ocfs2_lock_refcount_tree before his operation on
the tree and unlock it after that.
ocfs2_refcount_trees are referenced by the block number of the
refcount tree root block, So we create an rb-tree on the ocfs2_super
to look them up.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Similar ip_last_trans, ip_created_trans tracks the creation of a journal
managed inode. This specifically tracks what transaction created the
inode. This is so the code can know if the inode has ever been written
to disk.
This behavior is desirable for any journal managed object. We move it
to struct ocfs2_caching_info as ci_created_trans so that any object
using ocfs2_caching_info can rely on this behavior.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We have the read side of metadata caching isolated to struct
ocfs2_caching_info, now we need the write side. This means the journal
functions. The journal only does a couple of things with struct inode.
This change moves the ip_last_trans field onto struct
ocfs2_caching_info as ci_last_trans. This field tells the journal
whether a pending journal flush is required.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We are really passing the inode into the ocfs2_read/write_blocks()
functions to get at the metadata cache. This commit passes the cache
directly into the metadata block functions, divorcing them from the
inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We don't really want to cart around too many new fields on the
ocfs2_caching_info structure. So let's wrap all our access of the
parent object in a set of operations. One pointer on caching_info, and
more flexibility to boot.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We want to use the ocfs2_caching_info structure in places that are not
inodes. To do that, it can no longer rely on referencing the inode
directly.
This patch moves the flags to ocfs2_caching_info->ci_flags, stores
pointers to the parent's locks on the ocfs2_caching_info, and renames
the constants and flags to reflect its independant state.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
If we fail to mount the filesystem, we have to be careful not to dereference
uninitialized structures in ocfs2_kill_sb.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We just set blockcheck stats to zeros but we should also
properly initialize the spinlock there.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In commit ea455f8ab6, we moved the dentry lock
put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
can drop references to inodes while they are being invalidated by
invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
ending in an infinite loop, "Busy inodes after umount" messages etc.).
We fix the problem by stopping ocfs2_wq from doing any further releasing of
inode references on the superblock being unmounted, wait until it finishes
the current round of releasing and finally cleaning up all the references in
dentry_lock_list from ocfs2_put_super().
The issue was tracked down by Tao Ma <tao.ma@oracle.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
If the mount fails for any reason, ocfs2_dismount_volume calls
ocfs2_orphan_scan_stop. It requires that ocfs2_orphan_scan_init
be called to setup the mutex and work queues, but that doesn't
happen if the mount has failed and we oops accessing an uninitialized
work queue.
This patch splits the init and startup of the orphan scan, eliminating
the oops.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2/trivial: Wrap ocfs2_sysfile_cluster_lock_key within define.
ocfs2: Add lockdep annotations
vfs: Set special lockdep map for dirs only if not set by fs
ocfs2: Disable orphan scanning for local and hard-ro mounts
ocfs2: Do not initialize lvb in ocfs2_orphan_scan_lock_res_init()
ocfs2: Stop orphan scan as early as possible during umount
ocfs2: Fix ocfs2_osb_dump()
ocfs2: Pin journal head before accessing jh->b_committed_data
ocfs2: Update atime in splice read if necessary.
ocfs2: Provide the ocfs2_dlm_lvb_valid() stack API.
Local and Hard-RO mounts do not need orphan scanning.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Currently if the orphan scan fires a tick before the user issues the umount,
the umount will wait for the queued orphan scan tasks to complete.
This patch makes the umount stop the orphan scan as early as possible so as
to reduce the probability of the queued tasks slowing down the umount.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Skip printing information that is not valid for local mounts.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".
Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2/net: Use wait_event() in o2net_send_message_vec()
ocfs2: Adjust rightmost path in ocfs2_add_branch.
ocfs2: fdatasync should skip unimportant metadata writeout
ocfs2: Remove redundant gotos in ocfs2_mount_volume()
ocfs2: Add statistics for the checksum and ecc operations.
ocfs2 patch to track delayed orphan scan timer statistics
ocfs2: timer to queue scan of all orphan slots
ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories
ocfs2: Fix possible deadlock in quota recovery
ocfs2: Fix possible deadlock with quotas in ocfs2_setattr()
ocfs2: Fix lock inversion in ocfs2_local_read_info()
ocfs2: Fix possible deadlock in ocfs2_global_read_dquot()
ocfs2: update comments in masklog.h
ocfs2: Don't printk the error when listing too many xattrs.
Move BKL into ->put_super from the only caller. A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.
[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It would be nice to know how often we get checksum failures. Even
better, how many of them we can fix with the single bit ecc. So, we add
a statistics structure. The structure can be installed into debugfs
wherever the user wants.
For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions. The stats are only
registered with debugfs when the filesystem supports metadata ecc.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock. When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert. The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.
A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.
This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case. The
sector size will be 4KB but the logical block size will remain
512-bytes. Hence we need to distinguish between the physical block size
and the logical ditto.
This patch renames hardsect_size to logical_block_size.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
During recovery, a node recovers orphans in it's slot and the dead node(s). But
if the dead nodes were holding orphans in offline slots, they will be left
unrecovered.
If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers it's own slot, which
leaves orphans in offline slots.
This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value,
and pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.
Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
This patch creates a per mount debugfs file, fs_state, which exposes
information like, cluster stack in use, states of the downconvert, recovery
and commit threads, number of journal txns, some allocation stats, list of
all slots, etc.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. While for xattr bucket, it is calculated by
the whole buckets. So we have to add a spin_lock to prevent multiple
processes calculating metaecc.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The superblock is read via a raw call. Validate it after we find it
from its signature.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Enable quota usage tracking on mount and disable it on umount. Also
add support for quota on and quota off quotactls and usrquota and
grpquota mount options. Add quota features among supported ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates a work queue for periodic syncing of locally cached quota
information to the global quota files. We constantly queue a delayed work
item, to get the periodic behavior.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Jan Kara <jack@suse.cz>
For each quota type each node has local quota file. In this file it stores
changes users have made to disk usage via this node. Once in a while this
information is synced to global file (and thus with other nodes) so that
limits enforcement at least aproximately works.
Global quota files contain all the information about usage and limits. It's
mostly handled by the generic VFS code (which implements a trie of structures
inside a quota file). We only have to provide functions to convert structures
from on-disk format to in-memory one. We also have to provide wrappers for
various quota functions starting transactions and acquiring necessary cluster
locks before the actual IO is really started.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds the Kconfig option "CONFIG_OCFS2_FS_POSIX_ACL"
and mount options "acl" to enable acls in Ocfs2.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Patch adds check for [no]user_xattr in ocfs2_show_options() that completes
the list of all mount options.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.
It's a pretty trivial change. Most functions are just renamed. The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.
Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem. It can even interact with JBD-based ocfs2 as long
as the journal is formated for JBD.
We provide a compatibility option so that paranoid people can still use
JBD for the time being. This will go away shortly.
[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
ocfs2_truncate_for_delete(). --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit. This parallels XFS. 64bit systems can handle the
larger inode numbers.
[ Added description of inode64 mount option in ocfs2.txt. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch implements storing extended attributes both in inode or a single
external block. We only store EA's in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink it's window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.
To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.
Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator is has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Do this instead of tracking absolute local alloc size. This avoids
needless re-calculatiion of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.
This was posted for review some time ago and I believe its been in -mm
since then.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As the fs recovery is asynchronous, there is a small chance that another
node can mount (and thus recover) the slot before the recovery thread
gets to it.
If this happens, the recovery thread will block indefinitely on the
journal/slot lock as that lock will be held for the duration of the mount
(by design) by the node assigned to that slot.
The solution implemented is to keep track of the journal replays using
a recovery generation in the journal inode, which will be incremented by the
thread replaying that journal. The recovery thread, before attempting the
blocking lock on the journal/slot lock, will compare the generation on disk
with what it has cached and skip recovery if it does not match.
This bug appears to have been inadvertently introduced during the mount/umount
vote removal by mainline commit 34d024f843. In the
mount voting scheme, the messaging would indirectly indicate that the slot
was being recovered.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch ensures the mount fails if the fs is unable to load the journal.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Inode allocation is modified to look in other nodes allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The filesystem gains the USERSPACE_STACK incomat bit and the
s_cluster_info field on the superblock. When a userspace stack is in
use, the name of the stack is stored on-disk for mount-time
verification.
The "cluster_stack" option is added to mount(2) processing. The mount
process needs to pass the matching stack name. If the passed name and
the on-disk name do not match, the mount is failed.
When using the classic o2cb stack, the incompat bit is *not* set and no
mount option is used other than the usual heartbeat=local. Thus, the
filesystem is compatible with older tools.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.
The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().
ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed,
and we're actually only moving the setting of node num a small amount
further in the mount process.
[ Disconnect when node query fails -- Joel ]
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.
We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.
The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.
With this change, only stackglue.c includes cluster/ headers.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 asks the cluster stack for the local node's node number for two
reasons; to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.
[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
truly opaque. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.
heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.
The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old recovery map was a bitmap of node numbers. This was sufficient
for the maximum node number of 254. Going forward, we want node numbers
to be UINT32. Thus, we need a new recovery map.
Note that we can't keep track of slots here. We must write down the
node number to recovery *before* we get the locks needed to convert a
node number into a slot number.
The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.
Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init(). This actually cleans up
ocfs2_initialize_super() a little as well. Following on, recovery cleaup
becomes part of ocfs2_recovery_exit().
A number of node map functions are rendered obsolete and are removed.
Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event. This is a cleanup from Mark.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
journal.c and dlmglue.c would refresh the slot map by hand. Instead, have
the update and clear functions do the work inside slot_map.c. The eventual
result is to make ocfs2_slot_info defined privately in slot_map.c
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Currently, when ocfs2 nodes connect via TCP, they advertise their
compatibility level. If the versions do not match, two nodes cannot speak
to each other and they disconnect. As a result, this provides no forward or
backwards compatibility.
This patch implements a simple protocol negotiation at the dlm level by
introducing a major/minor version number scheme for entities that
communicate. Specifically, o2dlm has a major/minor version for interaction
with o2dlm on other nodes, and ocfs2 itself has a major/minor version for
interacting with the filesystem on other nodes.
This will allow rolling upgrades of ocfs2 clusters when changes to the
locking or network protocols can be done in a backwards compatible manner.
In those cases, only the minor number is changed and the negotatied protocol
minor is returned from dlm join. In the far less likely event that a
required protocol change makes backwards compatibility impossible, we simply
bump the major number.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>