Commit Graph

1363 Commits

Author SHA1 Message Date
Theodore Ts'o
829fa70ddd ext4: validate that metadata blocks do not overlap superblock
A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.

This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-08-01 00:51:02 -04:00
Linus Torvalds
396d10993f The major change this cycle is deleting ext4's copy of the file system
encryption code and switching things over to using the copies in
 fs/crypto.  I've updated the MAINTAINERS file to add an entry for
 fs/crypto listing Jaeguk Kim and myself as the maintainers.
 
 There are also a number of bug fixes, most notably for some problems
 found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
 fixed is a writeback deadlock detected by generic/130, and some
 potential races in the metadata checksum code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXlbP9AAoJEPL5WVaVDYGjGxgIAJ9YIqme//yix63oHYLhDNea
 lY/TLqZrb9/TdDRvGyZa3jYaKaIejL53eEQS9nhEB/JI0sEiDpHmOrDOxdj8Hlsw
 fm7nJyh1u4vFKPyklCbIvLAje1vl8X/6OvqQiwh45gIxbbsFftaBWtccW+UtEkIP
 Fx65Vk7RehJ/sNrM0cRrwB79YAmDS8P6BPyzdMRk+vO/uFqyq7Auc+pkd+bTlw/m
 TDAEIunlk0Ovjx75ru1zaemL1JJx5ffehrJmGCcSUPHVbMObOEKIrlV50gAAKVhO
 qbZAri3mhDvyspSLuS/73L9skeCiWFLhvojCBGu4t2aa3JJolmItO7IpKi4HdRU=
 =bxGK
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The major change this cycle is deleting ext4's copy of the file system
  encryption code and switching things over to using the copies in
  fs/crypto.  I've updated the MAINTAINERS file to add an entry for
  fs/crypto listing Jaeguk Kim and myself as the maintainers.

  There are also a number of bug fixes, most notably for some problems
  found by American Fuzzy Lop (AFL) courtesy of Vegard Nossum.  Also
  fixed is a writeback deadlock detected by generic/130, and some
  potential races in the metadata checksum code"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: verify extent header depth
  ext4: short-cut orphan cleanup on error
  ext4: fix reference counting bug on block allocation error
  MAINTAINRES: fs-crypto maintainers update
  ext4 crypto: migrate into vfs's crypto engine
  ext2: fix filesystem deadlock while reading corrupted xattr block
  ext4: fix project quota accounting without quota limits enabled
  ext4: validate s_reserved_gdt_blocks on mount
  ext4: remove unused page_idx
  ext4: don't call ext4_should_journal_data() on the journal inode
  ext4: Fix WARN_ON_ONCE in ext4_commit_super()
  ext4: fix deadlock during page writeback
  ext4: correct error value of function verifying dx checksum
  ext4: avoid modifying checksum fields directly during checksum verification
  ext4: check for extents that wrap around
  jbd2: make journal y2038 safe
  jbd2: track more dependencies on transaction commit
  jbd2: move lockdep tracking to journal_s
  jbd2: move lockdep instrumentation for jbd2 handles
  ext4: respect the nobarrier mount option in nojournal mode
  ...
2016-07-26 18:35:55 -07:00
Vegard Nossum
c65d5c6c81 ext4: short-cut orphan cleanup on error
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.

    EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
    EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
    Aborting journal on device loop0-8.
    EXT4-fs (loop0): Remounting filesystem read-only
    EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
    EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
    EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
    EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
    EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
    [...]
    EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
    [...]

See-also: c9eb13a910 ("ext4: fix hang when processing corrupted orphaned inode list")
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-14 23:21:35 -04:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Theodore Ts'o
5b9554dc5b ext4: validate s_reserved_gdt_blocks on mount
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory.  Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.

Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-07-05 20:01:52 -04:00
Pranay Kr. Srivastava
4743f83990 ext4: Fix WARN_ON_ONCE in ext4_commit_super()
If there are racing calls to ext4_commit_super() it's possible for
another writeback of the superblock to result in the buffer being
marked with an error after we check if the buffer is marked as having
a write error and the buffer up-to-date flag is set again.  If that
happens mark_buffer_dirty() can end up throwing a WARN_ON_ONCE.

Fix this by moving this check to write before we call
write_buffer_dirty(), and keeping the buffer locked during this whole
sequence.

Signed-off-by: Pranay Kr. Srivastava <pranjas@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-04 10:24:52 -04:00
Daeho Jeong
b47820edd1 ext4: avoid modifying checksum fields directly during checksum verification
We temporally change checksum fields in buffers of some types of
metadata into '0' for verifying the checksum values. By doing this
without locking the buffer, some metadata's checksums, which are
being committed or written back to the storage, could be damaged.
In our test, several metadata blocks were found with damaged metadata
checksum value during recovery process. When we only verify the
checksum value, we have to avoid modifying checksum fields directly.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-07-03 17:51:39 -04:00
Mike Christie
dfec8a14fc fs: have ll_rw_block users pass in op and flags separately
This has ll_rw_block users pass in the operation and flags separately,
so ll_rw_block can setup the bio op and bi_rw flags on the bio that
is submitted.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Toshi Kani
87eefeb4e8 ext4: Add alignment check for DAX mount
When a partition is not aligned by 4KB, mount -o dax succeeds,
but any read/write access to the filesystem fails, except for
metadata update.

Call bdev_dax_supported() to perform proper precondition checks
which includes this partition alignment check.

Reported-by: Micah Parrish <micah.parrish@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-17 00:44:11 -06:00
Daeho Jeong
c8585c6fca ext4: fix races between changing inode journal mode and ext4_writepages
In ext4, there is a race condition between changing inode journal mode
and ext4_writepages(). While ext4_writepages() is executed on a
non-journalled mode inode, the inode's journal mode could be enabled
by ioctl() and then, some pages dirtied after switching the journal
mode will be still exposed to ext4_writepages() in non-journaled mode.
To resolve this problem, we use fs-wide per-cpu rw semaphore by Jan
Kara's suggestion because we don't want to waste ext4_inode_info's
space for this extra rare case.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-04-25 23:22:35 -04:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Theodore Ts'o
c325a67c72 ext4: ignore quota mount options if the quota feature is enabled
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present.  This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified.  This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-03 17:03:37 -04:00
Theodore Ts'o
8f0e8746b4 ext4: avoid calling dquot_get_next_id() if quota is not enabled
This should be fixed in the quota layer so we can test with the quota
mutex held, but for now, we need this to avoid tests from crashing the
kernel aborting the regression test suite.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-04-01 12:00:03 -04:00
Theodore Ts'o
daf647d2dd ext4: add lockdep annotations for i_data_sem
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode.  This results in this complaint:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&ei->i_data_sem);
                                lock(&s->s_dquot.dqio_mutex);
                                lock(&ei->i_data_sem);
   lock(&s->s_dquot.dqio_mutex);

Google-Bug-Id: 27907753

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-04-01 01:31:28 -04:00
Linus Torvalds
77d913178c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
 "This contains a rewrite of UDF handling of filename encoding to fix
  remaining overflow issues from Andrew Gabbasov and quota changes to
  support new Q_[X]GETNEXTQUOTA quotactl for VFS quota formats"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Fix possible GPF due to uninitialised pointers
  ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
  quota: Forbid Q_GETQUOTA and Q_GETNEXTQUOTA for frozen filesystem
  quota: Fix possible races during quota loading
  ocfs2: Implement get_next_id()
  quota_v2: Implement get_next_id() for V2 quota format
  quota: Add support for ->get_nextdqblk() for VFS quota
  udf: Merge linux specific translation into CS0 conversion function
  udf: Remove struct ustr as non-needed intermediate storage
  udf: Use separate buffer for copying split names
  udf: Adjust UDF_NAME_LEN to better reflect actual restrictions
  udf: Join functions for UTF8 and NLS conversions
  udf: Parameterize output length in udf_put_filename
  quota: Allow Q_GETQUOTA for frozen filesystem
  quota: Fixup comments about return value of Q_[X]GETNEXTQUOTA
2016-03-21 12:22:37 -07:00
Ales Novak
7915a861c0 ext4: print ext4 mount option data_err=abort correctly
If data_err=abort option is specified for an ext3/ext4 mount,
/proc/mounts does show it as "(null)". This is caused by token2str()
returning NULL for Opt_data_err_abort (due to its pattern containing
'=').

Signed-off-by: Ales Novak <alnovak@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-12 21:55:50 -05:00
Jan Kara
600be30a8b ext4: remove i_ioend_count
Remove counter of pending io ends as it is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 23:39:21 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Andreas Gruenbacher
2335d05f3a ext4: kill ext4_mballoc_ready
This variable, introduced in commit 9c191f70, is unnecessary: it is set
once the module has been initialized correctly, and ext4_fill_super
cannot run unless the module has been initialized correctly.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 22:41:05 -05:00
Jan Kara
82939d7999 ext4: convert to mbcache2
The conversion is generally straightforward. The only tricky part is
that xattr block corresponding to found mbcache entry can get freed
before we get buffer lock for that block. So we have to check whether
the entry is still valid after getting buffer lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-22 11:50:13 -05:00
Eric Sandeen
6332b9b5e7 ext4: Make Q_GETNEXTQUOTA work for quota in hidden inodes
We forgot to set .get_nextdqblk operation in quotactl_ops structure used
by ext4 when quota is using hidden inode thus the operation was not
really supported. Fix the omission.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-19 19:28:07 +01:00
Jan Kara
be6257b251 quota: Add support for ->get_nextdqblk() for VFS quota
Add infrastructure for supporting get_nextdqblk() callback for VFS
quotas. Translate the operation into a callback to appropriate
filesystem and consequently to quota format callback.

Signed-off-by: Jan Kara <jack@suse.cz>
2016-02-09 13:05:23 +01:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Linus Torvalds
391f2a16b7 Some locking and page fault bug fixes from Jan Kara, some ext4
encryption fixes from me, and Li Xi's Project Quota commits.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJWnwkiAAoJEPL5WVaVDYGjyAAH/1dj1nNL9h+d12V3zXbvoPkg
 5RFw/2QfMZ+GE3Lln9gxTBDSyo/9m8hUK8eg0WpIRtGX9NbKcyrWEGJa2XF++43k
 tVpKGyN6cqkwPu4M6EPIK9yRvuALGB5PJE/u0q1lu9VoIAgtin3F/bAQK/iHnrUg
 M3+lVDtKcmbhqCdocaLLZD6Q4xlQI3wJne99pYt+Dtx95aOQY9v9SV030i7sOnEt
 R5JrhmfkgNqVTB8Zz0IxOp5LQlOkuyvtnZ44yYgJH8ckCUnDQI2hbksSqcMamJ1Y
 QJWBzRhVXU9gs1nCRy/Xh48mSk+nvZW9aglk+Syzbzg5C63SgwYcqvbCBqJJEdc=
 =HjkT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Some locking and page fault bug fixes from Jan Kara, some ext4
  encryption fixes from me, and Li Xi's Project Quota commits"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  fs: clean up the flags definition in uapi/linux/fs.h
  ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR interface support
  ext4: add project quota support
  ext4: adds project ID support
  ext4 crypto: simplify interfaces to directory entry insert functions
  ext4 crypto: add missing locking for keyring_key access
  ext4: use pre-zeroed blocks for DAX page faults
  ext4: implement allocation of pre-zeroed blocks
  ext4: provide ext4_issue_zeroout()
  ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag
  ext4: document lock ordering
  ext4: fix races of writeback with punch hole and zero range
  ext4: fix races between buffered IO and collapse / insert range
  ext4: move unlocked dio protection from ext4_alloc_file_blocks()
  ext4: fix races between page faults and hole punching
2016-01-22 11:23:35 -08:00
Vladimir Davydov
5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Li Xi
689c958cbe ext4: add project quota support
This patch adds mount options for enabling/disabling project quota
accounting and enforcement. A new specific inode is also used for
project quota accounting.

[ Includes fix from Dan Carpenter to crrect error checking from dqget(). ]

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:22 -05:00
Li Xi
040cb3786d ext4: adds project ID support
Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-01-08 16:01:21 -05:00
Jan Kara
e74031fd7e ext4: document lock ordering
We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:35:49 -05:00
Jan Kara
ea3d7209ca ext4: fix races between page faults and hole punching
Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-12-07 14:28:03 -05:00
Dan Williams
ef83b6e8f4 ext2, ext4: warn when mounting with dax enabled
Similar to XFS warn when mounting DAX while it is still considered under
development.  Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem.  The
maturity of a given DAX implementation is filesystem specific.

Cc: <stable@vger.kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-ext4@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2015-11-16 09:43:54 -08:00
Linus Torvalds
ad804a0b2a Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton:

 - most of the rest of MM

 - procfs

 - lib/ updates

 - printk updates

 - bitops infrastructure tweaks

 - checkpatch updates

 - nilfs2 update

 - signals

 - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
   dma-debug, dma-mapping, ...

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
  ipc,msg: drop dst nil validation in copy_msg
  include/linux/zutil.h: fix usage example of zlib_adler32()
  panic: release stale console lock to always get the logbuf printed out
  dma-debug: check nents in dma_sync_sg*
  dma-mapping: tidy up dma_parms default handling
  pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
  kexec: use file name as the output message prefix
  fs, seqfile: always allow oom killer
  seq_file: reuse string_escape_str()
  fs/seq_file: use seq_* helpers in seq_hex_dump()
  coredump: change zap_threads() and zap_process() to use for_each_thread()
  coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
  signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
  signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
  signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
  signals: kill block_all_signals() and unblock_all_signals()
  nilfs2: fix gcc uninitialized-variable warnings in powerpc build
  nilfs2: fix gcc unused-but-set-variable warnings
  MAINTAINERS: nilfs2: add header file for tracing
  nilfs2: add tracepoints for analyzing reading and writing metadata files
  ...
2015-11-07 14:32:45 -08:00
Mel Gorman
d0164adc89 mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Dmitry Monakhov
1e381f60da ext4: do not allow journal_opts for fs w/o journal
It is appeared that we can pass journal related mount options and such options
be shown in /proc/mounts

Example:
#mkfs.ext4 -F /dev/vdb
#tune2fs -O ^has_journal /dev/vdb
#mount /dev/vdb /mnt/  -ocommit=20,journal_async_commit
#cat /proc/mounts  | grep /mnt
 /dev/vdb /mnt ext4 rw,relatime,journal_checksum,journal_async_commit,commit=20,data=ordered 0 0

But options:"journal_checksum,journal_async_commit,commit=20,data=ordered" has
nothing with reality because there is no journal at all.

This patch disallow following options for journalless configurations:
 - journal_checksum
 - journal_async_commit
 - commit=%ld
 - data={writeback,ordered,journal}

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2015-10-18 23:50:26 -04:00
Dmitry Monakhov
c93cf2d757 ext4: explicit mount options parsing cleanup
Currently MOPT_EXPLICIT treated as EXPLICIT_DELALLOC which may be changed
in future. Let's fix it now.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-18 23:35:32 -04:00
Daeho Jeong
4327ba52af ext4, jbd2: ensure entering into panic after recording an error in superblock
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option.  But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.

Task A                        Task B
ext4_handle_error()
-> jbd2_journal_abort()
  -> __journal_abort_soft()
    -> __jbd2_journal_abort_hard()
    | -> journal->j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -> jbd2_journal_abort()
    |                         | -> __journal_abort_soft()
    |                         |   -> if (journal->j_flags & JBD2_ABORT)
    |                         |           return;
    |                         -> panic()
    |
    -> jbd2_journal_update_sb_errno()

Tested-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-10-18 17:02:56 -04:00
Darrick J. Wong
e2b911c535 ext4: clean up feature test macros with predicate functions
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros.  Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-10-17 16:18:43 -04:00
Darrick J. Wong
6a797d2737 ext4: call out CRC and corruption errors with specific error codes
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:04 -04:00
Darrick J. Wong
8c81bd8f58 ext4: store checksum seed in superblock
Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 16:16:02 -04:00
Theodore Ts'o
ebd173beb8 ext4: move procfs registration code to fs/ext4/sysfs.c
This allows us to refactor the procfs code, which saves a bit of
compiled space.  More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:46:17 -04:00
Theodore Ts'o
b579901882 ext4: move sysfs code from super.c to fs/ext4/sysfs.c
Also statically allocate the ext4_kset and ext4_feat objects, since we
only need exactly one of each, and it's simpler and less code if we
drop the dynamic allocation and deallocation when it's not needed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:44:17 -04:00
Kees Cook
a068acf2ee fs: create and use seq_show_option for escaping
Many file systems that implement the show_options hook fail to correctly
escape their output which could lead to unescaped characters (e.g.  new
lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
could lead to confusion, spoofed entries (resulting in things like
systemd issuing false d-bus "mount" notifications), and who knows what
else.  This looks like it would only be the root user stepping on
themselves, but it's possible weird things could happen in containers or
in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the
contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and
seq_show_option_n helpers, and updating the vulnerable show_option
handlers to use them as needed.  Some, like SELinux, need to be open
coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds
ea814ab9aa Pretty much all bug fixes and clean ups for 4.3, after a lot of
features and other churn going into 4.2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJV55TlAAoJEPL5WVaVDYGjyzYH/1WtZpIzRjp7o+3H4/vFqONg
 R1Fsw785C1w8WX2QuIK/m31u4XO+VeCV4jWA9DuqnSzWm9w9C/4kTqITd4Hwp416
 /9gJvYoZHHaDikxpWWADptDi8IoLohTlcFVCHIvvf53cGehVEEsc2WOijUZo7Cgv
 O454Nm3tK0CQ3yrCIlf5SyvkUZSMTiawLLJJzd4GCyvU13C1SnABNQj8UxKisBA5
 cP8q4O2nPg/S9rkYxnFAifQyZppd3jMvorUaq9eHiWMjl95o6e/6+wYGnHhoFUvr
 /P1dNjJYbzk+TUzlsDkq2zANK2UsB3iNNi8YwLFOpfFcuYopmUAYRIWOgIZWYUQ=
 =ijuI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Pretty much all bug fixes and clean ups for 4.3, after a lot of
  features and other churn going into 4.2"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  Revert "ext4: remove block_device_ejected"
  ext4: ratelimit the file system mounted message
  ext4: silence a format string false positive
  ext4: simplify some code in read_mmp_block()
  ext4: don't manipulate recovery flag when freezing no-journal fs
  jbd2: limit number of reserved credits
  ext4 crypto: remove duplicate header file
  ext4: update c/mtime on truncate up
  jbd2: avoid infinite loop when destroying aborted journal
  ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
  ext4 crypto: fix spelling typo in comment
  ext4 crypto: exit cleanly if ext4_derive_key_aes() fails
  ext4: reject journal options for ext2 mounts
  ext4: implement cgroup writeback support
  ext4: replace ext4_io_submit->io_op with ->io_wbc
  ext4 crypto: check for too-short encrypted file names
  ext4 crypto: use a jbd2 transaction when adding a crypto policy
  jbd2: speedup jbd2_journal_dirty_metadata()
2015-09-03 12:52:19 -07:00
Theodore Ts'o
bdfe0cbd74 Revert "ext4: remove block_device_ejected"
This reverts commit 08439fec26.

Unfortunately we still need to test for bdi->dev to avoid a crash when a
USB stick is yanked out while a file system is mounted:

   usb 2-2: USB disconnect, device number 2
   Buffer I/O error on dev sdb1, logical block 15237120, lost sync page write
   JBD2: Error -5 detected when updating journal superblock for sdb1-8.
   BUG: unable to handle kernel paging request at 34beb000
   IP: [<c136ce88>] __percpu_counter_add+0x18/0xc0
   *pdpt = 0000000023db9001 *pde = 0000000000000000 
   Oops: 0000 [#1] SMP 
   CPU: 0 PID: 4083 Comm: umount Tainted: G     U     OE   4.1.1-040101-generic #201507011435
   Hardware name: LENOVO 7675CTO/7675CTO, BIOS 7NETC2WW (2.22 ) 03/22/2011
   task: ebf06b50 ti: ebebc000 task.ti: ebebc000
   EIP: 0060:[<c136ce88>] EFLAGS: 00010082 CPU: 0
   EIP is at __percpu_counter_add+0x18/0xc0
   EAX: f21c8e88 EBX: f21c8e88 ECX: 00000000 EDX: 00000001
   ESI: 00000001 EDI: 00000000 EBP: ebebde60 ESP: ebebde40
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
   CR0: 8005003b CR2: 34beb000 CR3: 33354200 CR4: 000007f0
   Stack:
    c1abe100 edcb0098 edcb00ec ffffffff f21c8e68 ffffffff f21c8e68 f286d160
    ebebde84 c1160454 00000010 00000282 f72a77f8 00000984 f72a77f8 f286d160
    f286d170 ebebdea0 c11e613f 00000000 00000282 f72a77f8 edd7f4d0 00000000
   Call Trace:
    [<c1160454>] account_page_dirtied+0x74/0x110
    [<c11e613f>] __set_page_dirty+0x3f/0xb0
    [<c11e6203>] mark_buffer_dirty+0x53/0xc0
    [<c124a0cb>] ext4_commit_super+0x17b/0x250
    [<c124ac71>] ext4_put_super+0xc1/0x320
    [<c11f04ba>] ? fsnotify_unmount_inodes+0x1aa/0x1c0
    [<c11cfeda>] ? evict_inodes+0xca/0xe0
    [<c11b925a>] generic_shutdown_super+0x6a/0xe0
    [<c10a1df0>] ? prepare_to_wait_event+0xd0/0xd0
    [<c1165a50>] ? unregister_shrinker+0x40/0x50
    [<c11b92f6>] kill_block_super+0x26/0x70
    [<c11b94f5>] deactivate_locked_super+0x45/0x80
    [<c11ba007>] deactivate_super+0x47/0x60
    [<c11d2b39>] cleanup_mnt+0x39/0x80
    [<c11d2bc0>] __cleanup_mnt+0x10/0x20
    [<c1080b51>] task_work_run+0x91/0xd0
    [<c1011e3c>] do_notify_resume+0x7c/0x90
    [<c1720da5>] work_notify
   Code: 8b 55 e8 e9 f4 fe ff ff 90 90 90 90 90 90 90 90 90 90 90 55 89 e5 83 ec 20 89 5d f4 89 c3 89 75 f8 89 d6 89 7d fc 89 cf 8b 48 14 <64> 8b 01 89 45 ec 89 c2 8b 45 08 c1 fa 1f 01 75 ec 89 55 f0 89
   EIP: [<c136ce88>] __percpu_counter_add+0x18/0xc0 SS:ESP 0068:ebebde40
   CR2: 0000000034beb000
   ---[ end trace dd564a7bea834ecd ]---

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101011

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-16 10:03:57 -04:00
Theodore Ts'o
e294a5371b ext4: ratelimit the file system mounted message
The xfstests ext4/305 will mount and unmount the same file system over
4,000 times, and each one of these will cause a system log message.
Ratelimit this message since if we are getting more than a few dozen
of these messages, they probably aren't going to be helpful.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-08-15 14:59:44 -04:00
Eric Sandeen
c642dc9e1a ext4: don't manipulate recovery flag when freezing no-journal fs
At some point along this sequence of changes:

f6e63f9 ext4: fold ext4_nojournal_sops into ext4_sops
bb04457 ext4: support freezing ext2 (nojournal) file systems
9ca9238 ext4: Use separate super_operations structure for no_journal filesystems

ext4 started setting needs_recovery on filesystems without journals
when they are unfrozen.  This makes no sense, and in fact confuses
blkid to the point where it doesn't recognize the filesystem at all.

(freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
see needs_recovery set on fs w/ no journal).

To fix this, don't manipulate the INCOMPAT_RECOVER feature on
filesystems without journals.

Reported-by: Stu Mark <smark@datto.com>
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-08-15 10:45:06 -04:00
Jan Kara
c290ea01ab fs: Remove ext3 filesystem driver
The functionality of ext3 is fully supported by ext4 driver. Major
distributions (SUSE, RedHat) already use ext4 driver to handle ext3
filesystems for quite some time. There is some ugliness in mm resulting
from jbd cleaning buffers in a dirty page without cleaning page dirty
bit and also support for buffer bouncing in the block layer when stable
pages are required is there only because of jbd. So let's remove the
ext3 driver. This saves us some 28k lines of duplicated code.

Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23 20:59:40 +02:00
Daeho Jeong
564bc40252 ext4, jbd2: add REQ_FUA flag when recording an error in the superblock
When an error condition is detected, an error status should be recorded into
superblocks of EXT4 or JBD2. However, the write request is submitted now
without REQ_FUA flag, even in "barrier=1" mode, which is followed by
panic() function in "errors=panic" mode. On mobile devices which make
whole system reset as soon as kernel panic occurs, this write request
containing an error flag will disappear just from storage cache without
written to the physical cells. Therefore, when next start, even forever,
the error flag cannot be shown in both superblocks, and e2fsck cannot fix
the filesystem problems automatically, unless e2fsck is executed in
force checking mode.

[ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-23 09:46:11 -04:00
Carlos Maiolino
5ba92bcf0d ext4: reject journal options for ext2 mounts
There is no reason to allow ext2 filesystems be mounted with journal
mount options. So, this patch adds them to the MOPT_NO_EXT2 mount
options list.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:57:59 -04:00
Tejun Heo
001e4a8775 ext4: implement cgroup writeback support
For ordered and writeback data modes, all data IOs go through
ext4_io_submit.  This patch adds cgroup writeback support by invoking
wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
io_submit_add_bh().  Journal data which is written by jbd2 worker is
left alone by this patch and will always be written out from the root
cgroup.

ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
either ordered or writeback.  In journaled data mode, most IOs become
synchronous through the journal and enabling cgroup writeback support
doesn't make much sense or difference.  Journaled data mode is left
alone.

Lightly tested with sequential data write workload.  Behaves as
expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-07-21 23:51:26 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Linus Torvalds
47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Rasmus Villemoes
ec3904dc65 fs/ext4/super.c: use strreplace() in ext4_fill_super()
This makes a very large function a little smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:41 -07:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Miklos Szeredi
9bf39ab2ad vfs: add file_path() helper
Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:00:05 -04:00
Theodore Ts'o
a2fd66d069 ext4: set lazytime on remount if MS_LAZYTIME is set by mount
Newer versions of mount parse the lazytime feature and pass it to the
mount system call via the flags field in the mount system call,
removing the lazytime string from the mount options list.  So we need
to check for the presence of MS_LAZYTIME and set it in sb->s_flags in
order for this flag to be set on a remount.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-23 11:03:54 -04:00
Theodore Ts'o
c5e298ae53 ext4: prevent ext4_quota_write() from failing due to ENOSPC
In order to prevent quota block tracking to be inaccurate when
ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
file can now use the reserved block (since the quota file is arguably
file system metadata), and ext4_quota_write() now uses
ext4_should_retry_alloc() to retry the block allocation after a commit
has completed and released some blocks for allocation.

This fixes failures of xfstests generic/270:

Quota error (device vdc): write_blk: dquota write failed
Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-21 01:25:29 -04:00
Theodore Ts'o
89d96a6f8e ext4: call sync_blockdev() before invalidate_bdev() in put_super()
Normally all of the buffers will have been forced out to disk before
we call invalidate_bdev(), but there will be some cases, where a file
system operation was aborted due to an ext4_error(), where there may
still be some dirty buffers in the buffer cache for the device.  So
try to force them out to memory before calling invalidate_bdev().

This fixes a warning triggered by generic/081:

WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-06-20 22:50:33 -04:00
Andreas Dilger
b03a2f7eb2 ext4: improve warning directory handling messages
Several ext4_warning() messages in the directory handling code do not
report the inode number of the (potentially corrupt) directory where a
problem is seen, and others report this in an ad-hoc manner.  Add an
ext4_warning_inode() helper to print the inode number and command name
consistent with ext4_error_inode().

Consolidate the place in ext4.h that these macros are defined.

Clean up some other directory error and warning messages to print the
calling function name.

Minor code style fixes in nearby lines.

Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-15 14:50:26 -04:00
Theodore Ts'o
1cb767cd4a ext4 crypto: fail the mount if blocksize != pagesize
We currently don't correctly handle the case where blocksize !=
pagesize, so disallow the mount in those cases.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-12 23:44:33 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Chao Yu
e298e73bd7 ext4 crypto: release crypto resource on module exit
Crypto resource should be released when ext4 module exits, otherwise
it will cause memory leak.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:37:35 -04:00
Theodore Ts'o
c936e1ec28 ext4 crypto: use per-inode tfm structure
As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
we read or write a page.  Instead we can use a single tfm hanging off
the inode's crypt_info structure for all of our encryption needs for
that inode, since the tfm can be used by multiple crypto requests in
parallel.

Also use cmpxchg() to avoid races that could result in crypt_info
structure getting doubly allocated or doubly freed.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-31 13:34:22 -04:00
Theodore Ts'o
f5aed2c2a8 ext4: clean up superblock encryption mode fields
The superblock fields s_file_encryption_mode and s_dir_encryption_mode
are vestigal, so remove them as a cleanup.  While we're at it, allow
file systems with both encryption and inline_data enabled at the same
time to work correctly.  We can't have encrypted inodes with inline
data, but there's no reason to prohibit unencrypted inodes from using
the inline data feature.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:18:47 -04:00
Theodore Ts'o
b7236e21d5 ext4 crypto: reorganize how we store keys in the inode
This is a pretty massive patch which does a number of different things:

1) The per-inode encryption information is now stored in an allocated
   data structure, ext4_crypt_info, instead of directly in the node.
   This reduces the size usage of an in-memory inode when it is not
   using encryption.

2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
   encryption structure instead.  This remove an unnecessary memory
   allocation and free for the fname_crypto_ctx as well as allowing us
   to reuse the ctfm in a directory for multiple lookups and file
   creations.

3) We also cache the inode's policy information in the ext4_crypt_info
   structure so we don't have to continually read it out of the
   extended attributes.

4) We now keep the keyring key in the inode's encryption structure
   instead of releasing it after we are done using it to derive the
   per-inode key.  This allows us to test to see if the key has been
   revoked; if it has, we prevent the use of the derived key and free
   it.

5) When an inode is released (or when the derived key is freed), we
   will use memset_explicit() to zero out the derived key, so it's not
   left hanging around in memory.  This implies that when a user logs
   out, it is important to first revoke the key, and then unlink it,
   and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
   release any decrypted pages and dcache entries from the system
   caches.

6) All this, and we also shrink the number of lines of code by around
   100.  :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:17:47 -04:00
Theodore Ts'o
e2881b1b51 ext4 crypto: separate kernel and userspace structure for the key
Use struct ext4_encryption_key only for the master key passed via the
kernel keyring.

For internal kernel space users, we now use struct ext4_crypt_info.
This will allow us to put information from the policy structure so we
can cache it and avoid needing to constantly looking up the extended
attribute.  We will do this in a spearate patch.  This patch is mostly
mechnical to make it easier for patch review.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-18 13:16:47 -04:00
Theodore Ts'o
1b46617b8d ext4: don't save the error information if the block device is read-only
Google-Bug-Id: 20939131
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-14 18:37:30 -04:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Linus Torvalds
6162e4b0be A few bug fixes and add support for file-system level encryption in ext4.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVMvVGAAoJEPL5WVaVDYGjjZgH/0Z4bdtQpuQKAd2EoSUhiOh4
 tReqE1IuTU+urrL9qNA4qUFhAKq0Iju0INrnoYNb1+YxZ2myvUrMY4y2GkapaKgZ
 SFYL8LTS7E79/LuR6q1SFmUYoXCjqpWeHb7rAZ9OluSNQhke8SWdywLnp/0q05Go
 6SDwYdT8trxGED/wYTGPy9zMHcYEYHqIIvfFZd3eYtRnaP42Zo5rUvISg3cP0ekG
 LiX2D9Bi9pyqxgMjTG0+0xiC3ohTfXOujyHbnLVQ7kdZmpzZKfQspoczEIUolYb4
 /Ic4qPQQdbtjooQ7uRYUOFXeVjt7HZuTb3aVmh90RWrEhsLsyBmNd9StLFVdlcg=
 =9f7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A few bug fixes and add support for file-system level encryption in
  ext4"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4 crypto: enable encryption feature flag
  ext4 crypto: add symlink encryption
  ext4 crypto: enable filename encryption
  ext4 crypto: filename encryption modifications
  ext4 crypto: partial update to namei.c for fname crypto
  ext4 crypto: insert encrypted filenames into a leaf directory block
  ext4 crypto: teach ext4_htree_store_dirent() to store decrypted filenames
  ext4 crypto: filename encryption facilities
  ext4 crypto: implement the ext4 decryption read path
  ext4 crypto: implement the ext4 encryption write path
  ext4 crypto: inherit encryption policies on inode and directory create
  ext4 crypto: enforce context consistency
  ext4 crypto: add encryption key management facilities
  ext4 crypto: add ext4 encryption facilities
  ext4 crypto: add encryption policy and password salt support
  ext4 crypto: add encryption xattr support
  ext4 crypto: export ext4_empty_dir()
  ext4 crypto: add ext4 encryption Kconfig
  ext4 crypto: reserve codepoints used by the ext4 encryption feature
  ext4 crypto: add ext4_mpage_readpages()
  ...
2015-04-19 14:26:31 -07:00
Theodore Ts'o
6ddb244784 ext4 crypto: enable encryption feature flag
Also add the test dummy encryption mode flag so we can more easily
test the encryption patches using xfstests.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-16 01:56:00 -04:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Michael Halcrow
b30ab0e034 ext4 crypto: add ext4 encryption facilities
On encrypt, we will re-assign the buffer_heads to point to a bounce
page rather than the control_page (which is the original page to write
that contains the plaintext). The block I/O occurs against the bounce
page.  On write completion, we re-assign the buffer_heads to the
original plaintext page.

On decrypt, we will attach a read completion callback to the bio
struct. This read completion will decrypt the read contents in-place
prior to setting the page up-to-date.

The current encryption mode, AES-256-XTS, lacks cryptographic
integrity. AES-256-GCM is in-plan, but we will need to devise a
mechanism for handling the integrity data.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: Ildar Muslukhov <ildarm@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-12 00:43:56 -04:00
Christoph Hellwig
08439fec26 ext4: remove block_device_ejected
bdi->dev now never goes away, so this function became useless.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:56:32 -04:00
Sheng Yong
72b8e0f9fa ext4: remove unused header files
Remove unused header files and header files which are included in
ext4.h.

Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-04-02 23:47:42 -04:00
Jan Kara
0a240339a8 quota: Make VFS quotas use new interface for getting quota info
Create new internal interface for getting information about quota which
contains everything needed for both VFS quotas and XFS quotas. Make VFS
use this and hook it up to Q_GETINFO.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-03-04 16:06:34 +01:00
Linus Torvalds
feaf222925 Ext4 bug fixes for 3.20. We also reserved code points for encryption
and read-only images (for which the implementation is mostly just the
 reserved code point for a read-only feature :-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU6lssAAoJENNvdpvBGATwpsEQAOcpCqj0gp/istbsoFpl5v5K
 +BU2aPvR5CPLtUQz9MqrVF5/6zwDbHGN+GIB6CEmh/qHIVQAhhS4XR+opSc7qqUr
 fAQ1AhL5Oh8Dyn9DRy5Io8oRv+wo5lRdD7aG7SPiizCMRQ34JwJ2sWIAwbP2Ea7W
 Xg51v3LWEu+UpqpgY3YWBoJKHj4hXwFvTVOCHs94239Y2zlcg2c4WwbKPzkvPcV/
 TvvZOOctty+l3FOB2bqFj3VnvywQmNv8/OixKjSprxlR7nuQlhKaLTWCtRjFbND4
 J/rk2ls5Bl79dnMvyVfV5ghpmGYBf5kkXCP716YsQkRCZUfNVrTOPJrNHZtYilAb
 opRo2UjAyTWxZBvyssnCorHJZUdxlYeIuSTpaG0zUbR0Y6p/7qd31F5k41GbBCFf
 B0lV3IaiVnXk23S2jFVHGhrzoKdFqu30tY7LMaO4xyGVMigOZJyBu8TZ7Utj9HmW
 /4GfjlvYqlfB7p+6yBkDv/87hjdmfMWIw48A7xWCiIeguQhB79gwTV7uAHVtgfng
 h5RF2EH/fx5klbAZx9vlaAh3pGFBHbh9fkeBmW9qNm7glz7aMUuxQaSo6X8HrCAJ
 LrECgDGbuiOHnMYuzZRERZiqwLB7JT82C1xopGzefsE/i0kN1eMjITkfggjQ5whu
 caLPn49tAb9U8P6TsPeE
 =PF+t
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Ext4 bug fixes.

  We also reserved code points for encryption and read-only images (for
  which the implementation is mostly just the reserved code point for a
  read-only feature :-)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix indirect punch hole corruption
  ext4: ignore journal checksum on remount; don't fail
  ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
  ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
  ext4: support read-only images
  ext4: change to use setup_timer() instead of init_timer()
  ext4: reserve codepoints used by the ext4 encryption feature
  jbd2: complain about descriptor block checksum errors
2015-02-22 18:05:13 -08:00
Linus Torvalds
038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Ross Zwisler
923ae0ff92 ext4: add DAX functionality
This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Eric Sandeen
2d5b86e048 ext4: ignore journal checksum on remount; don't fail
As of v3.18, ext4 started rejecting a remount which changes the
journal_checksum option.

Prior to that, it was simply ignored; the problem here is that
if someone has this in their fstab for the root fs, now the box
fails to boot properly, because remount of root with the new options
will fail, and the box proceeds with a readonly root.

I think it is a little nicer behavior to accept the option, but
warn that it's being ignored, rather than failing the mount,
but that might be a subjective matter...

Reported-by: Cónräd <conradsand.arma@gmail.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:07:37 -05:00
Eric Sandeen
b94a8b36be ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
rejection of, changing journal_checksum during remount.  One suffices.

While we're at it, remove old comment about the "check" option
which has been deprecated for some time now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 23:04:27 -05:00
Darrick J. Wong
2cb5cc8b09 ext4: support read-only images
Add a rocompat feature, "readonly" to mark a FS image as read-only.
The feature prevents the kernel and e2fsprogs from changing the image;
the flag can be toggled by tune2fs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-02-12 22:31:21 -05:00
Linus Torvalds
6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Theodore Ts'o
a26f49926d ext4: add optimization for the lazytime mount option
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-05 02:45:00 -05:00
Jan Kara
1fa5efe362 ext4: Use generic helpers for quotaon and quotaoff
Ext4 can just use the generic helpers provided by quota code for turning
quotas on and off when quota files are stored as system inodes. The only
difference is the feature test in ext4_quota_on_sysfile() but the same
is achieved in dquot_quota_enable() by checking whether usage tracking
for the corresponding quota type is enabled (which can happen only if
quota feature is set).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-30 12:50:42 +01:00
Jan Mrazek
04ecddb73d ext4: change to use setup_timer() instead of init_timer()
Signed-off-by: Jan Mrazek <email@honzamrazek.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-26 14:42:31 -05:00
Christoph Hellwig
de1414a654 fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:04 -07:00
Linus Torvalds
3b421b80be Revert a potential seek_data/hole regression which shows up when using
ext4 to handle ext3 file systems, plus two minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUqxTQAAoJENNvdpvBGATwUC8QAIYfP02XYyGrBQIoAoCiRJji
 TmhAe2Amy9xHgBq2I8Yz3bi8j1wGuiqc57fYgwwVScf10tzmO/HuwnAZZDAmg9hK
 ZWp9WyPoyf+U/nbkIfC5mRh3Qz0dt1pt6R3uQDUlcUuAamdMBrdJnhkQC6WMbpU2
 fAqsJT3/yGrLnMF29eVqJzcxb5KORJ8hEcD7kwkvJwe4sGm3C7iDsjS0i63YWDz4
 QNclW6zF4THhmuVNxwRupOgMQNSq8sHg8U23nP4DZLvLE7GlgtwfDvehU7uBfw5n
 WO5UfsEYLoeODNmujUJCtjXNLpzDXmrtByyWbbTK7EX3MmV94ym4uu5lHLfyMiTc
 o2ppxcsKBVcOsPWnFwuhJ5p/Wyy0Uld9Q3P6b5ymhyzDhkuwcTURpeRxBRXHEgcm
 nY5GE1bBdO7OigDz/+DFL/Zgr8EO7hW72hrBaLDWMEbrrl0asZw/ReC/bnreMmm4
 sP87DB+MqRXzRs8aOPWmCofJwGSgCYmOq2nqNCAaxgk/ofvrDURrnZYfLrbzspGa
 hqE1W0X5hvQydcifi4qq2Na76+Js3atSY38EOH/HNknSqlQjysnkW4ajTDWk/GFy
 M/fKUCfIl1tmCMN2myZzl89E7uMSyod75ycd0BQy36iHPE14JVvk/u7GfcKHLs53
 1rAPpW90a72GX2Z9+xxA
 =uPYz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Revert a potential seek_data/hole regression which shows up when using
  ext4 to handle ext3 file systems, plus two minor bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove spurious KERN_INFO from ext4_warning call
  Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
  ext4: prevent online resize with backup superblock
2015-01-06 14:05:40 -08:00
Jakub Wilk
363307e6e5 ext4: remove spurious KERN_INFO from ext4_warning call
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:31:14 -05:00
Linus Torvalds
9bfccec24e Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
 under heavy memory pressure, and Eric Whitney's bigalloc fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
 4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
 R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
 PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
 fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
 YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
 jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
 3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
 BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
 HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
 ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
 nNB2zVrHZDMvZ7MCZy85
 =ZrTc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bugs fixes, including Zheng and Jan's extent status shrinker
  fixes, which should improve CPU utilization and potential soft lockups
  under heavy memory pressure, and Eric Whitney's bigalloc fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
  ext4: fix suboptimal seek_{data,hole} extents traversial
  ext4: ext4_inline_data_fiemap should respect callers argument
  ext4: prevent fsreentrance deadlock for inline_data
  ext4: forbid journal_async_commit in data=ordered mode
  jbd2: remove unnecessary NULL check before iput()
  ext4: Remove an unnecessary check for NULL before iput()
  ext4: remove unneeded code in ext4_unlink
  ext4: don't count external journal blocks as overhead
  ext4: remove never taken branch from ext4_ext_shift_path_extents()
  ext4: create nojournal_checksum mount option
  ext4: update comments regarding ext4_delete_inode()
  ext4: cleanup GFP flags inside resize path
  ext4: introduce aging to extent status tree
  ext4: cleanup flag definitions for extent status tree
  ext4: limit number of scanned extents in status tree shrinker
  ext4: move handling of list of shrinkable inodes into extent status code
  ext4: change LRU to round-robin in extent status tree shrinker
  ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  ext4: fix block reservation for bigalloc filesystems
  ...
2014-12-12 09:28:03 -08:00
Jan Kara
d4f7610743 ext4: forbid journal_async_commit in data=ordered mode
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.

To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:19:17 -05:00
Eric Sandeen
b003b52496 ext4: don't count external journal blocks as overhead
This was fixed for ext3 with:

e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs

but was never fixed for ext4.

With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:27:44 -05:00
Darrick J. Wong
c6d3d56dd0 ext4: create nojournal_checksum mount option
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:20:50 -05:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Jan Kara
dd47592551 ext4: limit number of scanned extents in status tree shrinker
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.

Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:51:23 -05:00
Jan Kara
b0dea4c165 ext4: move handling of list of shrinkable inodes into extent status code
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.

Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:49:25 -05:00
Zheng Liu
edaa53cac8 ext4: change LRU to round-robin in extent status tree shrinker
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.

We replace the lru ordering with a simple round-robin.  After that we
never need to keep a lru list.  That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:45:37 -05:00
Al Viro
b93b41d4c7 ext4: kill ext4_kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20 12:19:11 -05:00
Jan Kara
96c7e0d964 ext4: Convert to private i_dquot field
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10 10:06:11 +01:00
Darrick J. Wong
50460fe8c6 ext4: remove extent status procfs files if journal load fails
If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Darrick J. Wong
6b992ff256 ext4: disallow changing journal_csum option during remount
ext4 does not permit changing the metadata or journal checksum feature
flag while mounted.  Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-30 10:53:16 -04:00
Darrick J. Wong
98c1a7593f ext4: enable journal checksum when metadata checksum feature enabled
If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-30 10:53:16 -04:00
Linus Torvalds
c2661b8060 A large number of cleanups and bug fixes, with some (minor) journal
optimizations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
 Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
 1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
 8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
 IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
 ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
 IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
 M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
 0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
 fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
 dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
 9g4GKAXAyh15PeBTJ5K/
 =vWic
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A large number of cleanups and bug fixes, with some (minor) journal
  optimizations"

[ This got sent to me before -rc1, but was stuck in my spam folder.   - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
  ext4: check s_chksum_driver when looking for bg csum presence
  ext4: move error report out of atomic context in ext4_init_block_bitmap()
  ext4: Replace open coded mdata csum feature to helper function
  ext4: delete useless comments about ext4_move_extents
  ext4: fix reservation overflow in ext4_da_write_begin
  ext4: add ext4_iget_normal() which is to be used for dir tree lookups
  ext4: don't orphan or truncate the boot loader inode
  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
  ext4: optimize block allocation on grow indepth
  ext4: get rid of code duplication
  ext4: fix over-defensive complaint after journal abort
  ext4: fix return value of ext4_do_update_inode
  ext4: fix mmap data corruption when blocksize < pagesize
  vfs: fix data corruption when blocksize < pagesize for mmaped data
  ext4: fold ext4_nojournal_sops into ext4_sops
  ext4: support freezing ext2 (nojournal) file systems
  ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
  ext4: don't check quota format when there are no quota files
  jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
  jbd2: avoid pointless scanning of checkpoint lists
  ...
2014-10-20 09:50:11 -07:00
Darrick J. Wong
813d32f913 ext4: check s_chksum_driver when looking for bg csum presence
Convert the ext4_has_group_desc_csum predicate to look for a checksum
driver instead of the metadata_csum flag and change the bg checksum
calculation function to look for GDT_CSUM before taking the crc16
path.

Without this patch, if we mount with ^uninit_bg,^metadata_csum and
later metadata_csum gets turned on by accident, the block group
checksum functions will incorrectly assume that checksumming is
enabled (metadata_csum) but that crc16 should be used
(!s_chksum_driver).  This is totally wrong, so fix the predicate
and the checksum formula selection.

(Granted, if the metadata_csum feature bit gets enabled on a live FS
then something underhanded is going on, but we could at least avoid
writing garbage into the on-disk fields.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-14 02:35:49 -04:00
Dmitry Monakhov
9aa5d32ba2 ext4: Replace open coded mdata csum feature to helper function
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized  csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum  $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:36:16 -04:00
Theodore Ts'o
f4bb298102 ext4: add ext4_iget_normal() which is to be used for dir tree lookups
If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes.  This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:56:00 -04:00
Tejun Heo
d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Theodore Ts'o
f6e63f9080 ext4: fold ext4_nojournal_sops into ext4_sops
There's no longer any need to have a separate set of super_operations
for nojournal mode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 17:12:30 -04:00
Theodore Ts'o
bb04457658 ext4: support freezing ext2 (nojournal) file systems
Through an oversight, when we added nojournal support to ext4, we
didn't add support to allow file system freezing.  This is relatively
easy to add, so let's do it.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dexuan Cui <decui@microsoft.com>
2014-09-18 17:12:02 -04:00
Theodore Ts'o
bda3253043 ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
This allows us to eliminate duplicate code, and eventually allow us to
also fold ext4_sops and ext4_nojournal_sops together.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 16:12:37 -04:00
Jan Kara
279bf6d390 ext4: don't check quota format when there are no quota files
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 01:12:15 -04:00
Dmitry Monakhov
844749764b ext4: explicitly inform user about orphan list cleanup
Production fs likely compiled/mounted w/o jbd debugging, so orphan
list clearing will be silent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16 14:52:03 -04:00
Darrick J. Wong
df4763bea5 ext4: validate external journal superblock checksum
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:44:36 -04:00
Darrick J. Wong
feb8c6d3dd jbd2: fix journal checksum feature flag handling
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want.  Rearrange the error
checking so that newer flags get complained about first.

Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:38:21 -04:00
Lukas Czerner
c7f725435a ext4: provide separate operations for sysfs feature files
Currently sysfs feature files uses ext4_attr_ops as the file operations
to show/store data. However the feature files is not supposed to contain
any data at all, the sole existence of the file means that the module
support the feature. Moreover, none of the sysfs feature attributes
actually register show/store functions so that would not be a problem.

However if a sysfs feature attribute register a show or store function
we might be in trouble because the kobject in this case is _not_ embedded
in the ext4_sb_info structure as ext4_attr_show/store expect.

So just to be safe, provide separate empty sysfs_ops to use in
ext4_feat_ktype. This might safe us from potential problems in the
future. As a bonus we can "store" something more descriptive than
nothing in the files, so let it contain "enabled" to make it clear that
the feature is really present in the module.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:27:58 -04:00
Lukas Czerner
52c198c682 ext4: add sysfs entry showing whether the fs contains errors
Currently there is no easy way to tell that the mounted file system
contains errors other than checking for log messages, or reading the
information directly from superblock.

This patch adds new sysfs entries:

errors_count		(number of fs errors we encounter)
first_error_time	(unix timestamp for the first error we see)
last_error_time		(unix timestamp for the last error we see)

If the file system is not marked as containing errors then any of the
file will return 0. Otherwise it will contain valid information. More
details about the errors should as always be found in the logs.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:18:13 -04:00
Jan Kara
a2d4a646e6 ext4: don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ext4 supports. Although
ext4 will support project quotas, use ext4 private definition for
consistency with other filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:15:15 -04:00
Tejun Heo
908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Gioh Kim
a8ac900b81 ext4: use non-movable memory for the ext4 superblock
Since the ext4 superblock is not released until the file system is
unmounted, allocate the buffer cache entry for the ext4 superblock out
of the non-moveable are to allow page migrations and thus CMA
allocations to more easily succeed if the CMA area is limited.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:36:15 -04:00
Zheng Liu
eb68d0e2fc ext4: track extent status tree shrinker delay statictics
This commit adds some statictics in extent status tree shrinker.  The
purpose to add these is that we want to collect more details when we
encounter a stall caused by extent status tree shrinker.  Here we count
the following statictics:
  stats:
    the number of all objects on all extent status trees
    the number of reclaimable objects on lru list
    cache hits/misses
    the last sorted interval
    the number of inodes on lru list
  average:
    scan time for shrinking some objects
    the number of shrunk objects
  maximum:
    the inode that has max nr. of objects on lru list
    the maximum scan time for shrinking some objects

The output looks like below:
  $ cat /proc/fs/ext4/sda1/es_shrinker_info
  stats:
    28228 objects
    6341 reclaimable objects
    5281/631 cache hits/misses
    586 ms last sorted interval
    250 inodes on lru list
  average:
    153 us scan time
    128 shrunk objects
  maximum:
    255 inode (255 objects, 198 reclaimable)
    125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
    586ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
    250 inodes on lru list
  ...
  maximum:
    255 inode (255 objects, 198 reclaimable)
    0 us max scan time

Meanwhile in this commit a new trace point is defined to print some
details in __ext4_es_shrink().

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:26:49 -04:00
Darrick J. Wong
45f1a9c3f6 ext4: enable block_validity by default
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 21:34:09 -04:00
Theodore Ts'o
1c2150283c ext4: convert ext4_bread() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:15 -04:00
Darrick J. Wong
db9ee22036 jbd2: fix descriptor block size handling errors with journal_csum
It turns out that there are some serious problems with the on-disk
format of journal checksum v2.  The foremost is that the function to
calculate descriptor tag size returns sizes that are too big.  This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.

Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.

Add a few function helpers so we don't have to open-code quite so
many pieces.

Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Theodore Ts'o
d5e03cbb0c ext4: rearrange initialization to fix EXT4FS_DEBUG
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
it.)

Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.

This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.

(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15 06:01:38 -04:00
Theodore Ts'o
f9ae9cf5d7 ext4: revert commit which was causing fs corruption after journal replays
Commit 007649375f ("ext4: initialize multi-block allocator before
checking block descriptors") causes the block group descriptor's count
of the number of free blocks to become inconsistent with the number of
free blocks in the allocation bitmap.  This is a harmless form of fs
corruption, but it causes the kernel to potentially remount the file
system read-only, or to panic, depending on the file systems's error
behavior.

Thanks to Eric Whitney for his tireless work to reproduce and to find
the guilty commit.

Fixes: 007649375f ("ext4: initialize multi-block allocator before checking block descriptors"

Cc: stable@vger.kernel.org  # 3.15
Reported-by: David Jander <david@protonic.nl>
Reported-by: Matteo Croce <technoboy85@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Suggested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-11 13:55:40 -04:00
Eric Sandeen
5dd214248f ext4: disable synchronous transaction batching if max_batch_time==0
The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:18:22 -04:00
Theodore Ts'o
ae0f78de2c ext4: clarify error count warning messages
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-07-05 18:40:52 -04:00
liang xie
5d60125530 ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 22:06:43 -04:00
Lukas Czerner
c8b459f492 ext4: remove unnecessary double parentheses
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 12:55:07 -04:00
Stephen Hemminger
c197855ea1 ext4: make local functions static
I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:50:23 -04:00
Darrick J. Wong
bd63f6b0cd ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
On a filesystem with a 1k block size, the group descriptors live in
block 2, not block 1.  If the filesystem has bigalloc,meta_bg set,
however, the calculation of the group descriptor table location does
not take this into account and returns the wrong block number.  Fix
the calculation to return the correct value for this case.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 10:06:27 -04:00
Theodore Ts'o
202ee5df38 ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
To avoid potential data races, use a spinlock which protects the raw
(on-disk) inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-04-21 14:37:55 -04:00
Lukas Czerner
556615dcbf ext4: rename uninitialized extents to unwritten
Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.

The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.

This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.

This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-20 23:45:47 -04:00
Azat Khuzhin
007649375f ext4: initialize multi-block allocator before checking block descriptors
With EXT4FS_DEBUG ext4_count_free_clusters() will call
ext4_read_block_bitmap() without s_group_info initialized, so we need to
initialize multi-block allocator before.

And dependencies that must be solved, to allow this:
- multi-block allocator needs in group descriptors
- need to install s_op before initializing multi-block allocator,
  because in ext4_mb_init_backend() new inode is created.
- initialize number of group desc blocks (s_gdb_count) otherwise
  number of clusters returned by ext4_free_clusters_after_init() is not correct.
  (see ext4_bg_num_gdb_nometa())

Here is the stack backtrace:

(gdb) bt
 #0  ext4_get_group_info (group=0, sb=0xffff880079a10000) at ext4.h:2430
 #1  ext4_validate_block_bitmap (sb=sb@entry=0xffff880079a10000,
     desc=desc@entry=0xffff880056510000, block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:358
 #2  0xffffffff81232202 in ext4_wait_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0,
     bh=bh@entry=0xffff88007bf2b2d8) at balloc.c:476
 #3  0xffffffff81232eaf in ext4_read_block_bitmap (sb=sb@entry=0xffff880079a10000,
     block_group=block_group@entry=0) at balloc.c:489
 #4  0xffffffff81232fc0 in ext4_count_free_clusters (sb=sb@entry=0xffff880079a10000) at balloc.c:665
 #5  0xffffffff81259ffa in ext4_check_descriptors (first_not_zeroed=<synthetic pointer>,
     sb=0xffff880079a10000) at super.c:2143
 #6  ext4_fill_super (sb=sb@entry=0xffff880079a10000, data=<optimized out>,
     data@entry=0x0 <irq_stack_union>, silent=silent@entry=0) at super.c:3851
     ...

Signed-off-by: Azat Khuzhin <a3at.mail@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-07 10:54:20 -04:00
Theodore Ts'o
ed3654eb98 ext4: optimize Hurd tests when reading/writing inodes
Set a in-memory superblock flag to indicate whether the file system is
designed to support the Hurd.

Also, add a sanity check to make sure the 64-bit feature is not set
for Hurd file systems, since i_file_acl_high conflicts with a
Hurd-specific field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-24 14:09:06 -04:00
T Makphaibulchoke
9c191f701c ext4: each filesystem creates and uses its own mb_cache
This patch adds new interfaces to create and destory cache,
ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and remove
the cache creation and destory calls from ex4_init_xattr() and
ext4_exitxattr() in fs/ext4/xattr.c.

fs/ext4/super.c has been changed so that when a filesystem is mounted
a cache is allocated and attched to its ext4_sb_info structure.

fs/mbcache.c has been changed so that only one slab allocator is
allocated and used by all mbcache structures.

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
2014-03-18 19:24:49 -04:00
Theodore Ts'o
38c03b3439 ext4: only call sync_filesystm() when remounting read-only
This is the only time it is required for ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-03-13 22:49:42 -04:00
Theodore Ts'o
02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Fabian Frederick
e67bc2b359 ext4: Add __init marking to init_inodecache
init_inodecache is only called by __init init_ext4_fs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-17 20:34:53 -05:00
Theodore Ts'o
2330141097 ext4: don't try to modify s_flags if the the file system is read-only
If an ext4 file system is created by some tool other than mke2fs
(perhaps by someone who has a pathalogical fear of the GPL) that
doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
and that file system is then mounted read-only, don't try to modify
the s_flags field.  Otherwise, if dm_verity is in use, the superblock
will change, causing an dm_verity failure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-02-12 12:16:04 -05:00
Jan Kara
30fac0f75d ext4: Do not reserve clusters when fs doesn't support extents
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.

This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 21:11:59 -05:00
Al Viro
9105bb149b ext4: fix del_timer() misuse for ->s_err_report
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU.  Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences.  AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 20:52:31 -05:00
Theodore Ts'o
dd1f723bf5 ext4: use prandom_u32() instead of get_random_bytes()
Many of the uses of get_random_bytes() do not actually need
cryptographically secure random numbers.  Replace those uses with a
call to prandom_u32(), which is faster and which doesn't consume
entropy from the /dev/random driver.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-11-08 00:14:53 -05:00
Theodore Ts'o
efbed4dc58 ext4: add ratelimiting to ext4 messages
In the case of a storage device that suddenly disappears, or in the
case of significant file system corruption, this can result in a huge
flood of messages being sent to the console.  This can overflow the
file system containing /var/log/messages, or if a serial console is
configured, this can slow down the system so much that a hardware
watchdog can end up triggering forcing a system reboot.

Google-Bug-Id: 7258357

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-10-17 21:11:01 -04:00
Linus Torvalds
2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Linus Torvalds
45d9a2220f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "Unfortunately, this merge window it'll have a be a lot of small piles -
  my fault, actually, for not keeping #for-next in anything that would
  resemble a sane shape ;-/

  This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
  cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
  components) + several long-standing patches from various folks.

  There definitely will be a lot more (starting with Miklos'
  check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  direct-io: Handle O_(D)SYNC AIO
  direct-io: Implement generic deferred AIO completions
  add formats for dentry/file pathnames
  kvm eventfd: switch to fdget
  powerpc kvm: use fdget
  switch fchmod() to fdget
  switch epoll_ctl() to fdget
  switch copy_module_from_fd() to fdget
  git simplify nilfs check for busy subtree
  ibmasmfs: don't bother passing superblock when not needed
  don't pass superblock to hypfs_{mkdir,create*}
  don't pass superblock to hypfs_diag_create_files
  don't pass superblock to hypfs_vm_create_files()
  oprofile: get rid of pointless forward declarations of struct super_block
  oprofilefs_create_...() do not need superblock argument
  oprofilefs_mkdir() doesn't need superblock argument
  don't bother with passing superblock to oprofile_create_stats_files()
  oprofile: don't bother with passing superblock to ->create_files()
  don't bother passing sb to oprofile_create_files()
  coh901318: don't open-code simple_read_from_buffer()
  ...
2013-09-05 08:50:26 -07:00
Christoph Hellwig
7b7a8665ed direct-io: Implement generic deferred AIO completions
Add support to the core direct-io code to defer AIO completions to user
context using a workqueue.  This replaces opencoded and less efficient
code in XFS and ext4 (we save a memory allocation for each direct IO)
and will be needed to properly support O_(D)SYNC for AIO.

The communication between the filesystem and the direct I/O code requires
a new buffer head flag, which is a bit ugly but not avoidable until the
direct I/O code stops abusing the buffer_head structure for communicating
with the filesystems.

Currently this creates a per-superblock unbound workqueue for these
completions, which is taken from an earlier patch by Jan Kara.  I'm
not really convinced about this use and would prefer a "normal" global
workqueue with a high concurrency limit, but this needs further discussion.

JK: Fixed ext4 part, dynamic allocation of the workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-04 09:23:46 -04:00
Eric Sandeen
ad4eec6135 ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.

The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.

Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.

# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

and it'll mount even if the devnum has changed, as shown here:

# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1

Change the journal device number:

# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile 

And today it will fail:

# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal

But with this new mount option, we can specify the new path:

# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#

(which does update the encoded device number, incidentally):

# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device:	          0x0701

But best of all we can just always mount by journal-path, and
it'll always work:

# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#

So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 19:05:07 -04:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Piotr Sarna
6ae6514b33 ext4: fix mount/remount error messages for incompatible mount options
Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored.  Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.

Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:02:24 -04:00
Theodore Ts'o
59d9fa5c2e ext4: allow the mount options nodelalloc and data=journal
Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously.  This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.

Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:01:24 -04:00
Eric Sandeen
dd12ed144e ext4: destroy ext4_es_cachep on module unload
Without this, module can't be reloaded.

[  500.521980] kmem_cache_sanity_check (ext4_extent_status): Cache name already exists.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org  # v3.8+
2013-07-26 15:21:11 -04:00
Theodore Ts'o
ad065dd016 ext4: don't show usrquota/grpquota twice in /proc/mounts
We now print mount options in a generic fashion in
ext4_show_options(), so we shouldn't be explicitly printing the
{usr,grp}quota options in ext4_show_quota_options().

Without this patch, /proc/mounts can look like this:

 /dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
                                      ^^^^^^^^              ^^^^^^^^

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-11 18:54:37 -04:00
Theodore Ts'o
960fd856fd ext4: fix ext4_get_group_number()
The function ext4_get_group_number() was introduced as an optimization
in commit bd86298e60.  Unfortunately, this commit incorrectly
calculate the group number for file systems with a 1k block size (when
s_first_data_block is 1 instead of zero).  This could cause the
following kernel BUG:

[  568.877799] ------------[ cut here ]------------
[  568.877833] kernel BUG at fs/ext4/mballoc.c:3728!
[  568.877840] Oops: Exception in kernel mode, sig: 5 [#1]
[  568.877845] SMP NR_CPUS=32 NUMA pSeries
[  568.877852] Modules linked in: binfmt_misc
[  568.877861] CPU: 1 PID: 3516 Comm: fs_mark Not tainted 3.10.0-03216-g7c6809f-dirty #1
[  568.877867] task: c0000001fb0b8000 ti: c0000001fa954000 task.ti: c0000001fa954000
[  568.877873] NIP: c0000000002f42a4 LR: c0000000002f4274 CTR: c000000000317ef8
[  568.877879] REGS: c0000001fa956ed0 TRAP: 0700   Not tainted  (3.10.0-03216-g7c6809f-dirty)
[  568.877884] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 24000428  XER: 00000000
[  568.877902] SOFTE: 1
[  568.877905] CFAR: c0000000002b5464
[  568.877908]
GPR00: 0000000000000001 c0000001fa957150 c000000000c6a408 c0000001fb588000
GPR04: 0000000000003fff c0000001fa9571c0 c0000001fa9571c4 000138098c50625f
GPR08: 1301200000000000 0000000000000002 0000000000000001 0000000000000000
GPR12: 0000000024000422 c00000000f33a300 0000000000008000 c0000001fa9577f0
GPR16: c0000001fb7d0100 c000000000c29190 c0000000007f46e8 c000000000a14672
GPR20: 0000000000000001 0000000000000008 ffffffffffffffff 0000000000000000
GPR24: 0000000000000100 c0000001fa957278 c0000001fdb2bc78 c0000001fa957288
GPR28: 0000000000100100 c0000001fa957288 c0000001fb588000 c0000001fdb2bd10
[  568.877993] NIP [c0000000002f42a4] .ext4_mb_release_group_pa+0xec/0x1c0
[  568.877999] LR [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0
[  568.878004] Call Trace:
[  568.878008] [c0000001fa957150] [c0000000002f4274] .ext4_mb_release_group_pa+0xbc/0x1c0 (unreliable)
[  568.878017] [c0000001fa957200] [c0000000002fb070] .ext4_mb_discard_lg_preallocations+0x394/0x444
[  568.878025] [c0000001fa957340] [c0000000002fb45c] .ext4_mb_release_context+0x33c/0x734
[  568.878032] [c0000001fa957440] [c0000000002fbcf8] .ext4_mb_new_blocks+0x4a4/0x5f4
[  568.878039] [c0000001fa957510] [c0000000002ef56c] .ext4_ext_map_blocks+0xc28/0x1178
[  568.878047] [c0000001fa957640] [c0000000002c1a94] .ext4_map_blocks+0x2c8/0x490
[  568.878054] [c0000001fa957730] [c0000000002c536c] .ext4_writepages+0x738/0xc60
[  568.878062] [c0000001fa957950] [c000000000168a78] .do_writepages+0x5c/0x80
[  568.878069] [c0000001fa9579d0] [c00000000015d1c4] .__filemap_fdatawrite_range+0x88/0xb0
[  568.878078] [c0000001fa957aa0] [c00000000015d23c] .filemap_write_and_wait_range+0x50/0xfc
[  568.878085] [c0000001fa957b30] [c0000000002b8edc] .ext4_sync_file+0x220/0x3c4
[  568.878092] [c0000001fa957be0] [c0000000001f849c] .vfs_fsync_range+0x64/0x80
[  568.878098] [c0000001fa957c70] [c0000000001f84f0] .vfs_fsync+0x38/0x4c
[  568.878105] [c0000001fa957d00] [c0000000001f87f4] .do_fsync+0x54/0x90
[  568.878111] [c0000001fa957db0] [c0000000001f8894] .SyS_fsync+0x28/0x3c
[  568.878120] [c0000001fa957e30] [c000000000009c88] syscall_exit+0x0/0x7c
[  568.878125] Instruction dump:
[  568.878130] 60000000 813d0034 81610070 38000000 7f8b4800 419e001c 813f007c 7d2bfe70
[  568.878144] 7d604a78 7c005850 54000ffe 7c0007b4 <0b000000> e8a10076 e87f0090 7fa4eb78
[  568.878160] ---[ end trace 594d911d9654770b ]---

In addition fix the STD_GROUP optimization so that it works for
bigalloc file systems as well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Li Zhong <lizhongfs@gmail.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org  # 3.10
2013-07-05 23:11:16 -04:00
Joe Perches
e7c96e8e47 ext4: reduce object size when !CONFIG_PRINTK
Reduce the object size ~10% could be useful for embedded systems.

Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
arguments, passing " " to functions when !CONFIG_PRINTK and still
verifying format and arguments with no_printk.

$ size fs/ext4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
 264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old

    $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
    # CONFIG_PRINTK is not set
    CONFIG_EXT4_FS=y
    CONFIG_EXT4_USE_FOR_EXT23=y
    CONFIG_EXT4_FS_POSIX_ACL=y
    # CONFIG_EXT4_FS_SECURITY is not set
    # CONFIG_EXT4_DEBUG is not set

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Zheng Liu
d3922a777f ext4: improve extent cache shrink mechanism to avoid to burn CPU time
Now we maintain an proper in-order LRU list in ext4 to reclaim entries
from extent status tree when we are under heavy memory pressure.  For
keeping this order, a spin lock is used to protect this list.  But this
lock burns a lot of CPU time.  We can use the following steps to trigger
it.

  % cd /dev/shm
  % dd if=/dev/zero of=ext4-img bs=1M count=2k
  % mkfs.ext4 ext4-img
  % mount -t ext4 -o loop ext4-img /mnt
  % cd /mnt
  % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
  % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
  % perf record -a -g
  % perf report

This commit tries to fix this problem.  Now a new member called
i_touch_when is added into ext4_inode_info to record the last access
time for an inode.  Meanwhile we never need to keep a proper in-order
LRU list.  So this can avoid to burns some CPU time.  When we try to
reclaim some entries from extent status tree, we use list_sort() to get
a proper in-order list.  Then we traverse this list to discard some
entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
time of sorting this list.  When we traverse the list, we skip the inode
that is newer than this time, and move this inode to the tail of LRU
list.  When the head of the list is newer than s_es_last_sorted, we will
sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because
that means that all extents in extent status tree have been reclaimed.

Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
changed to save a local variable in these functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:37 -04:00
Jon Ernst
03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Dmitry Monakhov
4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov
06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Jan Kara
2e8fa54e3b ext4: split extent conversion lists to reserved & unreserved parts
Now that we have extent conversions with reserved transaction, we have
to prevent extent conversions without reserved transaction (from DIO
code) to block these (as that would effectively void any transaction
reservation we did).  So split lists, work items, and work queues to
reserved and unreserved parts.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 14:21:02 -04:00
Jan Kara
f2d50a65c9 ext4: deprecate max_writeback_mb_bump sysfs attribute
This attribute is now unused so deprecate it.  We still show the old
default value to keep some compatibility but we don't allow writing to
that attribute anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-04 12:51:16 -04:00
Paul Taysom
566370a2e5 ext4: suppress ext4 orphan messages on mount
Suppress the messages releating to processing the ext4 orphan list
("truncating inode" and "deleting unreferenced inode") unless the
debug option is on, since otherwise they end up taking up space in the
log that could be used for more useful information.

Tested by opening several files, unlinking them, then
crashing the system, rebooting the system and examining
/var/log/messages.

Addresses the problem described in http://crbug.com/220976

Signed-off-by: Paul Taysom <taysom@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-05-28 07:51:21 -04:00
Lukas Czerner
55f252c9f5 ext4: truncate_inode_pages() in orphan cleanup path
Currently we do not tell mm to zero out tail of the page before truncate
in orphan_cleanup(). This is ok, because the page should not be
uptodate, however this may eventually change and I might cause problems.

Call truncate_inode_pages() as precautionary measure. Thanks Jan Kara
for pointing this out.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2013-05-27 23:32:35 -04:00
Al Viro
4385bab128 make blkdev_put() return void
same story as with the previous patches - note that return
value of blkdev_close() is lost, since there's nowhere the
caller (__fput()) could return it to.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-07 02:16:31 -04:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Theodore Ts'o
9f203507ed ext4: mark all metadata I/O with REQ_META
As Dave Chinner pointed out at the 2013 LSF/MM workshop, it's
important that metadata I/O requests are marked as such to avoid
priority inversions caused by I/O bandwidth throttling.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-20 15:46:17 -04:00
Lukas Czerner
e1091b157c ext4: Use kstrtoul() instead of parse_strtoul()
In parse_strtoul() we're still using deprecated simple_strtoul().  Remove
parse_strtoul() altogether and replace it with kstrtoul()

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-11 23:37:19 -04:00
Theodore Ts'o
d6a771056b ext4: fix miscellaneous big endian warnings
None of these result in any bug, but they makes sparse complain.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 23:59:55 -04:00
Lukas Czerner
27dd438542 ext4: introduce reserved space
Currently in ENOSPC condition when writing into unwritten space, or
punching a hole, we might need to split the extent and grow extent tree.
However since we can not allocate any new metadata blocks we'll have to
zero out unwritten part of extent or punched out part of extent, or in
the worst case return ENOSPC even though use actually does not allocate
any space.

Also in delalloc path we do reserve metadata and data blocks for the
time we're going to write out, however metadata block reservation is
very tricky especially since we expect that logical connectivity implies
physical connectivity, however that might not be the case and hence we
might end up allocating more metadata blocks than previously reserved.
So in future, metadata reservation checks should be removed since we can
not assure that we do not under reserve.

And this is where reserved space comes into the picture. When mounting
the file system we slice off a little bit of the file system space (2%
or 4096 clusters, whichever is smaller) which can be then used for the
cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.

The number of reserved clusters can be set via sysfs, however it can
never be bigger than number of free clusters in the file system.

Note that this patch fixes the failure of xfstest 274 as expected.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-04-09 22:11:22 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Jan Kara
bcb1385096 ext4: fix deadlock with quota feature
We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
accounted even for quota files. Thus we could recurse back to quota code
when adding new blocks to quota file which can easily deadlock. Mark
hidden quota files properly.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 09:21:41 -04:00
Lukas Czerner
689110098c ext4: make ext4_block_in_group() much more efficient
Currently in when getting the block group number for a particular
block in ext4_block_in_group() we're using
ext4_get_group_no_and_offset() which uses do_div() to get the block
group and the remainer which is offset within the group.

We don't need all of that in ext4_block_in_group() as we only need to
figure out the group number.

This commit changes ext4_block_in_group() to calculate group number
directly. This shows as a big improvement with regards to cpu
utilization. Measuring fallocate -l 15T on fresh file system with perf
showed that 23% of cpu time was spend in the
ext4_get_group_no_and_offset(). With this change it completely
disappears from the list only bumping the occurrence of
ext4_init_block_bitmap() which is the biggest user of
ext4_block_in_group() by 4%. As the result of this change on my system
the fallocate call was approx. 10% faster.

However since there is '-g' option in mkfs which allow us setting
different groups size (mostly for developers) I've introduced new per
file system flag whether we have a standard block group size or
not. The flag is used to determine whether we can use the bit shift
optimization or not.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 22:12:52 -04:00
Dmitry Monakhov
a75ae78f08 ext4: unregister es_shrinker if mount failed
Otherwise destroyed ext_sb_info will be part of global shinker list
and result in the following OOPS:

JBD2: corrupted journal superblock
JBD2: recovery failed
EXT4-fs (dm-2): error loading journal
general protection fault: 0000 [#1] SMP
Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
mod
CPU 1
Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
Stack:
ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
Call Trace:
[<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
[<ffffffff81249381>] mount_bdev+0x331/0x340
[<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
[<ffffffff81362035>] ext4_mount+0x15/0x20
[<ffffffff8124869a>] mount_fs+0x9a/0x2e0
[<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
[<ffffffff81279c02>] do_new_mount+0x172/0x2e0
[<ffffffff8127aa56>] do_mount+0x376/0x380
[<ffffffff8127ab98>] sys_mount+0x138/0x150
[<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
8
RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
RSP <ffff88011d5cbcd8>

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-04-03 22:10:52 -04:00
Dmitry Monakhov
5d3ee20855 ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
2013-04-03 22:08:52 -04:00
Linus Torvalds
d3c926264a Fix a number of regression and other bugs in ext4, most of which were
relatively obscure cornercases or races that were found using
 regression tests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRSm5lAAoJENNvdpvBGATwZW8QAN7jMn7IaVCTXXblqgqba4uN
 KvLGRgK7R/n1rIhdHoxJHumwRQLTppVzjDCc8ePnWhdypzMZNuzUvs+OoCFdkDsW
 qf3CmL/p/R1oSiSzzFIs/7wGp7xBZ0l0BWZMFWd9EUg9cqoMBDA6KzcMF95fOtas
 KsjRL+BThacVldS7jyKFwE4BrpXd0Z5V9qZ6wjQPPoBx8sXF4iYA+CZVo5FUKBs8
 6I82LS1/PIYCe3IOSpCgyKXQqRzAYJANv1ndken5wW8jWT2R58e360OwZEVcpIN9
 /caov+F5OKfk4iOGq3b+vwRplNhAI2S6C4vhMbmS2GPWE8Fnr8gubyqNAIIs5R/y
 3zYHdqZESfuEF7K3QoAepiJhi3YIoRxXC1FxD7uxx7VBRhW2w8Ij5hlXhuSoh24M
 MUiXgCeIxQb+ZfUx0OHV++LSOHVccU4y7Z0X+LpXQa6tEMBuSgK6yCKsGkyr8APN
 gPMupTptgyUE3tFaCjqc7QKtmoeRAMSvzfqEyV6DlblIOe+3f/RJzRO222Xc4kxq
 D9t2tOuPoXsR+ivtS5pEcrZkE4Y2hkJbJzb7XXvfoETixYsuX6VkiPK/D68S9eRe
 VelqTM2lHPJi/3Wkle0p4pzWpEq70D8qZVp4TKLHMJCTQKpwUfopm5lvln87lc7w
 4JDORIx/ed1u8MMTJlmG
 =X3vc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a number of regression and other bugs in ext4, most of which were
  relatively obscure cornercases or races that were found using
  regression tests."

* tag 'ext4_for_linue' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits)
  ext4: fix data=journal fast mount/umount hang
  ext4: fix ext4_evict_inode() racing against workqueue processing code
  ext4: fix memory leakage in mext_check_coverage
  ext4: use s_extent_max_zeroout_kb value as number of kb
  ext4: use atomic64_t for the per-flexbg free_clusters count
  jbd2: fix use after free in jbd2_journal_dirty_metadata()
  ext4: reserve metadata block for every delayed write
  ext4: update reserved space after the 'correction'
  ext4: do not use yield()
  ext4: remove unused variable in ext4_free_blocks()
  ext4: fix WARN_ON from ext4_releasepage()
  ext4: fix the wrong number of the allocated blocks in ext4_split_extent()
  ext4: update extent status tree after an extent is zeroed out
  ext4: fix wrong m_len value after unwritten extent conversion
  ext4: add self-testing infrastructure to do a sanity check
  ext4: avoid a potential overflow in ext4_es_can_be_merged()
  ext4: invalidate extent status tree during extent migration
  ext4: remove unnecessary wait for extent conversion in ext4_fallocate()
  ext4: add warning to ext4_convert_unwritten_extents_endio
  ext4: disable merging of uninitialized extents
  ...
2013-03-21 17:56:10 -07:00
Eric W. Biederman
fa7614ddd6 fs: Readd the fs module aliases.
I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module.  It turns out I was wrong.  At least mkinitcpio
in Arch linux uses these aliases.

So readd the preexising aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does.  So at some point we may be delete these aliases without
problems.  However that day is not today.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-12 18:55:21 -07:00
Theodore Ts'o
90ba983f68 ext4: use atomic64_t for the per-flexbg free_clusters count
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-11 23:39:59 -04:00
Eric W. Biederman
7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
Linus Torvalds
a7c1120d2d Various bug fixes for ext4. The most important is a fix for the new
extent cache's slab shrinker which can cause significant, user-visible
 pauses when the system is under memory pressure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRMpM3AAoJENNvdpvBGATwiXgP/3eSg3C+M0ZUeL6lH3aXRxQO
 PHxUL/Di5cfFs3GX4DksVzsD1KkTIz8B424AhdahrrgGh1jTx4/J23OrEdu9nK24
 JGU5hmowoCyG8PZG1kGMbX6EYcblYTx+O2tX/RInnRExm9ajkfxb0S1g0Vl340qw
 58WTSWfl2+J/3RnJ9TYX/qNVeCJdxLH3GkpFbvQbLGyylfM9hsUD5MZMAR1bpOJF
 U2vNdK3n65W0AtKhLo7TYnoJ4ll2PoFRvffS0rqhEpIAcRxpVsNThFJLBcOQ1a79
 6cCN5uhrJOlL5jLN/fYCViU1+03y7itCMJmtSpuyV8DtUGjf4r1tzlvWGeiSmpB9
 NprZ/MgO1ROnzO/gzPM2s4nWWeGZiGaf7vMDyScIDtqF1ckfHN17jqazuSJcybN8
 U83O9+KyhHkvr/+zqlySXiBX2MUSUdSE37CsMC7R+mAz7C46yjXEPuG8pLkLCWiG
 gjMD30D1f6+h+K646WN497+Crxl1CurEH+ON7k158cNvVNlX1FfFHUprRHeNUXkV
 tEKjiCUCf5WjNeFEc93nC/nDi4OIISD25N9LyHzp2CcV/XXRjpsrNPBFDAZjwgiK
 YVUQIwocVUVlRaACzrM9sDFtSELqNzy/GLuERITu1Mb2R4sMXIyvvJkjc+EuQS0F
 XVQ3BU5ypWyxJGrSGCPd
 =+vcC
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most important is a fix for the new
  extent cache's slab shrinker which can cause significant, user-visible
  pauses when the system is under memory pressure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  ext4: use percpu counter for extent cache count
  ext4: optimize ext4_es_shrink()
2013-03-02 19:33:21 -08:00
Jan Kara
9b2ff35753 ext4: enable quotas before orphan cleanup
When using quota feature we need to enable quotas before orphan cleanup
so that changes happening during it are properly reflected in quota
accounting.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 18:22:38 -05:00
Jan Kara
262b4662f4 ext4: don't allow quota mount options when quota feature enabled
So far we silently ignored when quota mount options were set while quota
feature was enabled.  But this can create confusion in userspace when
mount options are set but silently ignored and also creates opportunities
for bugs when we don't properly test all quota types.  Actually
ext4_mark_dquot_dirty() forgets to test for quota feature so it was
dependent on journaled quota options being set.  OTOH ext4_orphan_cleanup()
tries to enable journaled quota when quota options are specified which is
wrong when quota feature is enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 17:57:08 -05:00
Lukas Czerner
810da240f2 ext4: convert number of blocks to clusters properly
We're using macro EXT4_B2C() to convert number of blocks to number of
clusters for bigalloc file systems.  However, we should be using
EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-02 17:18:58 -05:00
Wei Yongjun
3e36a16375 ext4: fix possible memory leak in ext4_remount()
'orig_data' is malloced in ext4_remount() and should be freed
before leaving from the error handling cases, otherwise it will
cause memory leak.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-02 17:13:55 -05:00
Theodore Ts'o
1ac6466f25 ext4: use percpu counter for extent cache count
Use a percpu counter rather than atomic types for shrinker accounting.
There's no need for ultimate accuracy in the shrinker, so this
should come a little more cheaply.  The percpu struct is somewhat
large, but there was a big gap before the cache-aligned
s_es_lru_lock anyway, and it fits nicely in there.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-02 10:27:46 -05:00
Linus Torvalds
d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Zheng Liu
74cd15cd02 ext4: reclaim extents from extent status tree
Although extent status is loaded on-demand, we also need to reclaim
extent from the tree when we are under a heavy memory pressure because
in some cases fragmented extent tree causes status tree costs too much
memory.

Here we maintain a lru list in super_block.  When the extent status of
an inode is accessed and changed, this inode will be move to the tail
of the list.  The inode will be dropped from this list when it is
cleared.  In the inode, a counter is added to count the number of
cached objects in extent status tree.  Here only written/unwritten/hole
extent is counted because delayed extent doesn't be reclaimed due to
fiemap, bigalloc and seek_data/hole need it.  The counter will be
increased as a new extent is allocated, and it will be decreased as a
extent is freed.

In this commit we use normal shrinker framework to reclaim memory from
the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
to count the number of reclaimable extents.  ext4_es_shrink() tries to
reclaim written/unwritten/hole extents from extent status tree.  The
inode that has been shrunk is moved to the tail of lru list.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:32:55 -05:00
Zheng Liu
69eb33dc24 ext4: remove single extent cache
Single extent cache could be removed because we have extent status tree
as a extent cache, and it would be better.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
2013-02-18 00:31:07 -05:00
Theodore Ts'o
9924a92a8c ext4: pass context information to jbd2__journal_start()
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.

The recommended way for finding the longer-running handles is:

   T=/sys/kernel/debug/tracing
   EVENT=$T/events/jbd2/jbd2_handle_stats
   echo "interval > 5" > $EVENT/filter
   echo 1 > $EVENT/enable

   ./run-my-fs-benchmark

   cat $T/trace > /tmp/problem-handles

This will list handles that were active for longer than 20ms.  Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation.  Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:

postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
   tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
   dirtied_blocks 0

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 21:59:22 -05:00
Theodore Ts'o
722887ddc8 ext4: move the jbd2 wrapper functions out of super.c
Move the jbd2 wrapper functions which start and stop handles out of
super.c, where they don't really logically belong, and into
ext4_jbd2.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-08 13:00:31 -05:00
Theodore Ts'o
8dc0aa8cf0 ext4: check incompatible mount options while mounting ext2/3
Check for incompatible mount options when using the ext4 file system
driver to mount ext2 or ext3 file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:38:39 -05:00
Jan Kara
e33e60eaed ext4: print error when argument of inode_readahead_blk is invalid
If argument of inode_readahead_blk is too big, we just bail out
without printing any error. Fix this since it could confuse users.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:14:31 -05:00
Jan Kara
5f3633e36b ext4: make mount option parsing loop more logical
The loop looking for correct mount option entry is more logical if it is
written rewritten as an empty loop looking for correct option entry and then
code handling the option. It also saves one level of indentation for a lot of
code so we can join a couple of split lines.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 23:09:36 -05:00
Jan Kara
0efb3b2300 ext4: move several mount options to standard handling loop
Several mount option (resuid, resgid, journal_dev, journal_ioprio) are
currently handled before we enter standard option handling loop. I don't
see a reason for this so move them to normal handling loop to make things
more regular.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-02 22:52:19 -05:00
Guo Chao
b1deefc99e ext4: remove unnecessary NULL pointer check
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:41:02 -05:00
Jan Kara
84c17543ab ext4: move work from io_end to inode
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself.  This also allows for a small simplification in
processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:43:46 -05:00
Jan Kara
36ade451a5 ext4: Always use ext4_bio_write_page() for writeout
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:30:52 -05:00
Chen Gang
03dafb5f59 ext4: fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24 23:24:58 -05:00
Theodore Ts'o
72ba74508b ext4: release sysfs kobject when failing to enable quotas on mount
In addition, print the error returned from ext4_enable_quotas()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-01-24 23:24:54 -05:00
Theodore Ts'o
7f5118629f ext4: trigger the lazy inode table initialization after resize
After we have finished extending the file system, we need to trigger a
the lazy inode table thread to zero out the inode tables.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-13 08:41:45 -05:00
Theodore Ts'o
721e3eba21 ext4: lock i_mutex when truncating orphan inodes
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is
called without i_mutex being taken.  It had previously not been taken
during orphan cleanup since races weren't possible at that point in
the mount process, but as a result of this c278531d39, we will now see
a kernel WARN_ON in this case.  Take the i_mutex in
ext4_orphan_cleanup() to suppress this warning.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:48 -05:00
Michael Tokarev
d096ad0f79 ext4: do not try to write superblock on ro remount w/o journal
When a journal-less ext4 filesystem is mounted on a read-only block
device (blockdev --setro will do), each remount (for other, unrelated,
flags, like suid=>nosuid etc) results in a series of scary messages
from kernel telling about I/O errors on the device.

This is becauese of the following code ext4_remount():

       if (sbi->s_journal == NULL)
                ext4_commit_super(sb, 1);

at the end of remount procedure, which forces writing (flushing) of
a superblock regardless whenever it is dirty or not, if the filesystem
is readonly or not, and whenever the device itself is readonly or not.

We only need call ext4_commit_super when the file system had been
previously mounted read/write.

Thanks to Eric Sandeen for help in diagnosing this issue.

Signed-off-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-25 14:08:16 -05:00
Eric Sandeen
0875a2b448 ext4: include journal blocks in df overhead calcs
To more accurately calculate overhead for "bsd" style
df reporting, we should count the journal blocks as
overhead as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Eric Whitney <enwlinux@gmail.com>
2012-12-25 13:56:01 -05:00
Jan Kara
261cb20cb2 ext4: check dioread_nolock on remount
Currently we allow enabling dioread_nolock mount option on remount for
filesystems where blocksize < PAGE_CACHE_SIZE.  This isn't really
supported so fix the bug by moving the check for blocksize !=
PAGE_CACHE_SIZE into parse_options(). Change the original PAGE_SIZE to
PAGE_CACHE_SIZE along the way because that's what we are really
interested in.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2012-12-20 00:07:18 -05:00
Carlos Maiolino
9a4c801947 ext4: ensure Inode flags consistency are checked at build time
Flags being used by atomic operations in inode flags (e.g.
ext4_test_inode_flag(), should be consistent with that actually stored
in inodes, i.e.: EXT4_XXX_FL.

It ensures that this consistency is checked at build-time, not at
run-time.

Currently, the flags consistency are being checked at run-time, but,
there is no real reason to not do a build-time check instead of a
run-time check. The code is comparing macro defined values with enum
type variables, where both are constants, so, there is no problem in
comparing constants at build-time.

enum variables are treated as constants by the C compiler, according
to the C99 specs (see www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf 
sec. 6.2.5, item 16), so, there is no real problem in comparing an
enumeration type at build time

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 16:30:45 -05:00
Tao Ma
939da10844 ext4: Remove CONFIG_EXT4_FS_XATTR
Ted has sent out a RFC about removing this feature. Eric and Jan
confirmed that both RedHat and SUSE enable this feature in all their
product.  David also said that "As far as I know, it's enabled in all
Android kernels that use ext4."  So it seems OK for us.

And what's more, as inline data depends its implementation on xattr,
and to be frank, I don't run any test again inline data enabled while
xattr disabled.  So I think we should add inline data and remove this
config option in the same release.

[ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
  isn't much in the grand scheme of things.  Since no one seems to be
  testing this configuration except for some automated compile farms, on
  balance we are better removing this config option, and so that it is
  effectively always enabled. -- tytso ]

Cc: David Brown <davidb@codeaurora.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 16:30:43 -05:00
Guo Chao
6b280c913e ext4: remove redundant initialization in ext4_fill_super()
We use kzalloc() to allocate sbi, no need to zero its field.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-10 14:06:04 -05:00
Guo Chao
a789f49c92 ext4: remove redundant code in ext4_alloc_inode()
inode_init_always() will initialize inode->i_data.writeback_index
anyway, no need to do this in ext4_alloc_inode().

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2012-12-10 14:06:04 -05:00
Theodore Ts'o
4a092d7379 ext4: rationalize ext4_extents.h inclusion
Previously, ext4_extents.h was being included at the end of ext4.h,
which was bad for a number of reasons: (a) it was not being included
in the expected place, and (b) it caused the header to be included
multiple times.  There were #ifdef's to prevent this from causing any
problems, but it still was unnecessary.

By moving the function declarations that were in ext4_extents.h to
ext4.h, which is standard practice for where the function declarations
for the rest of ext4.h can be found, we can remove ext4_extents.h from
being included in ext4.h at all, and then we can only include
ext4_extents.h where it is needed in ext4's source files.

It should be possible to move a few more things into ext4.h, and
further reduce the number of source files that need to #include
ext4_extents.h, but that's a cleanup for another day.

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Reported-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-28 13:03:30 -05:00
Vahram Martirosyan
766f44d46a ext4: fixed potential NULL dereference in ext4_calculate_overhead()
The memset operation before check can cause a BUG if the memory
allocation failed.  Since we are using get_zeroed_age, there is no
need to use memset anyway.

Found by the Spruce system in cooperation with the KEDR Framework.

Signed-off-by: Vahram Martirosyan <vmartirosyan@linuxtesting.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-28 12:44:16 -05:00
Zheng Liu
51865fda28 ext4: let ext4 maintain extent status tree
This patch lets ext4 maintain extent status tree.

Currently it only tracks delay extent status in extent status tree.  When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree.  When a delay extent is written out or invalidated, it will
be removed from this tree.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:32 -05:00
Zheng Liu
9a26b66175 ext4: initialize extent status tree
Let ext4 initialize extent status tree of an inode.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 21:57:30 -05:00
Lukas Czerner
07aa2ea138 ext4: fix error handling in ext4_fill_super()
There are some places in ext4_fill_super() where we would not return
proper error code if something fails. The confusion is caused probably
due to the fact that we have two "kind-of" return variables 'ret'and
'err'.

'ret' is used to return error code from ext4_fill_super() where err is
used to store return values from other functions within ext4_fill_super().
However some places were missing the obligatory 'ret = err'. We could
put the assignment where it is missing, but we can have better "future
proof" solution. Or we could convert the code to use just one, but it
would require more rewrites.

This commit fixes the problem by returning value from 'err' variable if
it is set and 'ret' otherwise in error handling branch of the
ext4_fill_super(). The reasoning is that 'ret' value is often set to
default "-EINVAL" or explicit value, where 'err' is used to store
return value from other functions and should be otherwise zero.

https://bugzilla.kernel.org/show_bug.cgi?id=48431

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 15:16:54 -05:00
Lukas Czerner
79add3a3f7 ext4: notify when discard is not supported
Notify user when mounting the file system with -o discard option, but
the device does not support discard. Obviously we do not want to fail
the mount or disable the options, because the underlying device might
change in future even without file system remount.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 13:28:29 -05:00
Zhao Hongjiang
d339450cca ext4: get rid of redundant code in ext4_fill_super()
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 12:07:33 -05:00
Eric Sandeen
b72f78cb63 ext4: fix overhead calculations in ext4_stats, again
"overhead" was a write-only variable in this function after commit
952fc18e; we set it to 0 for minixdf, or to sbi->s_overhead if !minixdf,
but never read it again after that.

We need to use it, not sbi->s_overhead, when subtracting out overhead
for f_blocks, or we get the wrong answer for minixdf.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 10:33:36 -05:00
Linus Torvalds
e589db7a6a Various bug fixes for ext4. The most serious of them fixes a security
bug (CVE-2012-4508) which leads to stale data exposure when we have
 fallocate racing against writes to files undergoing delayed
 allocation.  We also have two fixes for the metadata checksum feature,
 the most serious of which can cause the superblock to have a invalid
 checksum after a power failure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQhcJVAAoJENNvdpvBGATwlc0QAJ5eRVSXoQ9DL/rpycZtWsiR
 1HofZCBbeVJq7JkazypYZPV+ncm2Nxljx61EBMpReDgx+hgJS8VD7BcjxblXT1gK
 cvIk7tYXS1E5++TWZzQd0v3GMDoMJsfzb0Ao6vefaVgqh07MKE9Zvx0L8JR4tsH1
 YlRs2/ZALFqqMficemXpDuWRRoBTEcYkvaW9PtUIpeuk9i71iSCDiHvi0mRy4dYe
 nLftjBOjcsIuK0I7DfUYrbZNQuYacFcFTM5foE6lhdT+tlL1/od2M00IpopSSjF8
 7RoqV351FqL74Stu71wDp+q9n8t8bR9gnvEuDisHXXH6PKIYo83vawvuDKtP05lt
 lF0l2nKy/QorQtUNRnrWiRshPNEplmKM1yfRXwzfq5CX4Mjox1PM9g1AfMT/Pzbq
 wNPMqtiaNnVzfcSP94MTExKMR5axFgeFsIwuCtPVNDAUEbEFuwKARIeFjCGxYYsr
 81rIKD4lgvJjaHChtE/NzslQysMmr6qiZa17s+NteCwNRJX7U4xN99SO2BXSW7lW
 xGb1ZjdESiBZGzsmuOXqAQw7KWIRS7bQ+s4dewEbqQJomPD3NQUKsRVo1wdWeUqI
 6SI1YsBYEDPiViiohsFyn1Zl3BHgIvypWMW/ChhKtYsmEJapTnJ3SR3a+eafFZ99
 HgCMCF1i0KN1tvMDrLRn
 =qL29
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most serious of them fixes a security
  bug (CVE-2012-4508) which leads to stale data exposure when we have
  fallocate racing against writes to files undergoing delayed
  allocation.  We also have two fixes for the metadata checksum feature,
  the most serious of which can cause the superblock to have a invalid
  checksum after a power failure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Avoid underflow in ext4_trim_fs()
  ext4: Checksum the block bitmap properly with bigalloc enabled
  ext4: fix undefined bit shift result in ext4_fill_flex_info
  ext4: fix metadata checksum calculation for the superblock
  ext4: race-condition protection for ext4_convert_unwritten_extents_endio
  ext4: serialize fallocate with ext4_convert_unwritten_extents
2012-10-23 08:48:26 +03:00
Lukas Czerner
76495ec1d4 ext4: fix undefined bit shift result in ext4_fill_flex_info
The result of the bit shift expression in
'1 << sbi->s_log_groups_per_flex' can be undefined in the case that
s_log_groups_per_flex is 31 because the result of the shift is bigger
than INT_MAX. In reality this probably should not cause much problems
since we'll end up with INT_MIN which will then be converted into
'unsigned int' type, but nevertheless according to the ISO C99 the
result is actually undefined.

Fix this by changing the left operand to 'unsigned int' type.

Note that the commit d50f2ab6f0 already
tried to fix the undefined behaviour, but this was missed.

Thanks to Laszlo Ersek for pointing this out and suggesting the fix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
2012-10-15 12:56:49 -04:00
Theodore Ts'o
06db49e68a ext4: fix metadata checksum calculation for the superblock
The function ext4_handle_dirty_super() was calculating the superblock
on the wrong block data.  As a result, when the superblock is modified
while it is mounted (most commonly, when inodes are added or removed
from the orphan list), the superblock checksum would be wrong.  We
didn't notice because the superblock *was* being correctly calculated
in ext4_commit_super(), and this would get called when the file system
was unmounted.  So the problem only became obvious if the system
crashed while the file system was mounted.

Fix this by removing the poorly designed function signature for
ext4_superblock_csum_set(); if it only took a single argument, the
pointer to a struct superblock, the ambiguity which caused this
mistake would have been impossible.

Reported-by: George Spelvin <linux@horizon.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-10-10 01:06:58 -04:00
Linus Torvalds
6432f21284 The big new feature added this time is supporting online resizing
using the meta_bg feature.  This allows us to resize file systems
 which are greater than 16TB.  In addition, the speed of online
 resizing has been improved in general.
 
 We also fix a number of races, some of which could lead to deadlocks,
 in ext4's Asynchronous I/O and online defrag support, thanks to good
 work by Dmitry Monakhov.
 
 There are also a large number of more minor bug fixes and cleanups
 from a number of other ext4 contributors, quite of few of which have
 submitted fixes for the first time.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQbxMXAAoJENNvdpvBGATwlg4QAJZ4mHNSL2eaaxjRtTbL1pAz
 +FVXpJ3lhw1lSfE9hJGqPVE8EfU2fWjIqxEI7dgh95Tukc5pUnPAQ2/hBz8ZA0qq
 o0AFMk3mRnvCEh6HsZfumsV83eqpR3k/zEy4uFH+KtxBskPe2sEKy3B7qOxvgdKW
 Gh8B2WqF2BpIj9WIT1P9G6xsxZW64EMHTbWcgRhuoRD7bakDNnwQ3kElz/TJQU5q
 bM/5wE7pqKwU2J1L0Ho0mxDi0f/BbXeJdA9k1tQy2KM1pZwHtpj4Ls0qmfoi49GE
 KyZqQOXlFbAz/9tidPDceY5KoRRQm1MwZ+1MimQX1P+40cs/w3pNu3yiibcaXIru
 UZ63AQMCj5JHMcFNVi20sVCwjU/ibNtEO75cfDD4bzPgHJvfCj73EbHTLl21nbTu
 izIMffhJEHmRnmRXiiortYVuI4b19oIfnXg7eclrJoUWSuGwKKsJOc5nMjDqidG4
 B7Gq4TD89sGkIYzx+50E+ll2ispcBN0BQnGqp4k2BzgDyEHhuFYk7VuVQvJgCGTi
 eobzQJj7JUXPWxyemcAVkQTtUq4vVbkm/IwS+/GA9b9Z80X8hR8x6EVHUW5lX3qC
 YHoBSCU4XKZXXWqzx0fIVCXyKKFiBzM+OXcgHOKH90vK8k6kPmPODhNCxvV3pITU
 jfl9q+X1dY4SpybZjLt5
 =iYeV
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The big new feature added this time is supporting online resizing
  using the meta_bg feature.  This allows us to resize file systems
  which are greater than 16TB.  In addition, the speed of online
  resizing has been improved in general.

  We also fix a number of races, some of which could lead to deadlocks,
  in ext4's Asynchronous I/O and online defrag support, thanks to good
  work by Dmitry Monakhov.

  There are also a large number of more minor bug fixes and cleanups
  from a number of other ext4 contributors, quite of few of which have
  submitted fixes for the first time."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
  ext4: fix ext4_flush_completed_IO wait semantics
  ext4: fix mtime update in nodelalloc mode
  ext4: fix ext_remove_space for punch_hole case
  ext4: punch_hole should wait for DIO writers
  ext4: serialize truncate with owerwrite DIO workers
  ext4: endless truncate due to nonlocked dio readers
  ext4: serialize unlocked dio reads with truncate
  ext4: serialize dio nonlocked reads with defrag workers
  ext4: completed_io locking cleanup
  ext4: fix unwritten counter leakage
  ext4: give i_aiodio_unwritten a more appropriate name
  ext4: ext4_inode_info diet
  ext4: convert to use leXX_add_cpu()
  ext4: ext4_bread usage audit
  fs: reserve fallocate flag codepoint
  ext4: remove redundant offset check in mext_check_arguments()
  ext4: don't clear orphan list on ro mount with errors
  jbd2: fix assertion failure in commit code due to lacking transaction credits
  ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
  ext4: enable FITRIM ioctl on bigalloc file system
  ...
2012-10-08 06:36:39 +09:00
Linus Torvalds
aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Kirill A. Shutemov
8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Linus Torvalds
437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Dmitry Monakhov
e27f41e1b7 ext4: give i_aiodio_unwritten a more appropriate name
AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-28 23:24:52 -04:00
Dmitry Monakhov
f45ee3a1ea ext4: ext4_inode_info diet
Generic inode has unused i_private pointer which may be used as cur_aio_dio
storage.

TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
      to have concurent AIO_DIO requests.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-28 23:21:09 -04:00
Wei Yongjun
ba39ebb614 ext4: convert to use leXX_add_cpu()
Convert cpu_to_leXX(leXX_to_cpu(E1) + E2) to use leXX_add_cpu().

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-27 09:37:53 -04:00
Eric Sandeen
c25f9bc614 ext4: don't clear orphan list on ro mount with errors
If the file system contains errors and it is being mounted read-only,
don't clear the orphan list.  We should minimize changes to the file
system if it is mounted read-only.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-26 23:30:12 -04:00
Herton Ronaldo Krzesinski
50df9fd55e ext4: fix crash when accessing /proc/mounts concurrently
The crash was caused by a variable being erronously declared static in
token2str().

In addition to /proc/mounts, the problem can also be easily replicated
by accessing /proc/fs/ext4/<partition>/options in parallel:

$ cat /proc/fs/ext4/<partition>/options > options.txt

... and then running the following command in two different terminals:

$ while diff /proc/fs/ext4/<partition>/options options.txt; do true; done

This is also the cause of the following a crash while running xfstests
#234, as reported in the following bug reports:

	https://bugs.launchpad.net/bugs/1053019
	https://bugzilla.kernel.org/show_bug.cgi?id=47731

Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Brad Figg <brad.figg@canonical.com>
Cc: stable@vger.kernel.org
2012-09-23 22:49:12 -04:00
Eric W. Biederman
4c376dcae8 userns: Convert struct dquot dq_id to be a struct kqid
Change struct dquot dq_id to a struct kqid and remove the now
unecessary dq_type.

Make minimal changes to dquot, quota_tree, quota_v1, quota_v2, ext3,
ext4, and ocfs2 to deal with the change in quota structures and
signatures.  The ocfs2 changes are larger than most because of the
extensive tracing throughout the ocfs2 quota code that prints out
dq_id.

quota_tree.c:get_index is modified to take a struct kqid instead of a
qid_t because all of it's callers pass in dquot->dq_id and it allows
me to introduce only a single conversion.

The rest of the changes are either just replacing dq_type with dq_id.type,
adding conversions to deal with the change in type and occassionally
adding qid_eq to allow quota id comparisons in a user namespace safe way.

Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Theodore Tso <tytso@mit.edu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-09-18 01:01:41 -07:00
Theodore Ts'o
bc0b75f77a ext4: do not enable delalloc by default for ext2
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-17 22:54:36 -04:00
Theodore Ts'o
5e7bbef19c ext4: advertise the fact that the kernel supports meta_bg resizing
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-13 12:11:40 -04:00
Theodore Ts'o
117fff10d7 ext4: grow the s_flex_groups array as needed when resizing
Previously, we allocated the s_flex_groups array to the maximum size
that the file system could be resized.  There was two problems with
this approach.  First, it wasted memory in the common case where the
file system was not resized.  Secondly, once we start allowing online
resizing using the meta_bg scheme, there is no maximum size that the
file system can be resized.  So instead, we need to grow the
s_flex_groups at inline resize time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-05 01:29:50 -04:00
Sachin Kamat
caecd0af8f ext4: replace plain integer with NULL in super.c
Fixes the following sparse warning:
fs/ext4/super.c:1672:45: warning: Using plain integer as NULL pointer

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-18 22:29:18 -04:00
Theodore Ts'o
07724f9897 ext4: drop lock_super()/unlock_super()
We don't need lock_super()/unlock_super() any more, since the places
where it is used, we are protected by the s_umount r/w semaphore.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
2012-08-17 19:08:42 -04:00
Linus Torvalds
ef824bfba2 The following are all bug fixes and regressions. The most notable are
the ones which cause problems for ext4 on RAID --- a performance
 problem when mounting very large filesystems, and a kernel OOPS when
 doing an rm -rf on large directory hierarchies on fast devices.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQLlVSAAoJENNvdpvBGATwhOYQAKx22YF9+SHnnVv2GtCQrsE6
 N3acE/FoDYiO1/LRa5M6NDO3ZIL4Vqi6409LYFQky/SQL8ziM5CBeLOBD9qPTE1L
 AGrzWn4vzZcjEf90ZEBN99fS5Uj1A9Axz2denxy7Hfb2HcSXfcuuVQXpdS42cFo8
 gwtlX4jxmPkbjRlAdeqYVBNuWpTZ0a+UyLc4A6v3aLcbzPSJvPYmI7mKCksiSySU
 yt0atjMDb56blAQJ2TITdAZN6rQShNzyok2pPfxaLusl5g0Gtjq8sSEPof1PQ1zq
 gDFc+kpZvUyPdwQzV3IL8+TodFFZ0x/2OhqoYCTKajROLHGjQFsdkb8EJbnNeDff
 EDxIjeVJR0kzSuSNWu+n5028lmEd9Sk7Ykr37cHeUxG8/0SADUqSDQYNhvbOPQsj
 iq5dwF79tKjuMqjJrABuWA1ZNgHBISXgyBmHLXgEk3LrgucT9UIg8Zlkhq480SYO
 JXhmkO2Ka4UwkVUShoWgRtEzRgUxhINBShs6g67zwm6slS4s2CWHnqhUn6EQe6+r
 DY/hoUA8KbdG3Cf5iJBFM2kUO68CDIXeJjocA7JvlouRgQSxmkOceIuk7DusAitM
 nHJKAtSNgC/z7yMoNi7YN0S5YYcCebmO1MEPzYSpPH07YwLJVNmh9Fk6BIfb7vi3
 vJSQMBrgGrbaXrnhBA7z
 =Zulv
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "The following are all bug fixes and regressions.  The most notable are
  the ones which cause problems for ext4 on RAID --- a performance
  problem when mounting very large filesystems, and a kernel OOPS when
  doing an rm -rf on large directory hierarchies on fast devices."

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix kernel BUG on large-scale rm -rf commands
  ext4: fix long mount times on very big file systems
  ext4: don't call ext4_error while block group is locked
  ext4: avoid kmemcheck complaint from reading uninitialized memory
  ext4: make sure the journal sb is written in ext4_clear_journal_err()
2012-08-17 08:04:47 -07:00
Theodore Ts'o
0e376b1e3c ext4: return an error if kset_create_and_add fails in ext4_init_fs()
In the very unlikely case that kset_create_and_add() fails when the
ext4.ko module is being loaded (or during kernel startup) set err so
that it's clear that the module load failed.

https://bugzilla.kernel.org/show_bug.cgi?id=27912

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 10:04:17 -04:00
Zheng Liu
67a5da564f ext4: make the zero-out chunk size tunable
Currently in ext4 the length of zero-out chunk is set to 7 file system
blocks.  But if an inode has uninitailized extents from using
fallocate to preallocate space, and the workload issues many random
writes, this can cause a fragmented extent tree that will
unnecessarily grow the extent tree.

So create a new sysfs tunable, extent_max_zeroout_kb, which controls
the maximum size where blocks will be zeroed out instead of creating a
new uninitialized extent.  The default of this has been sent to 32kb.

CC: Zach Brown <zab@zabbo.net>
CC: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 09:54:17 -04:00
Theodore Ts'o
df981d03ee ext4: add max_dir_size_kb mount option
Very large directories can cause significant performance problems, or
perhaps even invoke the OOM killer, if the process is running in a
highly constrained memory environment (whether it is VM's with a small
amount of memory or in a small memory cgroup).

So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size.  We do this
via a new mount option, max_dir_size_kb.  If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.

Google-Bug-Id: 6863013

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 09:48:17 -04:00
Theodore Ts'o
0548bbb853 ext4: fix long mount times on very big file systems
Commit 8aeb00ff85: "ext4: fix overhead calculation used by
ext4_statfs()" introduced a O(n**2) calculation which makes very large
file systems take forever to mount.  Fix this with an optimization for
non-bigalloc file systems.  (For bigalloc file systems the overhead
needs to be set in the the superblock.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-08-17 09:23:00 -04:00
Theodore Ts'o
7e731bc9a1 ext4: avoid kmemcheck complaint from reading uninitialized memory
Commit 03179fe923 introduced a kmemcheck complaint in
ext4_da_get_block_prep() because we save and restore
ei->i_da_metadata_calc_last_lblock even though it is left
uninitialized in the case where i_da_metadata_calc_len is zero.

This doesn't hurt anything, but silencing the kmemcheck complaint
makes it easier for people to find real bugs.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631
(which is marked as a regression).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-08-05 23:28:16 -04:00
Theodore Ts'o
d796c52ef0 ext4: make sure the journal sb is written in ext4_clear_journal_err()
After we transfer set the EXT4_ERROR_FS bit in the file system
superblock, it's not enough to call jbd2_journal_clear_err() to clear
the error indication from journal superblock --- we need to call
jbd2_journal_update_sb_errno() as well.  Otherwise, when the root file
system is mounted read-only, the journal is replayed, and the error
indicator is transferred to the superblock --- but the s_errno field
in the jbd2 superblock is left set (since although we cleared it in
memory, we never flushed it out to disk).

This can end up confusing e2fsck.  We should make e2fsck more robust
in this case, but the kernel shouldn't be leaving things in this
confused state, either.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-08-05 19:04:57 -04:00
Artem Bityutskiy
7652bdfcb5 ext4: nuke write_super from comments
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from ext3.

Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:33 +04:00
Linus Torvalds
a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Jan Kara
8e8ad8a57c ext4: Convert to new freezing mechanism
We remove most of frozen checks since upper layer takes care of blocking all
writes. We have to handle protection in ext4_page_mkwrite() in a special way
because we cannot use generic block_page_mkwrite(). Also we add a freeze
protection to ext4_evict_inode() so that iput() of unlinked inode cannot modify
a frozen filesystem (we cannot easily instrument ext4_journal_start() /
ext4_journal_stop() with freeze protection because we are missing the
superblock pointer in ext4_journal_stop() in nojournal mode).

CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 09:45:48 +04:00
Linus Torvalds
173f865474 The usual collection of bug fixes and optimizations. Perhaps of
greatest note is a speed up for parallel, non-allocating DIO writes,
 since we no longer take the i_mutex lock in that case.  For bug fixes,
 we fix an incorrect overhead calculation which caused slightly
 incorrect results for df(1) and statfs(2).  We also fixed bugs in the
 metadata checksum feature.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQE1SzAAoJENNvdpvBGATwzOMQAKxVkaTqkMc+cUahLLUkFdGQ
 xnmigHufVuOvv+B8l1p6vbYx+qOftztqXF1WC41j8mTfEDs19jKXv3LI57TJ+bIW
 a/Kp1CjMicRs/HGhFPNWp1D+N8J95MTFP6Y9biXSmBBvefg2NSwxpg48yZtjUy1/
 Zl0414AqMvTJyqKKOUa++oyl3XlawnzDZ+6a7QPKsrAaDOU5977W4y2tZkNFk84d
 qRjTfaiX13aVe7XupgQHe/jtk40Sj38pyGVGiAGlHOZhZtYKE6MzB8OreGiMTy8z
 rmJz/CQUsQ8+sbKYhAcDru+bMrKghbO0u78XRz9CY+YpVFF/Xs2QiXoV0ZOGkIm6
 eokq7jEV0K+LtDEmM3PkmUPgXfYw5damTv8qWuBUFd4wtVE5x/DmK8AJVMidCAUN
 GkVR+rEbbEi7RCwsuac/aKB8baVQCTiJ5tfNTgWh9zll+9GZSk+U71Pp0KdcJGiS
 nxitAZ+20hZN2CQctlmaGbCPTPYCWU4hQ3IuMdTlQTQAs8S0y1FtylTRsXcC1eVR
 i1hBS/dVw5PVCaqoX79zYrByUymgX0ZaYY6seRT6+U9xPGDCSNJ9mjXZL3fttnOC
 d5gsx/pbMIAv52G5Hj6DfsXR2JFmmxsaIzsLtRvKi9q89d84XaZfbUsHYjn4Neym
 5lTKaSQHU71cKCxrStHC
 =VAVB
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The usual collection of bug fixes and optimizations.  Perhaps of
  greatest note is a speed up for parallel, non-allocating DIO writes,
  since we no longer take the i_mutex lock in that case.

  For bug fixes, we fix an incorrect overhead calculation which caused
  slightly incorrect results for df(1) and statfs(2).  We also fixed
  bugs in the metadata checksum feature."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: undo ext4_calc_metadata_amount if we fail to claim space
  ext4: don't let i_reserved_meta_blocks go negative
  ext4: fix hole punch failure when depth is greater than 0
  ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
  ext4: weed out ext4_write_super
  ext4: remove unnecessary superblock dirtying
  ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
  ext4: remove useless marking of superblock dirty
  ext4: fix ext4 mismerge back in January
  ext4: remove dynamic array size in ext4_chksum()
  ext4: remove unused variable in ext4_update_super()
  ext4: make quota as first class supported feature
  ext4: don't take the i_mutex lock when doing DIO overwrites
  ext4: add a new nolock flag in ext4_map_blocks
  ext4: split ext4_file_write into buffered IO and direct IO
  ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()
  ext4: fix out-of-date comments in extents.c
  ext4: use s_csum_seed instead of i_csum_seed for xattr block
  ext4: use proper csum calculation in ext4_rename
  ext4: fix overhead calculation used by ext4_statfs()
  ...
2012-07-27 20:52:25 -07:00
Artem Bityutskiy
4d47603d97 ext4: weed out ext4_write_super
We do not depend on VFS's '->write_super()' anymore and do not need
the 's_dirt' flag anymore, so weed out 'ext4_write_super()' and
's_dirt'.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:35:31 -04:00
Artem Bityutskiy
58c5873a76 ext4: remove unnecessary superblock dirtying
This patch changes the 'ext4_handle_dirty_super()' function which
submits the superblock for I/O in the following cases:

1. When creating the first large file on a file system without
   EXT4_FEATURE_RO_COMPAT_LARGE_FILE feature.
2. When re-sizing the file-system.
3. When creating an xattr on a file-system without the
   EXT4_FEATURE_COMPAT_EXT_ATTR feature.

If the file-system has journal enabled, the superblock is written via
the journal. We do not modify this path.

If the file-system has no journal, this function, falls back to just
marking the superblock as dirty using the 's_dirt' superblock
flag. This means that it delays the actual superblock I/O submission
by 5 seconds (default setting).  Namely, the 'sync_supers()' kernel
thread will call 'ext4_write_super()' later and will actually submit
the superblock for I/O.

And this is the behavior this patch modifies: we stop using 's_dirt'
and just mark the superblock buffer as dirty right away. Indeed, all 3
cases above are extremely rare and it does not add any value to delay
the I/O submission for them.

Note: 'ext4_handle_dirty_super()' executes
'__ext4_handle_dirty_super()' with 'now = 0'. This patch basically
makes the 'now' argument unneeded and it will be deleted in one of the
next patches.

This patch also removes 's_dirt' condition on the unmount path because
we never set it anymore, so we should not test it.

Tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-07-22 20:33:31 -04:00
Aditya Kali
7c319d3285 ext4: make quota as first class supported feature
This patch adds support for quotas as a first class feature in ext4;
which is to say, the quota files are stored in hidden inodes as file
system metadata, instead of as separate files visible in the file system
directory hierarchy.

It is based on the proposal at:                                                                                                           
https://ext4.wiki.kernel.org/index.php/Design_For_1st_Class_Quota_in_Ext4

This patch introduces a new feature - EXT4_FEATURE_RO_COMPAT_QUOTA
which, when turned on, enables quota accounting at mount time
iteself. Also, the quota inodes are stored in two additional superblock
fields.  Some changes introduced by this patch that should be pointed
out are:

1) Two new ext4-superblock fields - s_usr_quota_inum and
   s_grp_quota_inum for storing the quota inodes in use.
2) Default quota inodes are: inode#3 for tracking userquota and inode#4
   for tracking group quota. The superblock fields can be set to use
   other inodes as well.
3) If the QUOTA feature and corresponding quota inodes are set in
   superblock, the quota usage tracking is turned on at mount time. On
   'quotaon' ioctl, the quota limits enforcement is turned
   on. 'quotaoff' ioctl turns off only the limits enforcement in this
   case.
4) When QUOTA feature is in use, the quota mount options 'quota',
   'usrquota', 'grpquota' are ignored by the kernel.
5) mke2fs or tune2fs can be used to set the QUOTA feature and initialize
   quota inodes. The default reserved inodes will not be visible to user
   as regular files.
6) The quota-tools will need to be modified to support hidden quota
   files on ext4. E2fsprogs will also include support for creating and
   fixing quota files.
7) Support is only for the new V2 quota file format.

Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-22 20:21:31 -04:00
Jan Kara
a117782571 quota: Move quota syncing to ->sync_fs method
Since the moment writes to quota files are using block device page cache and
space for quota structures is reserved at the moment they are first accessed we
have no reason to sync quota before inode writeback. In fact this order is now
only harmful since quota information can easily change during inode writeback
(either because conversion of delayed-allocated extents or simply because of
allocation of new blocks for simple filesystems not using page_mkwrite).

So move syncing of quota information after writeback of inodes into ->sync_fs
method. This way we do not have to use ->quota_sync callback which is primarily
intended for use by quotactl syscall anyway and we get rid of calling
->sync_fs() twice unnecessarily. We skip quota syncing for OCFS2 since it does
proper quota journalling in all cases (unlike ext3, ext4, and reiserfs which
also support legacy non-journalled quotas) and thus there are no dirty quota
structures.

CC: "Theodore Ts'o" <tytso@mit.edu>
CC: Joel Becker <jlbec@evilplan.org>
CC: reiserfs-devel@vger.kernel.org
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Dave Kleikamp <shaggy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:58:34 +04:00
Theodore Ts'o
952fc18ef9 ext4: fix overhead calculation used by ext4_statfs()
Commit f975d6bcc7 introduced bug which caused ext4_statfs() to
miscalculate the number of file system overhead blocks.  This causes
the f_blocks field in the statfs structure to be larger than it should
be.  This would in turn cause the "df" output to show the number of
data blocks in the file system and the number of data blocks used to
be larger than they should be.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-07-09 16:27:05 -04:00
Linus Torvalds
4edebed866 Ext4 updates for 3.5
The major new feature added in this update is Darrick J. Wong's
 metadata checksum feature, which adds crc32 checksums to ext4's
 metadata fields.  There is also the usual set of cleanups and bug
 fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPyNleAAoJENNvdpvBGATwtLMP/i3WsPyTvxmYP6HttHXQb8Jk
 GYCoTQ5bZMuTbOwOGg3w137cXWBv5uuPpxIk79YVLHSWx6HuanlGIa7/VnPKIaLu
 2ihuvVfnrDqpwQ4MJaSq4R1Eka9JCwZ7HbYYo+fYOVobxgw588JVV9VVI9EdKRGz
 z11UkW8iHE0f6Xa5gOhdAMkR0uaPnxwJX/qHZYiHuognRivuwMglqWJSiMr8nQmo
 A2GmeoLehhW+k65IqgTCmSW6ZgFTvZdk6bskQIij3fOYHW3hHn/gcLFtmLTIZ/B5
 LZdg/lngPYve+R/UyypliGKi+pv1qNEiTiBm0rrBgsdZFkBdGj0soSvGZzeK+Mp4
 Q1vAmOBPYPFzs6nVzPst2n/osryyykFCK6TgSGZ50dosJ0NO8cBeDdX/gh9JKD2R
 yQUMUltOCCSj/eWU4iwqZ0T3FXRiH/+S3XMHznoKJiwUyGDBNQy4+Yg2k2WzUXrz
 Cu5t5BwNG2WNP7y5Et/wmUIzpC7VPId4qYmGyHe7OwTxSJgW+6f7GVkHfjWcDMuv
 pGgEUiInbMmLajP3v2/LKfVU4hXLZy4uJbhoBgDdeIpZrnPifJG/MwDOS4W+dLVT
 tDzgO1SAh3/E4jATreZ5bjzD/HGsfe1OX09UH3Pbc1EcgkrLnyrQXFwdHshdVu4A
 cxMoKNPVCQJySb1UrLkO
 =SdJJ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull Ext4 updates from Theodore Ts'o:
 "The major new feature added in this update is Darrick J Wong's
  metadata checksum feature, which adds crc32 checksums to ext4's
  metadata fields.

  There is also the usual set of cleanups and bug fixes."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: hole-punch use truncate_pagecache_range
  jbd2: use kmem_cache_zalloc wrapper instead of flag
  ext4: remove mb_groups before tearing down the buddy_cache
  ext4: add ext4_mb_unload_buddy in the error path
  ext4: don't trash state flags in EXT4_IOC_SETFLAGS
  ext4: let getattr report the right blocks in delalloc+bigalloc
  ext4: add missing save_error_info() to ext4_error()
  ext4: add debugging trigger for ext4_error()
  ext4: protect group inode free counting with group lock
  ext4: use consistent ssize_t type in ext4_file_write()
  ext4: fix format flag in ext4_ext_binsearch_idx()
  ext4: cleanup in ext4_discard_allocated_blocks()
  ext4: return ENOMEM when mounts fail due to lack of memory
  ext4: remove redundundant "(char *) bh->b_data" casts
  ext4: disallow hard-linked directory in ext4_lookup
  ext4: fix potential integer overflow in alloc_flex_gd()
  ext4: remove needs_recovery in ext4_mb_init()
  ext4: force ro mount if ext4_setup_super() fails
  ext4: fix potential NULL dereference in ext4_free_inodes_counts()
  ext4/jbd2: add metadata checksumming to the list of supported features
  ...
2012-06-01 10:12:15 -07:00
Theodore Ts'o
f3fc0210c0 ext4: add missing save_error_info() to ext4_error()
The ext4_error() function is missing a call to save_error_info().
Since this is the function which marks the file system as containing
an error, this oversight (which was introduced in 2.6.36) is quite
significant, and should be backported to older stable kernels with
high urgency.

Reported-by: Ken Sumrall <ksumrall@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ksumrall@google.com
Cc: stable@kernel.org
2012-05-30 23:00:16 -04:00
Theodore Ts'o
2c0544b235 ext4: add debugging trigger for ext4_error()
Make it easy to test whether or not the error handling subsystem in
ext4 is working correctly.  This allows us to simulate an ext4_error()
by echoing a string to /sys/fs/ext4/<dev>/trigger_fs_error.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ksumrall@google.com
2012-05-30 22:56:46 -04:00
Theodore Ts'o
2cde417de0 ext4: return ENOMEM when mounts fail due to lack of memory
This is a port of the ext3 commit: 4569cd1b0d

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:49:54 -04:00
Theodore Ts'o
2716b80284 ext4: remove redundundant "(char *) bh->b_data" casts
The b_data field of the buffer_head is already a char *, so there's no
point casting it to a char *.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:47:52 -04:00
Akira Fujita
9d99012ff2 ext4: remove needs_recovery in ext4_mb_init()
needs_recovery in ext4_mb_init() is not used, remove it.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 14:19:25 -04:00
Eric Sandeen
7e84b62164 ext4: force ro mount if ext4_setup_super() fails
If ext4_setup_super() fails i.e. due to a too-high revision,
the error is logged in dmesg but the fs is not mounted RO as
indicated.

Tested by:

# mkfs.ext4 -r 4 /dev/sdb6
# mount /dev/sdb6 /mnt/test
# dmesg | grep "too high"
[164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode
# grep sdb6 /proc/mounts
/dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0

Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-05-28 14:17:25 -04:00
Linus Torvalds
90324cc1b1 avoid iput() from flusher thread
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
 uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
 +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
 KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
 ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
 tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
 Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
 NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
 /9c4zxxekjHZnn2oooEa
 =oLXf
 -----END PGP SIGNATURE-----

Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

Pull writeback tree from Wu Fengguang:
 "Mainly from Jan Kara to avoid iput() in the flusher threads."

* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Avoid iput() from flusher thread
  vfs: Rename end_writeback() to clear_inode()
  vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
  writeback: Refactor writeback_single_inode()
  writeback: Remove wb->list_lock from writeback_single_inode()
  writeback: Separate inode requeueing after writeback
  writeback: Move I_DIRTY_PAGES handling
  writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  writeback: Move clearing of I_SYNC into inode_sync_complete()
  writeback: initialize global_dirty_limit
  fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
  mm: page-writeback.c: local functions should not be exposed globally
2012-05-28 09:54:45 -07:00
Darrick J. Wong
25ed6e8a54 jbd2: enable journal clients to enable v2 checksumming
Add in the necessary code so that journal clients can enable the new
journal checksumming features.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-27 07:48:56 -04:00
Linus Torvalds
ece78b7df7 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, ext3 and quota fixes from Jan Kara:
 "Interesting bits are:
   - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since
     quota code does not need i_mutex anymore in any unusual way.
   - backport (from ext4) of a fix of a checkpointing bug (missing cache
     flush) that could lead to fs corruption on power failure

  The rest are just random small fixes & cleanups."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: trivial fix to comment for ext2_free_blocks
  ext2: remove the redundant comment for ext2_export_ops
  ext3: return 32/64-bit dir name hash according to usage type
  quota: Get rid of nested I_MUTEX_QUOTA locking subclass
  quota: Use precomputed value of sb_dqopt in dquot_quota_sync
  ext2: Remove i_mutex use from ext2_quota_write()
  reiserfs: Remove i_mutex use from reiserfs_quota_write()
  ext4: Remove i_mutex use from ext4_quota_write()
  ext3: Remove i_mutex use from ext3_quota_write()
  quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG
  jbd: Write journal superblock with WRITE_FUA after checkpointing
  jbd: protect all log tail updates with j_checkpoint_mutex
  jbd: Split updating of journal superblock and marking journal empty
  ext2: do not register write_super within VFS
  ext2: Remove s_dirt handling
  ext2: write superblock only once on unmount
  ext3: update documentation with barrier=1 default
  ext3: remove max_debt in find_group_orlov()
  jbd: Refine commit writeout logic
2012-05-25 08:14:59 -07:00
Linus Torvalds
644473e9c6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace enhancements from Eric Biederman:
 "This is a course correction for the user namespace, so that we can
  reach an inexpensive, maintainable, and reasonably complete
  implementation.

  Highlights:
   - Config guards make it impossible to enable the user namespace and
     code that has not been converted to be user namespace safe.

   - Use of the new kuid_t type ensures the if you somehow get past the
     config guards the kernel will encounter type errors if you enable
     user namespaces and attempt to compile in code whose permission
     checks have not been updated to be user namespace safe.

   - All uids from child user namespaces are mapped into the initial
     user namespace before they are processed.  Removing the need to add
     an additional check to see if the user namespace of the compared
     uids remains the same.

   - With the user namespaces compiled out the performance is as good or
     better than it is today.

   - For most operations absolutely nothing changes performance or
     operationally with the user namespace enabled.

   - The worst case performance I could come up with was timing 1
     billion cache cold stat operations with the user namespace code
     enabled.  This went from 156s to 164s on my laptop (or 156ns to
     164ns per stat operation).

   - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
     Most uid/gid setting system calls treat these value specially
     anyway so attempting to use -1 as a uid would likely cause
     entertaining failures in userspace.

   - If setuid is called with a uid that can not be mapped setuid fails.
     I have looked at sendmail, login, ssh and every other program I
     could think of that would call setuid and they all check for and
     handle the case where setuid fails.

   - If stat or a similar system call is called from a context in which
     we can not map a uid we lie and return overflowuid.  The LFS
     experience suggests not lying and returning an error code might be
     better, but the historical precedent with uids is different and I
     can not think of anything that would break by lying about a uid we
     can't map.

   - Capabilities are localized to the current user namespace making it
     safe to give the initial user in a user namespace all capabilities.

  My git tree covers all of the modifications needed to convert the core
  kernel and enough changes to make a system bootable to runlevel 1."

Fix up trivial conflicts due to nearby independent changes in fs/stat.c

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
  userns:  Silence silly gcc warning.
  cred: use correct cred accessor with regards to rcu read lock
  userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
  userns: Convert cgroup permission checks to use uid_eq
  userns: Convert tmpfs to use kuid and kgid where appropriate
  userns: Convert sysfs to use kgid/kuid where appropriate
  userns: Convert sysctl permission checks to use kuid and kgids.
  userns: Convert proc to use kuid/kgid where appropriate
  userns: Convert ext4 to user kuid/kgid where appropriate
  userns: Convert ext3 to use kuid/kgid where appropriate
  userns: Convert ext2 to use kuid/kgid where appropriate.
  userns: Convert devpts to use kuid/kgid where appropriate
  userns: Convert binary formats to use kuid/kgid where appropriate
  userns: Add negative depends on entries to avoid building code that is userns unsafe
  userns: signal remove unnecessary map_cred_ns
  userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
  userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
  userns: Convert stat to return values mapped from kuids and kgids
  userns: Convert user specfied uids and gids in chown into kuids and kgid
  userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
  ...
2012-05-23 17:42:39 -07:00
Theodore Ts'o
f32aaf2d2b ext4: enable the 64-bit jbd2 feature based on the 64-bit ext4 feature
Previously we were only enabling the 64-bit jbd2 feature if the number
of blocks in the file system was greater 2**32-1.  The problem with
this is that it makes it harder to test the 64-bit journal code paths
with small file systems, since a small test file system would with the
64-bit ext4 feature enable would use a 64-bit file system on-disk data
structures, but use a 32-bit journal.

This would also cause problems when trying to do an online resize to
grow the filesystem above the 2**32-1 boundary.  Fortunately the patch
to support online resize for 64-bit file systems hasn't been merged
yet, so this problem hasn't arisen in practice.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-21 11:42:02 -04:00
Eric W. Biederman
08cefc7ab8 userns: Convert ext4 to user kuid/kgid where appropriate
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15 14:59:27 -07:00
Jan Kara
0b7f7cefae ext4: Remove i_mutex use from ext4_quota_write()
We don't need i_mutex in ext4_quota_write() because writes to quota file
are serialized by dqio_mutex anyway. Changes to quota files outside of quota
code are forbidded and enforced by NOATIME and IMMUTABLE bits.

Signed-off-by: Jan Kara <jack@suse.cz>
2012-05-15 23:34:38 +02:00
Jan Kara
dbd5768f87 vfs: Rename end_writeback() to clear_inode()
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-06 13:43:41 +08:00
Darrick J. Wong
feb0ab32a5 ext4: make block group checksums use metadata_csum algorithm
metadata_csum supersedes uninit_bg.  Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set.  Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:45:10 -04:00
Darrick J. Wong
a9c4731780 ext4: calculate and verify superblock checksum
Calculate and verify the superblock checksum.  Since the UUID and
block group number are embedded in each copy of the superblock, we
need only checksum the entire block.  Refactor some of the code to
eliminate open-coding of the checksum update call.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:29:10 -04:00
Darrick J. Wong
0441984a33 ext4: load the crc32c driver if necessary
Obtain a reference to the cryptoapi and crc32c if we mount a
filesystem with metadata checksumming enabled.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:27:10 -04:00
Darrick J. Wong
d25425f8e0 ext4: record the checksum algorithm in use in the superblock
Record the type of checksum algorithm we're using for metadata in the
superblock, in case we ever want/need to change the algorithm.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:25:10 -04:00
Eldad Zack
db7e5c668e super.c: unused variable warning without CONFIG_QUOTA
sb info is only checked with quota support.

fs/ext4/super.c: In function ‘parse_options’:
fs/ext4/super.c:1600:23: warning: unused variable ‘sbi’ [-Wunused-variable]

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2012-04-23 21:44:41 -04:00
Theodore Ts'o
57f73c2c89 ext4: fix handling of journalled quota options
Commit 26092bf5 broke handling of journalled quota mount options by
trying to parse argument of every mount option as a number.  Fix this
by dealing with the quota options before we call match_int().

Thanks to Jan Kara for discovering this regression.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2012-04-16 18:55:26 -04:00
Theodore Ts'o
9cd70b347e ext4: address scalability issue by removing extent cache statistics
Andi Kleen and Tim Chen have reported that under certain circumstances
the extent cache statistics are causing scalability problems due to
cache line bounces.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-04-16 12:16:20 -04:00
Linus Torvalds
69e1aaddd6 Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes
The changes to export dirty_writeback_interval are from Artem's s_dirt
 cleanup patch series.  The same is true of the change to remove the
 s_dirt helper functions which never got used by anyone in-tree.  I've
 run these changes by Al Viro, and am carrying them so that Artem can
 more easily fix up the rest of the file systems during the next merge
 window.  (Originally we had hopped to remove the use of s_dirt from
 ext4 during this merge window, but his patches had some bugs, so I
 ultimately ended dropping them from the ext4 tree.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABCAAGBQJPb39rAAoJENNvdpvBGATwVz8P/3V1NqSsk20VJOLbmEE45GxL
 GDzQJ6OsFG0UiQk6ISSrSdwxfav/KTCGySsU9UtAoOdPcBwnnsf8S7wc6OggwwuC
 hBFGwwFzk6YSQaZ58sUxWRGeOJuP/FPem6Id6buC4DQ1KIcznP/hEEgEnh/ir4Ec
 vrsfexY93TR8BE2Mi23v2epDVLU0B6bY/w9nDqbTXif3xN/gh/ypoHHouuM6Bs2n
 TyWHOwD15NwfnvRHd8PfDDqQM/D29x3QI0FMrWj9McpwIz4d4cBfhN4LQ/G+yLDY
 izv5DM10GbinwHPrsOTGVAW3KIdSS9rP3jCJGVuOrJZ9ufGXosvHuIYVhI7J3SBK
 JhBu6QEsN1IsvlVYpz9q8mqVKaDXQLsz2eaTw+i4yfmyOk1kOX7nIEOxYFF78G+V
 Of/W1SpIpJQaXvLHRcDj9fDj0fZTciUZA8v7/HOFS+co2dzIl0iZbcfBFp0/56RY
 sWdQoeRlx1ciVDPR+w2TQO5w3VWQw1gT5aqux0NiPj0XFoiUHScxgNGAYbqENMQw
 v9chvyDMlorqj0rF/Vey5SssgEDi7MTdYuYTi4YyMqr7pcvOJaO85pf+wH9g2eKW
 XhW33PhPGuwCJDP5Pg8Y0Z2Hp/Q3DCqhLqhGfTyAs/NG9+hR4wgp3VWb8CUqhA1t
 C/yzNeOYqScAefCzQx2V
 =+9zk
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates for 3.4 from Ted Ts'o:
 "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

  The changes to export dirty_writeback_interval are from Artem's s_dirt
  cleanup patch series.  The same is true of the change to remove the
  s_dirt helper functions which never got used by anyone in-tree.  I've
  run these changes by Al Viro, and am carrying them so that Artem can
  more easily fix up the rest of the file systems during the next merge
  window.  (Originally we had hopped to remove the use of s_dirt from
  ext4 during this merge window, but his patches had some bugs, so I
  ultimately ended dropping them from the ext4 tree.)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
  vfs: remove unused superblock helpers
  mm: export dirty_writeback_interval
  ext4: remove useless s_dirt assignment
  ext4: write superblock only once on unmount
  ext4: do not mark superblock as dirty unnecessarily
  ext4: correct ext4_punch_hole return codes
  ext4: remove restrictive checks for EOFBLOCKS_FL
  ext4: always set then trimmed blocks count into len
  ext4: fix trimmed block count accunting
  ext4: fix start and len arguments handling in ext4_trim_fs()
  ext4: update s_free_{inodes,blocks}_count during online resize
  ext4: change some printk() calls to use ext4_msg() instead
  ext4: avoid output message interleaving in ext4_error_<foo>()
  ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
  ext4: add no_printk argument validation, fix fallout
  ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
  ext4: give more helpful error message in ext4_ext_rm_leaf()
  ext4: remove unused code from ext4_ext_map_blocks()
  ext4: rewrite punch hole to use ext4_ext_remove_space()
  jbd2: cleanup journal tail after transaction commit
  ...
2012-03-28 10:02:55 -07:00
Artem Bityutskiy
182f514f88 ext4: remove useless s_dirt assignment
Clean-up ext4 a tiny bit by removing useless s_dirt assignment in
'ext4_fill_super()' because a bit later we anyway call
'ext4_setup_super()' which writes the superblock to the media
unconditionally.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:30:06 -04:00
Artem Bityutskiy
a8e25a8324 ext4: write superblock only once on unmount
In some rather rare cases it is possible that ext4 may the superblock
to the media twice. This patch makes sure this does not happen. This
should speed up unmounting in those cases.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 22:29:15 -04:00
Al Viro
07c0c5d8b8 ext4: initialization of ext4_li_mtx needs to be done earlier
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 22:05:02 -04:00
Al Viro
48fde701af switch open-coded instances of d_make_root() to new helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 21:29:35 -04:00
Theodore Ts'o
92b9781658 ext4: change some printk() calls to use ext4_msg() instead
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:41:49 -04:00
Joe Perches
d9ee81da93 ext4: avoid output message interleaving in ext4_error_<foo>()
Using KERN_CONT means that messages from multiple threads may be
interleaved.  Avoid this by using a single printk call in
ext4_error_inode and ext4_error_file.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:15:43 -04:00
Theodore Ts'o
f70486055e ext4: try to deprecate noacl and noxattr_user mount options
No other file system allows ACL's and extended attributes to be
enabled or disabled via a mount option.  So let's try to deprecate
these options from ext4.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 22:06:20 -05:00
Theodore Ts'o
c7198b9c1e ext4: ignore mount options supported by ext2/3 (but have since been removed)
Users who tried to use the ext4 file system driver is being used for
the ext2 or ext3 file systems (via the CONFIG_EXT4_USE_FOR_EXT23
option) could have failed mounts if their /etc/fstab contains options
recognized by ext2 or ext3 but which have since been removed in ext4.

So teach ext4 to recognize them and give a warning that the mount
option was removed.

Report: https://bbs.archlinux.org/profile.php?id=33804

Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Baechler <thomas@archlinux.org>
Cc: Tobias Powalowski <tobias.powalowski@googlemail.com>
Cc: Dave Reisner <d@falconindy.com>
2012-03-04 22:00:53 -05:00
Theodore Ts'o
66acdcf4ea ext4: add debugging /proc file showing file system options
Now that /proc/mounts is consistently showing only those mount options
which need to be specified in /etc/fstab or on the mount command line,
it is useful to have file which shows exactly which file system
options are enabled.  This can be useful when debugging a user
problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 20:21:38 -05:00
Theodore Ts'o
5a916be1b3 ext4: make ext4_show_options() be table-driven
Consistently show mount options which are the non-default, so that
/proc/mounts accurately shows the mount options that would be
necessary to mount the file system in its current mode of operation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 19:27:31 -05:00
Theodore Ts'o
2adf6da837 ext4: move ext4_show_options() after parse_options()
This commit is strictly a code movement so in preparation of changing
ext4_show_options to be table driven.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:50 -05:00
Theodore Ts'o
26092bf524 ext4: use a table-driven handler for mount options
By using a table-drive approach, we shave about 100 lines of code from
ext4, and make the code a bit more regular and factored out.  This
will also make it possible in a future patch to use this table for
displaying the mount options that were specified in /proc/mounts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 23:20:47 -05:00
Theodore Ts'o
72578c33c4 ext4: unify handling of mount options which have been removed
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 18:04:40 -05:00
Theodore Ts'o
39ef17f1b0 ext4: simplify handling of the errors=* mount options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 17:56:23 -05:00
Theodore Ts'o
c64db50e76 ext4: remove the I_VERSION mount flag and use the super_block flag instead
There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:23:11 -05:00
Theodore Ts'o
ee4a3fcd1d ext4: remove Opt_ignore
This is completely unused so let's just get rid of it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:14:24 -05:00
Theodore Ts'o
87f26807e9 ext4: remove deprecation warnings for minix_df and grpid
People complained about removing both of these features, so per
Linus's dictate, we won't be able to remove them.  Sigh...

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 00:03:21 -05:00
Eric Sandeen
661aa52057 ext4: remove the resize mount option
The resize mount option seems to be of limited value,
especially in the age of online resize2fs.  Nuke it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Eric Sandeen
43e625d84f ext4: remove the journal=update mount option
The V2 journal format was introduced around ten years ago,
for ext3. It seems highly unlikely that anyone will need this
migration option for ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:04 -05:00
Bobi Jam
18aadd47f8 ext4: expand commit callback and
The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Theodore Ts'o
ff9cb1c4ee Merge branch 'for_linus' into for_linus_merged
Conflicts:
	fs/ext4/ioctl.c
2012-01-10 11:54:07 -05:00
Xi Wang
d50f2ab6f0 ext4: fix undefined behavior in ext4_fill_flex_info()
Commit 503358ae01 ("ext4: avoid divide by
zero when trying to mount a corrupted file system") fixes CVE-2009-4307
by performing a sanity check on s_log_groups_per_flex, since it can be
set to a bogus value by an attacker.

	sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
	groups_per_flex = 1 << sbi->s_log_groups_per_flex;

	if (groups_per_flex < 2) { ... }

This patch fixes two potential issues in the previous commit.

1) The sanity check might only work on architectures like PowerPC.
On x86, 5 bits are used for the shifting amount.  That means, given a
large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
is essentially 1 << 4 = 16, rather than 0.  This will bypass the check,
leaving s_log_groups_per_flex and groups_per_flex inconsistent.

2) The sanity check relies on undefined behavior, i.e., oversized shift.
A standard-confirming C compiler could rewrite the check in unexpected
ways.  Consider the following equivalent form, assuming groups_per_flex
is unsigned for simplicity.

	groups_per_flex = 1 << sbi->s_log_groups_per_flex;
	if (groups_per_flex == 0 || groups_per_flex == 1) {

We compile the code snippet using Clang 3.0 and GCC 4.6.  Clang will
completely optimize away the check groups_per_flex == 0, leaving the
patched code as vulnerable as the original.  GCC keeps the check, but
there is no guarantee that future versions will do the same.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-01-10 11:51:10 -05:00
Al Viro
94bf608a18 ext4: fix failure exits
a) leaking root dentry is bad
b) in case of failed ext4_mb_init() we don't want to do ext4_mb_release()
c) OTOH, in the same case we *do* want ext4_ext_release()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-09 15:57:20 -05:00
Linus Torvalds
eb59c505f8 Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
  PM / Hibernate: Implement compat_ioctl for /dev/snapshot
  PM / Freezer: fix return value of freezable_schedule_timeout_killable()
  PM / shmobile: Allow the A4R domain to be turned off at run time
  PM / input / touchscreen: Make st1232 use device PM QoS constraints
  PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
  PM / shmobile: Remove the stay_on flag from SH7372's PM domains
  PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
  PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
  PM: Drop generic_subsys_pm_ops
  PM / Sleep: Remove forward-only callbacks from AMBA bus type
  PM / Sleep: Remove forward-only callbacks from platform bus type
  PM: Run the driver callback directly if the subsystem one is not there
  PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
  PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
  PM / Sleep: Merge internal functions in generic_ops.c
  PM / Sleep: Simplify generic system suspend callbacks
  PM / Hibernate: Remove deprecated hibernation snapshot ioctls
  PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
  ARM: S3C64XX: Implement basic power domain support
  PM / shmobile: Use common always on power domain governor
  ...

Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
2012-01-08 13:10:57 -08:00
Al Viro
34c80b1d93 vfs: switch ->show_options() to struct dentry *
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:19:54 -05:00
Al Viro
d8c9584ea2 vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-06 23:16:53 -05:00
Ben Hutchings
1d526fc91b ext4: Report max_batch_time option correctly
Currently the value reported for max_batch_time is really the
value of min_batch_time.

Reported-by: Russell Coker <russell@coker.com.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-04 21:22:51 -05:00
Al Viro
6b520e0565 vfs: fix the stupidity with i_dentry in inode destructors
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else.  Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:52:40 -05:00
Rafael J. Wysocki
b00f4dc5ff Merge branch 'master' into pm-sleep
* master: (848 commits)
  SELinux: Fix RCU deref check warning in sel_netport_insert()
  binary_sysctl(): fix memory leak
  mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
  ipmi_watchdog: restore settings when BMC reset
  oom: fix integer overflow of points in oom_badness
  memcg: keep root group unchanged if creation fails
  nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
  nilfs2: unbreak compat ioctl
  cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
  evm: prevent racing during tfm allocation
  evm: key must be set once during initialization
  mmc: vub300: fix type of firmware_rom_wait_states module parameter
  Revert "mmc: enable runtime PM by default"
  mmc: sdhci: remove "state" argument from sdhci_suspend_host
  x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
  IB/qib: Correct sense on freectxts increment and decrement
  RDMA/cma: Verify private data length
  cgroups: fix a css_set not found bug in cgroup_attach_proc
  oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
  Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
  ...

Conflicts:
	kernel/cgroup_freezer.c
2011-12-21 21:59:45 +01:00
Zheng Liu
5635a62b83 ext4: add missing space to ext4_msg output in ext4_fill_super()
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-18 16:13:58 -05:00
Theodore Ts'o
fc6cb1cda5 ext4: display the correct mount option in /proc/mounts for [no]init_itable
/proc/mounts was showing the mount option [no]init_inode_table when
the correct mount option that will be accepted by parse_options() is
[no]init_itable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-12-12 22:06:18 -05:00
Rafael J. Wysocki
986b11c3ee Merge branch 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into pm-freezer
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
  freezer: fix wait_event_freezable/__thaw_task races
  freezer: kill unused set_freezable_with_signal()
  dmatest: don't use set_freezable_with_signal()
  usb_storage: don't use set_freezable_with_signal()
  freezer: remove unused @sig_only from freeze_task()
  freezer: use lock_task_sighand() in fake_signal_wake_up()
  freezer: restructure __refrigerator()
  freezer: fix set_freezable[_with_signal]() race
  freezer: remove should_send_signal() and update frozen()
  freezer: remove now unused TIF_FREEZE
  freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
  cgroup_freezer: prepare for removal of TIF_FREEZE
  freezer: clean up freeze_processes() failure path
  freezer: kill PF_FREEZING
  freezer: test freezable conditions while holding freezer_lock
  freezer: make freezing indicate freeze condition in effect
  freezer: use dedicated lock instead of task_lock() + memory barrier
  freezer: don't distinguish nosig tasks on thaw
  freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
  freezer: rename thaw_process() to __thaw_task() and simplify the implementation
  ...
2011-11-23 21:09:02 +01:00
Tejun Heo
a0acae0e88 freezer: unexport refrigerator() and update try_to_freeze() slightly
There is no reason to export two functions for entering the
refrigerator.  Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.

* Rename refrigerator() to __refrigerator() and make it return bool
  indicating whether it scheduled out for freezing.

* Update try_to_freeze() to return bool and relay the return value of
  __refrigerator() if freezing().

* Convert all refrigerator() users to try_to_freeze().

* Update documentation accordingly.

* While at it, add might_sleep() to try_to_freeze().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
2011-11-21 12:32:22 -08:00
Richard Weinberger
2397256d62 ext4: Remove kernel_lock annotations
The BKL is gone, these annotations are useless.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:50:09 -05:00
Theodore Ts'o
eb513689c9 ext4: ignore journalled data options on remount if fs has no journal
This avoids a confusing failure in the init scripts when the
/etc/fstab has data=writeback or data=journal but the file system does
not have a journal.  So check for this case explicitly, and warn the
user that we are ignoring the (pointless, since they have no journal)
data=* mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-11-07 10:47:42 -05:00
Eric Gouriou
6f91bc5fda ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.

In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.

Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.

Signed-off-by: Eric Gouriou <egouriou@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-27 11:43:23 -04:00
Fabrice Jouhaud
d44651d0f9 ext4: fix ext4 so it works without CONFIG_PROC_FS
This fixes a bug which was introduced in dd68314ccf.  The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.

Signed-off-by: Fabrice Jouhaud <yargil@free.fr>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 16:26:03 -04:00
Lukas Czerner
4113c4caa4 ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.

This is a part of the effort to reduce number of ext4 options hence the
test matrix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:34:47 -04:00
Tao Ma
dcf2d804ed ext4: Free resources in some error path in ext4_fill_super
Some of the error path in ext4_fill_super don't release the
resouces properly. So this patch just try to release them
in the right way.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 12:10:11 -04:00
Theodore Ts'o
5dee54372c ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:10:51 -04:00
Theodore Ts'o
021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Aditya Kali
7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o
f975d6bcc7 ext4: teach ext4_statfs() to deal with clusters if bigalloc is enabled
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:00:51 -04:00
Theodore Ts'o
24aaa8ef4e ext4: convert the free_blocks field in s_flex_groups to be free_clusters
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:58:51 -04:00
Theodore Ts'o
5704265188 ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:56:51 -04:00
Theodore Ts'o
bab08ab964 ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl.  This simplifies the initial implementation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:36:51 -04:00
Theodore Ts'o
281b599597 ext4: read-only support for bigalloc file systems
This adds supports for bigalloc file systems.  It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:34:51 -04:00
Theodore Ts'o
7c2e70879f ext4: add ext4-specific kludge to avoid an oops after the disk disappears
The del_gendisk() function uninitializes the disk-specific data
structures, including the bdi structure, without telling anyone
else.  Once this happens, any attempt to call mark_buffer_dirty()
(for example, by ext4_commit_super), will cause a kernel OOPS.

Fix this for now until we can fix things in an architecturally correct
way.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:28:51 -04:00
Theodore Ts'o
56889787cf ext4: improve handling of conflicting mount options
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).

Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported.  Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.

Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03 18:22:38 -04:00
Jiaying Zhang
2581fdc810 ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Also moves ext4_flush_completed_IO and ext4_ioend_wait from
ext4_destroy_inode() to ext4_evict_inode(). During inode deleteion,
ext4_evict_inode() is called before ext4_destroy_inode() and in
ext4_evict_inode(), we may call ext4_truncate() without holding
i_mutex lock. As a result, there is a race between flush_completed_IO
that is called from ext4_ext_truncate() and ext4_end_io_work, which
may cause corruption on an io_end structure. This change moves
ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
to ext4_evict_inode() to resolve the race between ext4_truncate() and
ext4_end_io_work during inode deletion.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-08-13 12:17:13 -04:00
Mathias Krause
db9481c047 ext4: use kzalloc in ext4_kzalloc()
Commit 9933fc0i (ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and
ext4_kvfree()) intruduced wrappers around k*alloc/vmalloc but introduced
a typo for ext4_kzalloc() by not using kzalloc() but kmalloc().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-03 14:57:11 -04:00
Theodore Ts'o
f18a5f21c2 ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:38 -04:00
Theodore Ts'o
9933fc0ac1 ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:02 -04:00
Yongqiang Yang
8f82f840ec ext4: prevent parallel resizers by atomic bit ops
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.

To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:35:44 -04:00
Dan Ehrenberg
3eb0865843 ext4: ignore a stripe width of 1
If the stripe width was set to 1, then this patch will ignore
that stripe width and ext4 will act as if the stripe width
were 0 with respect to optimizing allocations.

Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-17 21:18:51 -04:00
Theodore Ts'o
12706394bc ext4: add tracepoint for ext4_journal_start
This will help debug who is responsible for starting a jbd2 transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-10 22:37:50 -04:00
Lukas Czerner
f17722f917 ext4: Fix max file size and logical block counting of extent format file
Kazuya Mio reported that he was able to hit BUG_ON(next == lblock)
in ext4_ext_put_gap_in_cache() while creating a sparse file in extent
format and fill the tail of file up to its end. We will hit the BUG_ON
when we write the last block (2^32-1) into the sparse file.

The root cause of the problem lies in the fact that we specifically set
s_maxbytes so that block at s_maxbytes fit into on-disk extent format,
which is 32 bit long. However, we are not storing start and end block
number, but rather start block number and length in blocks. It means
that in order to cover extent from 0 to EXT_MAX_BLOCK we need
EXT_MAX_BLOCK+1 to fit into len (because we counting block 0 as well) -
and it does not.

The only way to fix it without changing the meaning of the struct
ext4_extent members is, as Kazuya Mio suggested, to lower s_maxbytes
by one fs block so we can cover the whole extent we can get by the
on-disk extent format.

Also in many places EXT_MAX_BLOCK is used as length instead of maximum
logical block number as the name suggests, it is all a bit messy. So
this commit renames it to EXT_MAX_BLOCKS and change its usage in some
places to actually be maximum number of blocks in the extent.

The bug which this commit fixes can be reproduced as follows:

 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-2))
 sync
 dd if=/dev/zero of=/mnt/mp1/file bs=<blocksize> count=1 seek=$((2**32-1))

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-06 00:05:17 -04:00
Linus Torvalds
f8d613e2a6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
  xen: cleancache shim to Xen Transcendent Memory
  ocfs2: add cleancache support
  ext4: add cleancache support
  btrfs: add cleancache support
  ext3: add cleancache support
  mm/fs: add hooks to support cleancache
  mm: cleancache core ops functions and config
  fs: add field to superblock to support cleancache
  mm/fs: cleancache documentation

Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
2011-05-26 10:50:56 -07:00
Dan Magenheimer
7abc52c2ed ext4: add cleancache support
This seventh patch of eight in this cleancache series "opts-in"
cleancache for ext4.  Filesystems must explicitly enable cleancache
by calling cleancache_init_fs anytime an instance of the filesystem
is mounted. For ext4, all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs
hook which must be called on unmount.

Details and a FAQ can be found in Documentation/vm/cleancache.txt

[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
2011-05-26 10:02:03 -06:00
Johann Lombardi
c5e06d101a ext4: add support for multiple mount protection
Prevent an ext4 filesystem from being mounted multiple times.
A sequence number is stored on disk and is periodically updated (every 5
seconds by default) by a mounted filesystem.
At mount time, we now wait for s_mmp_update_interval seconds to make sure
that the MMP sequence does not change.
In case of failure, the nodename, bdevname and the time at which the MMP
block was last updated is displayed.

Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:31:25 -04:00
Kazuya Mio
d02a9391f7 ext4: ensure f_bfree returned by ext4_statfs() is non-negative
I found the issue that the number of free blocks went negative.
# stat -f /mnt/mp1/
  File: "/mnt/mp1/"
    ID: e175ccb83a872efe Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 258022     Free: -15        Available: -13122
Inodes: Total: 65536      Free: 63029

f_bfree in struct statfs will go negative when the filesystem has
few free blocks. Because the number of dirty blocks is bigger than
the number of free blocks in the following two cases.

CASE 1:
ext4_da_writepages
  mpage_da_map_and_submit
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
        <--- interrupt statfs systemcall --->
        ext4_da_update_reserve_space
            percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                            used + ei->i_allocated_meta_blocks);

CASE 2:
ext4_write_begin
  __block_write_begin
    ext4_map_blocks
      ext4_ext_map_blocks
        ext4_mb_new_blocks
          ext4_mb_diskspace_used
            percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
            <--- interrupt statfs systemcall --->
            percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);

To avoid the issue, this patch ensures that f_bfree is non-negative.

Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
2011-05-24 18:30:07 -04:00
Vivek Haldar
77f4135f2a ext4: count hits/misses of extent cache and expose in sysfs
The number of hits and misses for each filesystem is exposed in
/sys/fs/ext4/<dev>/extent_cache_{hits, misses}.

Tested: fsstress, manual checks.
Signed-off-by: Vivek Haldar <haldar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 21:24:16 -04:00
Theodore Ts'o
373cd5c53d ext4: don't show mount options in /proc/mounts if there is no journal
After creating an ext4 file system without a journal:

  # mke2fs -t ext4 -O ^has_journal /dev/sda
  # mount -t ext4 /dev/sda /test

the /proc/mounts will show:
"/dev/sda /test ext4 rw,relatime,user_xattr,acl,barrier=1,data=writeback 0 0"
which can fool users into thinking that the fs is using writeback mode.

So don't set the writeback option when the journal has not been
enabled; we don't depend on the writeback option being set, since
ext4_should_writeback_data() in ext4_jbd2.h tests to see if the
journal is not present before returning true.

Reported-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-22 16:12:35 -04:00
Lukas Czerner
1bb933fb1f ext4: fix possible use-after-free in ext4_remove_li_request()
We need to take reference to the s_li_request after we take a mutex,
because it might be freed since then, hence result in accessing old
already freed memory. Also we should protect the whole
ext4_remove_li_request() because ext4_li_info might be in the process of
being freed in ext4_lazyinit_thread().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:29 -04:00
Lukas Czerner
51ce651156 ext4: fix the mount option "init_itable=n" to work as expected for n=0
For some reason, when we set the mount option "init_itable=0" it
behaves as we would set init_itable=20 which is not right at all.
Basically when we set it to zero we are saying to lazyinit thread not
to wait between zeroing the inode table (except of cond_resched()) so
this commit fixes that and removes the unnecessary condition.  The 'n'
should be also properly used on remount.

When the n is not set at all, it means that the default miltiplier
EXT4_DEF_LI_WAIT_MULT is set instead.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:55:16 -04:00
Lukas Czerner
e1290b3e62 ext4: Remove unnecessary wait_event ext4_run_lazyinit_thread()
For some reason we have been waiting for lazyinit thread to start in the
ext4_run_lazyinit_thread() but it is not needed since it was jus
unnecessary complexity, so get rid of it. We can also remove li_task and
li_wait_task since it is not used anymore.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:51 -04:00
Lukas Czerner
4ed5c033c1 ext4: Use schedule_timeout_interruptible() for waiting in lazyinit thread
In order to make lazyinit eat approx. 10% of io bandwidth at max, we
are sleeping between zeroing each single inode table. For that purpose
we are using timer which wakes up thread when it expires. It is set
via add_timer() and this may cause troubles in the case that thread
has been woken up earlier and in next iteration we call add_timer() on
still running timer hence hitting BUG_ON in add_timer(). We could fix
that by using mod_timer() instead however we can use
schedule_timeout_interruptible() for waiting and hence simplifying
things a lot.

This commit exchange the old "waiting mechanism" with simple
schedule_timeout_interruptible(), setting the time to sleep. Hence we
do not longer need li_wait_daemon waiting queue and others, so get rid
of it.

Addresses-Red-Hat-Bugzilla: #699708

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2011-05-20 13:49:04 -04:00
Tao Ma
ed3ce80a52 ext4: don't warn about mnt_count if it has been disabled
Currently, if we mkfs a new ext4 volume with s_max_mnt_count set to
zero, and mount it for the first time, we will get the warning:

	maximal mount count reached, running e2fsck is recommended

It is really misleading. So change the check so that it won't warn in
that case.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-18 13:29:57 -04:00
Amir Goldstein
0b26859027 ext4: fix oops in ext4_quota_off()
If quota is not enabled when ext4_quota_off() is called, we must not
dereference quota file inode since it is NULL.  Check properly for
this.

This fixes a bug in commit 21f976975c (ext4: remove unnecessary
[cm]time update of quota file), which was merged for 2.6.39-rc3.

Reported-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-16 09:59:13 -04:00
Amerigo Wang
66bb82798d ext4: remove redundant #ifdef in super.c
There is already an #ifdef CONFIG_QUOTA some lines above,
so this one is totally useless.

Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:30:41 -04:00
Tao Ma
55ff3840a2 ext4: remove redundant check for first_not_zeroed in ext4_register_li_request
We have checked first_not_zeroed == ngroups already above, so remove
this redundant check.

sbi->s_li_request = NULL above is also removed since it is NULL
already.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:28:41 -04:00
Theodore Ts'o
2035e77605 ext4: check for ext[23] file system features when mounting as ext[23]
Provide better emulation for ext[23] mode by enforcing that the file
system does not have any unsupported file system features as defined
by ext[23] when emulating the ext[23] file system driver when
CONFIG_EXT4_USE_FOR_EXT23 is defined.

This causes the file system type information in /proc/mounts to be
correct for the automatically mounted root file system.  This also
means that "mount -t ext2 /dev/sda /mnt" will fail if /dev/sda
contains an ext3 or ext4 file system, just as one would expect if the
original ext2 file system driver were in use.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-18 17:29:14 -04:00
Linus Torvalds
a97b52022a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix data corruption regression by reverting commit 6de9843dab
  ext4: Allow indirect-block file to grow the file size to max file size
  ext4: allow an active handle to be started when freezing
  ext4: sync the directory inode in ext4_sync_parent()
  ext4: init timer earlier to avoid a kernel panic in __save_error_info
  jbd2: fix potential memory leak on transaction commit
  ext4: fix a double free in ext4_register_li_request
  ext4: fix credits computing for indirect mapped files
  ext4: remove unnecessary [cm]time update of quota file
  jbd2: move bdget out of critical section
2011-04-11 15:45:47 -07:00
Yongqiang Yang
be4f27d324 ext4: allow an active handle to be started when freezing
ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen.  Otherwise, deadlock is easy to happen, below
is a situation.

================================================
     freeze         |       truncate
================================================
                    |  ext4_ext_truncate()
    freeze_super()  |   starts a handle
    sets s_frozen   |
                    |  ext4_ext_truncate()
                    |  holds i_data_sem
  ext4_freeze()     |
  waits for updates |
                    |  ext4_free_blocks()
                    |  calls dquot_free_block()
                    |
                    |  dquot_free_blocks()
                    |  calls ext4_dirty_inode()
                    |
                    |  ext4_dirty_inode()
                    |  trys to start an active
                    |  handle
                    |
                    |  block due to s_frozen
================================================

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Amir Goldstein <amir73il@users.sf.net>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2011-04-10 22:06:07 -04:00
Tao Ma
0449641130 ext4: init timer earlier to avoid a kernel panic in __save_error_info
During mount, when we fail to open journal inode or root inode, the
__save_error_info will mod_timer. But actually s_err_report isn't
initialized yet and the kernel oops. The detailed information can
be found https://bugzilla.kernel.org/show_bug.cgi?id=32082.

The best way is to check whether the timer s_err_report is initialized
or not. But it seems that in include/linux/timer.h, we can't find a
good function to check the status of this timer, so this patch just
move the initializtion of s_err_report earlier so that we can avoid
the kernel panic. The corresponding del_timer is also added in the
error path.

Reported-by: Sami Liedes <sliedes@cc.hut.fi>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-05 19:55:28 -04:00
Tao Ma
46e4690bbd ext4: fix a double free in ext4_register_li_request
In ext4_register_li_request, we malloc a ext4_li_request and
inserts it into ext4_li_info->li_request_list. In case of any
error later, we free it in the end.  But if we have some error
in ext4_run_lazyinit_thread, the whole li_request_list will be
dropped and freed in it. So we will double free this ext4_li_request.

This patch just sets elr to NULL after it is inserted to the list
so that the latter kfree won't double free it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-04-04 16:00:49 -04:00
Jan Kara
21f976975c ext4: remove unnecessary [cm]time update of quota file
It is not necessary to update [cm]time of quota file on each quota
file write and it wastes journal space and IO throughput with inode
writes. So just remove the updating from ext4_quota_write() and only
update times when quotas are being turned off. Userspace cannot get
anything reliable from quota files while they are used by the kernel
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-04 15:33:39 -04:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds
ae005cbed1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (43 commits)
  ext4: fix a BUG in mb_mark_used during trim.
  ext4: unused variables cleanup in fs/ext4/extents.c
  ext4: remove redundant set_buffer_mapped() in ext4_da_get_block_prep()
  ext4: add more tracepoints and use dev_t in the trace buffer
  ext4: don't kfree uninitialized s_group_info members
  ext4: add missing space in printk's in __ext4_grp_locked_error()
  ext4: add FITRIM to compat_ioctl.
  ext4: handle errors in ext4_clear_blocks()
  ext4: unify the ext4_handle_release_buffer() api
  ext4: handle errors in ext4_rename
  jbd2: add COW fields to struct jbd2_journal_handle
  jbd2: add the b_cow_tid field to journal_head struct
  ext4: Initialize fsync transaction ids in ext4_new_inode()
  ext4: Use single thread to perform DIO unwritten convertion
  ext4: optimize ext4_bio_write_page() when no extent conversion is needed
  ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
  ext4: use the nblocks arg to ext4_truncate_restart_trans()
  ext4: fix missing iput of root inode for some mount error paths
  ext4: make FIEMAP and delayed allocation play well together
  ext4: suppress verbose debugging information if malloc-debug is off
  ...

Fi up conflicts in fs/ext4/super.c due to workqueue changes
2011-03-25 09:57:41 -07:00
Robin Dong
21149d611e ext4: add missing space in printk's in __ext4_grp_locked_error()
When we do performence-testing on ext4 filesystem, we observed a
warning like this:

EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd

instead, it should be

"group 2598, 25901 blocks in bitmap, 26057 in gd"

Reviewed-by: Coly Li <bosong.ly@taobao.com>
Cc: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 20:39:22 -04:00
Linus Torvalds
bd2895eead Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix build failure introduced by s/freezeable/freezable/
  workqueue: add system_freezeable_wq
  rds/ib: use system_wq instead of rds_ib_fmr_wq
  net/9p: replace p9_poll_task with a work
  net/9p: use system_wq instead of p9_mux_wq
  xfs: convert to alloc_workqueue()
  reiserfs: make commit_wq use the default concurrency level
  ocfs2: use system_wq instead of ocfs2_quota_wq
  ext4: convert to alloc_workqueue()
  scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
  scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
  misc/iwmc3200top: use system_wq instead of dedicated workqueues
  i2o: use alloc_workqueue() instead of create_workqueue()
  acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
  fs/aio: aio_wq isn't used in memory reclaim path
  input/tps6507x-ts: use system_wq instead of dedicated workqueue
  cpufreq: use system_wq instead of dedicated workqueues
  wireless/ipw2x00: use system_wq instead of dedicated workqueues
  arm/omap: use system_wq in mailbox
  workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
2011-03-16 08:20:19 -07:00
Aneesh Kumar K.V
f2fa2ffc20 ext4: Copy fs UUID to superblock
File system UUID is made available to application
via  /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-15 02:21:45 -04:00
Mingming Cao
198868f35d ext4: Use single thread to perform DIO unwritten convertion
While running ext4 testing on multiple core, we found there are per
cpu ext4-dio-unwritten threads processing conversion from unwritten
extents to written for IOs completed from async direct IO patch.  Per
filesystem is enough, we don't need per cpu threads to work on
conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2011-03-05 11:52:45 -05:00
Amir Goldstein
d39195c33b ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.

This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-28 00:53:45 -05:00
Manish Katiyar
32a9bb57d7 ext4: fix missing iput of root inode for some mount error paths
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including call sync_filesystem, which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.

https://bugzilla.kernel.org/show_bug.cgi?id=26752

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-27 20:42:06 -05:00
Theodore Ts'o
6fd7a46781 ext4: enable mblk_io_submit by default
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 13:53:09 -05:00
Eric Sandeen
ea66333694 ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel.  At some point
we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:51:51 -05:00
Lukas Czerner
0b75a84012 ext4: mark file-local functions and variables as static
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov
5dbd571d87 ext4: allow inode_readahead_blks=0 (linux-2.6.37)
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):

# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks 
bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:33:21 -05:00
Peter Huewe
7dc576158d ext4: Fix sparse warning: Using plain integer as NULL pointer
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:01:42 -05:00
Tejun Heo
43d133c18b Merge branch 'master' into for-2.6.39 2011-02-21 09:43:56 +01:00
Eric Sandeen
e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
Theodore Ts'o
dd68314ccf ext4: fix up ext4 error handling
Make sure we the correct cleanup happens if we die while trying to
load the ext4 file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:49 -05:00
Lukas Czerner
8f021222c1 ext4: unregister features interface on module unload
Ext4 features interface was not properly unregistered which led to
problems while unloading/reloading ext4 module. This commit fixes that by
adding proper kobject unregistration code into ext4_exit_fs() as well as
fail-path of ext4_init_fs()

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-02-03 14:33:33 -05:00
Eric Sandeen
8f1f745331 ext4: fix panic on module unload when stopping lazyinit thread
https://bugzilla.kernel.org/show_bug.cgi?id=27652

If the lazyinit thread is running, the teardown function
ext4_destroy_lazyinit_thread() has problems:

        ext4_clear_request_list();
        while (ext4_li_info->li_task) {
                wake_up(&ext4_li_info->li_wait_daemon);
                wait_event(ext4_li_info->li_wait_task,
                           ext4_li_info->li_task == NULL);
        }

Clearing the request list will cause the thread to exit and free
ext4_li_info, so then we're waiting on something which is getting
freed.

Fix this up by making the thread respond to kthread_stop, and exit,
without the need to wait for that exit in some other homegrown way.

Cc: stable@kernel.org
Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-03 14:33:15 -05:00
Tejun Heo
fd89d5f203 ext4: convert to alloc_workqueue()
Convert create_workqueue() to alloc_workqueue().  This is an identity
conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
2011-02-01 11:42:42 +01:00
Linus Torvalds
4843456c5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Fix deadlock during path resolution
2011-01-21 07:33:37 -08:00
Linus Torvalds
275220f0fc Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
  block: ensure that completion error gets properly traced
  blktrace: add missing probe argument to block_bio_complete
  block cfq: don't use atomic_t for cfq_group
  block cfq: don't use atomic_t for cfq_queue
  block: trace event block fix unassigned field
  block: add internal hd part table references
  block: fix accounting bug on cross partition merges
  kref: add kref_test_and_get
  bio-integrity: mark kintegrityd_wq highpri and CPU intensive
  block: make kblockd_workqueue smarter
  Revert "sd: implement sd_check_events()"
  block: Clean up exit_io_context() source code.
  Fix compile warnings due to missing removal of a 'ret' variable
  fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
  block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
  cfq-iosched: don't check cfqg in choose_service_tree()
  fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
  cdrom: export cdrom_check_events()
  sd: implement sd_check_events()
  sr: implement sr_check_events()
  ...
2011-01-13 10:45:01 -08:00
Jan Kara
f00c9e44ad quota: Fix deadlock during path resolution
As Al Viro pointed out path resolution during Q_QUOTAON calls to quotactl
is prone to deadlocks. We hold s_umount semaphore for reading during the
path resolution and resolution itself may need to acquire the semaphore
for writing when e. g. autofs mountpoint is passed.

Solve the problem by performing the resolution before we get hold of the
superblock (and thus s_umount semaphore). The whole thing is complicated
by the fact that some filesystems (OCFS2) ignore the path argument. So to
distinguish between filesystem which want the path and which do not we
introduce new .quota_on_meta callback which does not get the path. OCFS2
then uses this callback instead of old .quota_on.

CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Christoph Hellwig <hch@lst.de>
CC: Ted Ts'o <tytso@mit.edu>
CC: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2011-01-12 19:14:55 +01:00
Linus Torvalds
e9688f6aca Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (44 commits)
  ext4: fix trimming starting with block 0 with small blocksize
  ext4: revert buggy trim overflow patch
  ext4: don't pass entire map to check_eofblocks_fl
  ext4: fix memory leak in ext4_free_branches
  ext4: remove ext4_mb_return_to_preallocation()
  ext4: flush the i_completed_io_list during ext4_truncate
  ext4: add error checking to calls to ext4_handle_dirty_metadata()
  ext4: fix trimming of a single group
  ext4: fix uninitialized variable in ext4_register_li_request
  ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
  ext4: drop i_state_flags on architectures with 64-bit longs
  ext4: reorder ext4_inode_info structure elements to remove unneeded padding
  ext4: drop ec_type from the ext4_ext_cache structure
  ext4: use ext4_lblk_t instead of sector_t for logical blocks
  ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
  ext4: fix 32bit overflow in ext4_ext_find_goal()
  ext4: add more error checks to ext4_mkdir()
  ext4: ext4_ext_migrate should use NULL not 0
  ext4: Use ext4_error_file() to print the pathname to the corrupted inode
  ext4: use IS_ERR() to check for errors in ext4_error_file
  ...
2011-01-11 14:37:31 -08:00
Andrew Morton
6c5a6cb998 ext4: fix uninitialized variable in ext4_register_li_request
fs/ext4/super.c: In function 'ext4_register_li_request':
fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function

It looks buggy to me, too.

Cc: Lukas Czerner <lczerner@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:30:17 -05:00
Theodore Ts'o
8aefcd557d ext4: dynamically allocate the jbd2_inode in ext4_inode_info as necessary
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing.  This allows us to further slim down the ext4_inode_info
structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:29:43 -05:00
Theodore Ts'o
f232109773 ext4: replace i_delalloc_reserved_flag with EXT4_STATE_DELALLOC_RESERVED
Remove the short element i_delalloc_reserved_flag from the
ext4_inode_info structure and replace it a new bit in i_state_flags.
Since we have an ext4_inode_info for every ext4 inode cached in the
inode cache, any savings we can produce here is a very good thing from
a memory utilization perspective.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:12:36 -05:00
Theodore Ts'o
f7c21177af ext4: Use ext4_error_file() to print the pathname to the corrupted inode
Where the file pointer is available, use ext4_error_file() instead of
ext4_error_inode().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-01-10 12:10:55 -05:00
Dan Carpenter
f9a62d090c ext4: use IS_ERR() to check for errors in ext4_error_file
d_path() returns an ERR_PTR and it doesn't return NULL.  This is in
ext4_error_file() and no one actually calls ext4_error_file().

Signed-off-by: Dan Carpenter <error27@gmail.com>
2011-01-10 12:10:50 -05:00
Nick Piggin
fa0d7e3de6 fs: icache RCU free inodes
RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
  permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
  to take i_lock no longer need to take sb_inode_list_lock to walk the list in
  the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
  page lock to follow page->mapping.

The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:26 +11:00
Joe Perches
0ff2ea7d84 ext4: Use printf extension %pV
Using %pV reduces the number of printk calls and eliminates any
possible message interleaving from other printk calls.

In function __ext4_grp_locked_error also added KERN_CONT to some
printks.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:43:19 -05:00
Joe Perches
94de56ab20 ext4: Use vzalloc in ext4_fill_flex_info()
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-19 22:21:02 -05:00
Theodore Ts'o
a2595b8aa6 ext4: Add second mount options field since the s_mount_opt is full up
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:30:48 -05:00
Theodore Ts'o
673c610033 ext4: Move struct ext4_mount_options from ext4.h to super.c
Move the ext4_mount_options structure definition from ext4.h, since it
is only used in super.c.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:28:48 -05:00
Theodore Ts'o
fd8c37eccd ext4: Simplify the usage of clear_opt() and set_opt() macros
Change clear_opt() and set_opt() to take a superblock pointer instead
of a pointer to EXT4_SB(sb)->s_mount_opt.  This makes it easier for us
to support a second mount option field.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-15 20:26:48 -05:00
Theodore Ts'o
1449032be1 ext4: Turn off multiple page-io submission by default
Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zero's.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system just at the end of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 10000000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson <jnelson@jamponi.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-12-14 15:27:50 -05:00
Lukas Czerner
93bb41f4f8 fs: Do not dispatch FITRIM through separate super_operation
There was concern that FITRIM ioctl is not common enough to be included
in core vfs ioctl, as Christoph Hellwig pointed out there's no real point
in dispatching this out to a separate vector instead of just through
->ioctl.

So this commit removes ioctl_fstrim() from vfs ioctl and trim_fs
from super_operation structure.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 21:18:35 -05:00
Darrick J. Wong
5a9ae68a34 ext4: ext4_fill_super shouldn't return 0 on corruption
At the start of ext4_fill_super, ret is set to -EINVAL, and any failure path
out of that function returns ret.  However, the generic_check_addressable
clause sets ret = 0 (if it passes), which means that a subsequent failure (e.g.
a group checksum error) returns 0 even though the mount should fail.  This
causes vfs_kern_mount in turn to think that the mount succeeded, leading to an
oops.

A simple fix is to avoid using ret for the generic_check_addressable check,
which was last changed in commit 30ca22c70e.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-19 09:56:44 -05:00
Dan Carpenter
f4c8cc652d ext4: missing unlock in ext4_clear_request_list()
If the the li_request_list was empty then it returned with the lock
held.  Instead of adding a "goto unlock" I just removed that special
case and let it go past the empty list_for_each_safe().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-17 21:46:25 -05:00
Tejun Heo
d4d7762995 block: clean up blkdev_get() wrappers and their users
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.

All users are converted.  Most conversions are mechanical and don't
introduce any behavior difference.  There are several exceptions.

* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
  reason to OR it explicitly on blkdev_put().

* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
  sb->s_mode.

* With the above changes, sb->s_mode now always should contain
  FMODE_EXCL.  WARN_ON_ONCE() added to kill_block_super() to detect
  errors.

The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:18 +01:00
Tejun Heo
e525fd89d3 block: make blkdev_get/put() handle exclusive access
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.

* blkdev_get/put() are the primary open and close functions.

* bd_claim/release() deal with exclusive open.

* open/close_bdev_exclusive() are combination of open and claim and
  the other way around, respectively.

* bd_link/unlink_disk_holder() to create and remove holder/slave
  symlinks.

* open_by_devnum() wraps bdget() + blkdev_get().

The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access.  Reorganize the interface such that,

* blkdev_get() is extended to include exclusive access management.
  @holder argument is added and, if is @FMODE_EXCL specified, it will
  gain exclusive access atomically w.r.t. other exclusive accesses.

* blkdev_put() is similarly extended.  It now takes @mode argument and
  if @FMODE_EXCL is set, it releases an exclusive access.  Also, when
  the last exclusive claim is released, the holder/slave symlinks are
  removed automatically.

* bd_claim/release() and close_bdev_exclusive() are no longer
  necessary and either made static or removed.

* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
  is no longer necessary and removed.

* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
  and blkdev_get().  It also has an unexpected extra bdev_read_only()
  test which probably should be moved into blkdev_get().

* open_by_devnum() is modified to take @holder argument and pass it to
  blkdev_get().

Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should).  This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.

open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features.  Well, let's leave them for another day.

Most conversions are straight-forward.  drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-11-13 11:55:17 +01:00
Theodore Ts'o
7ff9c073dd ext4: Add new ext4 inode tracepoints
Add ext4_evict_inode, ext4_drop_inode, ext4_mark_inode_dirty, and
ext4_begin_ordered_truncate()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-08 13:51:33 -05:00
Dmitry Monakhov
87009d86dc ext4: do not try to grab the s_umount semaphore in ext4_quota_off
It's not needed to sync the filesystem, and it fixes a lock_dep complaint.

Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2010-11-08 13:47:33 -05:00
Theodore Ts'o
f7ad6d2e92 ext4: handle writeback of inodes which are being freed
The following BUG can occur when an inode which is getting freed when
it still has dirty pages outstanding, and it gets deleted (in this
because it was the target of a rename).  In ordered mode, we need to
make sure the data pages are written just in case we crash before the
rename (or unlink) is committed.  If the inode is being freed then
when we try to igrab the inode, we end up tripping the BUG_ON at
fs/ext4/page-io.c:146.

To solve this problem, we need to keep track of the number of io
callbacks which are pending, and avoid destroying the inode until they
have all been completed.  That way we don't have to bump the inode
count to keep the inode from being destroyed; an approach which
doesn't work because the count could have already been dropped down to
zero before the inode writeback has started (at which point we're not
allowed to bump the count back up to 1, since it's already started
getting freed).

Thanks to Dave Chinner for suggesting this approach, which is also
used by XFS.

  kernel BUG at /scratch_space/linux-2.6/fs/ext4/page-io.c:146!
  Call Trace:
   [<ffffffff811075b1>] ext4_bio_write_page+0x172/0x307
   [<ffffffff811033a7>] mpage_da_submit_io+0x2f9/0x37b
   [<ffffffff811068d7>] mpage_da_map_and_submit+0x2cc/0x2e2
   [<ffffffff811069b3>] mpage_add_bh_to_extent+0xc6/0xd5
   [<ffffffff81106c66>] write_cache_pages_da+0x2a4/0x3ac
   [<ffffffff81107044>] ext4_da_writepages+0x2d6/0x44d
   [<ffffffff81087910>] do_writepages+0x1c/0x25
   [<ffffffff810810a4>] __filemap_fdatawrite_range+0x4b/0x4d
   [<ffffffff810815f5>] filemap_fdatawrite_range+0xe/0x10
   [<ffffffff81122a2e>] jbd2_journal_begin_ordered_truncate+0x7b/0xa2
   [<ffffffff8110615d>] ext4_evict_inode+0x57/0x24c
   [<ffffffff810c14a3>] evict+0x22/0x92
   [<ffffffff810c1a3d>] iput+0x212/0x249
   [<ffffffff810bdf16>] dentry_iput+0xa1/0xb9
   [<ffffffff810bdf6b>] d_kill+0x3d/0x5d
   [<ffffffff810be613>] dput+0x13a/0x147
   [<ffffffff810b990d>] sys_renameat+0x1b5/0x258
   [<ffffffff81145f71>] ? _atomic_dec_and_lock+0x2d/0x4c
   [<ffffffff810b2950>] ? cp_new_stat+0xde/0xea
   [<ffffffff810b29c1>] ? sys_newlstat+0x2d/0x38
   [<ffffffff810b99c6>] sys_rename+0x16/0x18
   [<ffffffff81002a2b>] system_call_fastpath+0x16/0x1b

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
2010-11-08 13:43:33 -05:00
Theodore Ts'o
ce7e010aef ext4: initialize the percpu counters before replaying the journal
We now initialize the percpu counters before replaying the journal,
but after the journal, we recalculate the global counters, to deal
with the possibility of the per-blockgroup counts getting updated by
the journal replay.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-03 12:03:21 -04:00
Theodore Ts'o
b2c78cd09b ext4: "ret" may be used uninitialized in ext4_lazyinit_thread()
Newer GCC's reported the following build warning:

   fs/ext4/super.c: In function 'ext4_lazyinit_thread':
   fs/ext4/super.c:2702: warning: 'ret' may be used uninitialized in this function

Fix it by removing the need for the ret variable in the first place.

Signed-off-by: "Lukas Czerner" <lczerner@redhat.com>
Reported-by: "Stefan Richter" <stefanr@s5r6.in-berlin.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:19:30 -04:00
Lukas Czerner
f4245bd4eb ext4: fix lazyinit hang after removing request
When the request has been removed from the list and no other request
has been issued, we will end up with next wakeup scheduled to
MAX_JIFFY_OFFSET which is bad. So check for that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-11-02 14:07:17 -04:00
Al Viro
152a083666 new helper: mount_bdev()
... and switch of the obvious get_sb_bdev() users to ->mount()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:16:13 -04:00
Theodore Ts'o
a107e5a3a4 Merge branch 'next' into upstream-merge
Conflicts:
	fs/ext4/inode.c
	fs/ext4/mballoc.c
	include/trace/events/ext4.h
2010-10-27 23:44:47 -04:00
Nicolas Kaiser
beed5ecbaa ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 22:08:42 -04:00
Theodore Ts'o
1f109d5a17 ext4: make various ext4 functions be static
These functions have no need to be exported beyond file context.

No functions needed to be moved for this commit; just some function
declarations changed to be static and removed from header files.

(A similar patch was submitted by Eric Sandeen, but I wanted to handle
code movement in separate patches to make sure code changes didn't
accidentally get dropped.)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
5dabfc78dc ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()
This is a cleanup to avoid namespace leaks out of fs/ext4

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:14 -04:00
Theodore Ts'o
7f93cff90f ext4: fix kernel oops if the journal superblock has a non-zero j_errno
Commit 84061e0 fixed an accounting bug only to introduce the
possibility of a kernel OOPS if the journal has a non-zero j_errno
field indicating that the file system had detected a fs inconsistency.
After the journal replay, if the journal superblock indicates that the
file system has an error, this indication is transfered to the file
system and then ext4_commit_super() is called to write this to the
disk.

But since the percpu counters are now initialized after the journal
replay, the call to ext4_commit_super() will cause a kernel oops since
it needs to use the percpu counters the ext4 superblock structure.

The fix is to skip setting the ext4 free block and free inode fields
if the percpu counter has not been set.

Thanks to Ken Sumrall for reporting and analyzing the root causes of
this bug.

Addresses-Google-Bug: #3054080

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:13 -04:00
Lukas Czerner
27ee40df2b ext4: add batched_discard into ext4 feature list
Should be applied on the top of "lazy inode table initialization"
and "batched discard support" patch-sets.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Lukas Czerner
7360d1731e ext4: Add batched discard support for ext4
Walk through allocation groups and trim all free extents. It can be
invoked through FITRIM ioctl on the file system. The main idea is to
provide a way to trim the whole file system if needed, since some SSD's
may suffer from performance loss after the whole device was filled (it
does not mean that fs is full!).

It search for free extents in allocation groups specified by Byte range
start -> start+len. When the free extent is within this range, blocks
are marked as used and then trimmed. Afterwards these blocks are marked
as free in per-group bitmap.

Since fstrim is a long operation it is good to have an ability to
interrupt it by a signal. This was added by Dmitry Monakhov.
Thanks Dimitry.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:12 -04:00
Theodore Ts'o
bd2d0210cf ext4: use bio layer instead of buffer layer in mpage_da_submit_io
Call the block I/O layer directly instad of going through the buffer
layer.  This should give us much better performance and scalability,
as well as lowering our CPU utilization when doing buffered writeback.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:10 -04:00
Maciej Żenczykowski
c41303ced6 ext4: don't update sb journal_devnum when RO dev
An ext4 filesystem on a read-only device, with an external journal
which is at a different device number then recorded in the superblock
will fail to honor the read-only setting of the device and trigger
a superblock update (write).

For example:
  - ext4 on a software raid which is in read-only mode
  - external journal on a read-write device which has changed device num
  - attempt to mount with -o journal_dev=<new_number>
  - hits BUG_ON(mddev->ro = 1) in md.c

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:06 -04:00
Lukas Czerner
857ac889cc ext4: add interface to advertise ext4 features in sysfs
User-space should have the opportunity to check what features doest ext4
support in each particular copy. This adds easy interface by creating new
"features" directory in sys/fs/ext4/. In that directory files
advertising feature names can be created.

Add lazy_itable_init to the feature list.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Lukas Czerner
bfff68738f ext4: add support for lazy inode table initialization
When the lazy_itable_init extended option is passed to mke2fs, it
considerably speeds up filesystem creation because inode tables are
not zeroed out.  The fact that parts of the inode table are
uninitialized is not a problem so long as the block group descriptors,
which contain information regarding how much of the inode table has
been initialized, has not been corrupted However, if the block group
checksums are not valid, e2fsck must scan the entire inode table, and
the the old, uninitialized data could potentially cause e2fsck to
report false problems.

Hence, it is important for the inode tables to be initialized as soon
as possble.  This commit adds this feature so that mke2fs can safely
use the lazy inode table initialization feature to speed up formatting
file systems.

This is done via a new new kernel thread called ext4lazyinit, which is
created on demand and destroyed, when it is no longer needed.  There
is only one thread for all ext4 filesystems in the system. When the
first filesystem with inititable mount option is mounted, ext4lazyinit
thread is created, then the filesystem can register its request in the
request list.

This thread then walks through the list of requests picking up
scheduled requests and invoking ext4_init_inode_table(). Next schedule
time for the request is computed by multiplying the time it took to
zero out last inode table with wait multiplier, which can be set with
the (init_itable=n) mount option (default is 10).  We are doing
this so we do not take the whole I/O bandwidth. When the thread is no
longer necessary (request list is empty) it frees the appropriate
structures and exits (and can be created later later by another
filesystem).

We do not disturb regular inode allocations in any way, it just do not
care whether the inode table is, or is not zeroed. But when zeroing, we
have to skip used inodes, obviously. Also we should prevent new inode
allocations from the group, while zeroing is on the way. For that we
take write alloc_sem lock in ext4_init_inode_table() and read alloc_sem
in the ext4_claim_inode, so when we are unlucky and allocator hits the
group which is currently being zeroed, it just has to wait.

This can be suppresed using the mount option no_init_itable.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:05 -04:00
Sergey Senozhatsky
a1c6c5698d ext4: fix NULL pointer dereference in print_daily_error_info()
Fix NULL pointer dereference in print_daily_error_info, when   
called on unmounted fs (EXT4_SB(sb) returns NULL), by removing error 
reporting timer in ext4_put_super.

Google-Bug-Id: 3017663

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-10-27 21:30:04 -04:00
Linus Torvalds
79f14b7c56 Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits)
  BKL: remove BKL from freevxfs
  BKL: remove BKL from qnx4
  autofs4: Only declare function when CONFIG_COMPAT is defined
  autofs: Only declare function when CONFIG_COMPAT is defined
  ncpfs: Lock socket in ncpfs while setting its callbacks
  fs/locks.c: prepare for BKL removal
  BKL: Remove BKL from ncpfs
  BKL: Remove BKL from OCFS2
  BKL: Remove BKL from squashfs
  BKL: Remove BKL from jffs2
  BKL: Remove BKL from ecryptfs
  BKL: Remove BKL from afs
  BKL: Remove BKL from USB gadgetfs
  BKL: Remove BKL from autofs4
  BKL: Remove BKL from isofs
  BKL: Remove BKL from fat
  BKL: Remove BKL from ext2 filesystem
  BKL: Remove BKL from do_new_mount()
  BKL: Remove BKL from cgroup
  BKL: Remove BKL from NTFS
  ...
2010-10-22 10:52:01 -07:00
Jan Blunck
f2143c4e2e BKL: Remove BKL from ext4 filesystem
The BKL is still used in ext4_put_super(), ext4_fill_super() and
ext4_remount(). All three calles are protected against concurrent calls by
the s_umount rw semaphore of struct super_block.

Therefore the BKL is protecting nothing in this case.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04 21:10:38 +02:00
Jan Blunck
db71922217 BKL: Explicitly add BKL around get_sb/fill_super
This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
       don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-04 21:10:10 +02:00
Patrick J. LoPresti
30ca22c70e ext3/ext4: Factor out disk addressability check
As part of adding support for OCFS2 to mount huge volumes, we need to
check that the sector_t and page cache of the system are capable of
addressing the entire volume.

An identical check already appears in ext3 and ext4.  This patch moves
the addressability check into its own function in fs/libfs.c and
modifies ext3 and ext4 to invoke it.

[Edited to -EINVAL instead of BUG_ON() for bad blocksize_bits -- Joel]

Signed-off-by: Patrick LoPresti <lopresti@gmail.com>
Cc: linux-ext4@vger.kernel.org
Acked-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-09-10 08:41:42 -07:00
Linus Torvalds
5f248c9c25 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
  no need for list_for_each_entry_safe()/resetting with superblock list
  Fix sget() race with failing mount
  vfs: don't hold s_umount over close_bdev_exclusive() call
  sysv: do not mark superblock dirty on remount
  sysv: do not mark superblock dirty on mount
  btrfs: remove junk sb_dirt change
  BFS: clean up the superblock usage
  AFFS: wait for sb synchronization when needed
  AFFS: clean up dirty flag usage
  cifs: truncate fallout
  mbcache: fix shrinker function return value
  mbcache: Remove unused features
  add f_flags to struct statfs(64)
  pass a struct path to vfs_statfs
  update VFS documentation for method changes.
  All filesystems that need invalidate_inode_buffers() are doing that explicitly
  convert remaining ->clear_inode() to ->evict_inode()
  Make ->drop_inode() just return whether inode needs to be dropped
  fs/inode.c:clear_inode() is gone
  fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
  ...

Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-10 11:26:52 -07:00
Al Viro
0930fcc1ee convert ext4 to ->evict_inode()
pretty much brute-force...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:30 -04:00
Linus Torvalds
09dc942c2a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
  ext4: Adding error check after calling ext4_mb_regular_allocator()
  ext4: Fix dirtying of journalled buffers in data=journal mode
  ext4: re-inline ext4_rec_len_(to|from)_disk functions
  jbd2: Remove t_handle_lock from start_this_handle()
  jbd2: Change j_state_lock to be a rwlock_t
  jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
  ext4: Add mount options in superblock
  ext4: force block allocation on quota_off
  ext4: fix freeze deadlock under IO
  ext4: drop inode from orphan list if ext4_delete_inode() fails
  ext4: check to make make sure bd_dev is set before dereferencing it
  jbd2: Make barrier messages less scary
  ext4: don't print scary messages for allocation failures post-abort
  ext4: fix EFBIG edge case when writing to large non-extent file
  ext4: fix ext4_get_blocks references
  ext4: Always journal quota file modifications
  ext4: Fix potential memory leak in ext4_fill_super
  ext4: Don't error out the fs if the user tries to make a file too big
  ext4: allocate stripe-multiple IOs on stripe boundaries
  ext4: move aio completion after unwritten extent conversion
  ...

Fix up conflicts in fs/ext4/inode.c as per Ted.

Fix up xfs conflicts as per earlier xfs merge.
2010-08-07 13:03:53 -07:00
Theodore Ts'o
a931da6ac9 jbd2: Change j_state_lock to be a rwlock_t
Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores.  So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-03 21:35:12 -04:00
Theodore Ts'o
8b67f04ab9 ext4: Add mount options in superblock
Allow mount options to be stored in the superblock.  Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 23:14:20 -04:00
Dmitry Monakhov
ca0e05e4b1 ext4: force block allocation on quota_off
Perform full sync procedure so that any delayed allocation blocks are
allocated so quota will be consistent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:48:36 -04:00
Eric Sandeen
437f88cc03 ext4: fix freeze deadlock under IO
Commit 6b0310fbf0 caused a regression resulting in deadlocks
when freezing a filesystem which had active IO; the vfs_check_frozen
level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
through.  Duh.

Changing the test to FREEZE_TRANS should let the normal freeze
syncing get through the fs, but still block any transactions from
starting once the fs is completely frozen.

I tested this by running fsstress in the background while periodically
snapshotting the fs and running fsck on the result.  I ran into
occasional deadlocks, but different ones.  I think this is a
fine fix for the problem at hand, and the other deadlocky things
will need more investigation.

Reported-by: Phillip Susi <psusi@cfl.rr.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:33:29 -04:00
Theodore Ts'o
f613dfcb33 ext4: check to make make sure bd_dev is set before dereferencing it
There are some drivers which may not set bdev->bd_dev.  So make sure
it is non-NULL before dereferencing it.

Google-Bug-Id: 1773557

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Jan Kara
62d2b5f2dc ext4: Always journal quota file modifications
When journaled quota options are not specified, we do writes
to quota files just in data=ordered mode. This actually causes
warnings from JBD2 about dirty journaled buffer because ext4_getblk
unconditionally treats a block allocated by it as metadata. Since
quota actually is filesystem metadata, the easiest way to get rid
of the warning is to always treat quota writes as metadata...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Cyrill Gorcunov
dcc7dae3cb ext4: Fix potential memory leak in ext4_fill_super
Under heavy memory pressure we may hit out of memory
situation and as result kstrdup'ed options will not be
freed. Fix it.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Theodore Ts'o
66e61a9e95 ext4: Once a day, printk file system error information to dmesg
This allows us to grab any file system error messages by scraping
/var/log/messages.  This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
1c13d5c087 ext4: Save error information to the superblock for analysis
Save number of file system errors, and the time function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:03 -04:00
Theodore Ts'o
c398eda0e4 ext4: Pass line numbers to ext4_error() and friends
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:40 -04:00
Theodore Ts'o
90c7201b97 ext4: Pass line number to ext4_journal_abort_handle()
This allows the error messages to include the line number

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 14:53:24 -04:00
Theodore Ts'o
e29136f80e ext4: Enhance ext4_grp_locked_error() to take block and function numbers
Also use a macro definition so that __func__ and __LINE__ is implicit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 12:54:28 -04:00
Theodore Ts'o
c67d859e39 ext4: clean up ext4_abort() so __func__ is now implicit
Use a macro definition for ext4_abort() to clean up the .c files a wee
bit.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-29 11:07:07 -04:00
Jiri Kosina
f1bbbb6912 Merge branch 'master' into for-next 2010-06-16 18:08:13 +02:00
Uwe Kleine-König
421f91d21a fix typos concerning "initiali[zs]e"
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-06-16 18:05:05 +02:00
Christoph Hellwig
206f7ab4f4 ext4: remove vestiges of nobh support
The nobh option was only supported for writeback mode, but given that all
write paths actually create buffer heads it effectively was a no-op already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-06-14 14:42:49 -04:00
Linus Torvalds
d28619f156 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  quota: Convert quota statistics to generic percpu_counter
  ext3 uses rb_node = NULL; to zero rb_root.
  quota: Fixup dquot_transfer
  reiserfs: Fix resuming of quotas on remount read-write
  pohmelfs: Remove dead quota code
  ufs: Remove dead quota code
  udf: Remove dead quota code
  quota: rename default quotactl methods to dquot_
  quota: explicitly set ->dq_op and ->s_qcop
  quota: drop remount argument to ->quota_on and ->quota_off
  quota: move unmount handling into the filesystem
  quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
  quota: move remount handling into the filesystem
  ocfs2: Fix use after free on remount read-only

Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c
2010-05-30 09:11:11 -07:00
Christoph Hellwig
287a80958c quota: rename default quotactl methods to dquot_
Follow the dquot_* style used elsewhere in dquot.c.

[Jan Kara: Fixed up missing conversion of ext2]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:10:17 +02:00
Christoph Hellwig
307ae18a56 quota: drop remount argument to ->quota_on and ->quota_off
Remount handling has fully moved into the filesystem, so all this is
superflous now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
e0ccfd959c quota: move unmount handling into the filesystem
Currently the VFS calls into the quotactl interface for unmounting
filesystems.  This means filesystems with their own quota handling
can't easily distinguish between user-space originating quotaoff
and an unount.  Instead move the responsibily of the unmount handling
into the filesystem to be consistent with all other dquot handling.

Note that we do call dquot_disable a lot later now, e.g. after
a sync_filesystem.  But this is fine as the quota code does all its
writes via blockdev's mapping and that is synced even later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:09:12 +02:00
Christoph Hellwig
0f0dd62fdd quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
Instead of having wrappers in the VFS namespace export the dquot_suspend
and dquot_resume helpers directly.  Also rename vfs_quota_disable to
dquot_disable while we're at it.

[Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:40 +02:00
Christoph Hellwig
c79d967de3 quota: move remount handling into the filesystem
Currently do_remount_sb calls into the dquot code to tell it about going
from rw to ro and ro to rw.  Move this code into the filesystem to
not depend on the dquot code in the VFS - note ocfs2 already ignores
these calls and handles remount by itself.  This gets rid of overloading
the quotactl calls and allows to unify the VFS and XFS codepaths in
that area later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-05-24 14:06:39 +02:00
Theodore Ts'o
60e6679e28 ext4: Drop whitespace at end of lines
This patch was generated using:

#!/usr/bin/perl -i
while (<>) {
    s/[ 	]+$//;
    print;
}

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-17 07:00:00 -04:00
Dmitry Monakhov
12e9b89200 ext4: Use bitops to read/modify i_flags in struct ext4_inode_info
At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.

https://bugzilla.kernel.org/show_bug.cgi?id=15792

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 22:00:00 -04:00
Jan Kara
39a4bade8c ext4: Show journal_checksum option
We failed to show journal_checksum option in /proc/mounts. Fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 17:00:00 -04:00
Curt Wohlgemuth
fbe845ddf3 ext4: Remove extraneous newlines in ext4_msg() calls
Addresses-Google-Bug: #2562325

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 13:00:00 -04:00
Curt Wohlgemuth
d4c402d9fd ext4: Print mount options in when mounting and add a remount message
This adds a "re-mounted" message to ext4_remount(), and both it and
the mount message in ext4_fill_super() now have the original mount
options data string.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 12:00:00 -04:00
Dmitry Monakhov
84061e07c5 ext4: init statistics after journal recovery
Currently block/inode/dir counters initialized before journal was
recovered. In fact after journal recovery this info will probably
change. And freeblocks it critical for correct delalloc mode
accounting.

https://bugzilla.kernel.org/show_bug.cgi?id=15768

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 08:00:00 -04:00
Eric Sandeen
6b0310fbf0 ext4: don't return to userspace after freezing the fs with a mutex held
ext4_freeze() used jbd2_journal_lock_updates() which takes
the j_barrier mutex, and then returns to userspace.  The
kernel does not like this:

================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
 #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0

Use vfs_check_frozen() added to ext4_journal_start_sb() and
ext4_force_commit() instead.

Addresses-Red-Hat-Bugzilla: #568503

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-05-16 02:00:00 -04:00
Linus Torvalds
39f1cd635c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fixed inode allocator to correctly track a flex_bg's used_dirs
  ext4: Don't use delayed allocation by default when used instead of ext3
  ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
  ext4: Fix estimate of # of blocks needed to write indirect-mapped files
2010-03-25 14:10:53 -07:00
Jan Kara
ba69f9ab7d ext4: Don't use delayed allocation by default when used instead of ext3
When ext4 driver is used to mount a filesystem instead of the ext3 file
system driver (through CONFIG_EXT4_USE_FOR_EXT23), do not enable delayed
allocation by default since some ext3 users and application writers have
developed unfortunate expectations about the safety of writing files on
systems subject to sudden and violent death without using fsync().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:18:37 -04:00
Theodore Ts'o
37f328eb60 ext4: Fix spelling of CONTIG_FS_EXT3 to CONFIG_FS_EXT3
Oops.  (Blush.)

Thanks to Sedat Dilek for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-24 20:06:41 -04:00
Linus Torvalds
c32da02342 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
  doc: fix typo in comment explaining rb_tree usage
  Remove fs/ntfs/ChangeLog
  doc: fix console doc typo
  doc: cpuset: Update the cpuset flag file
  Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
  Remove drivers/parport/ChangeLog
  Remove drivers/char/ChangeLog
  doc: typo - Table 1-2 should refer to "status", not "statm"
  tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
  No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
  devres/irq: Fix devm_irq_match comment
  Remove reference to kthread_create_on_cpu
  tree-wide: Assorted spelling fixes
  tree-wide: fix 'lenght' typo in comments and code
  drm/kms: fix spelling in error message
  doc: capitalization and other minor fixes in pnp doc
  devres: typo fix s/dev/devm/
  Remove redundant trailing semicolons from macros
  fix typo "definetly" -> "definitely" in comment
  tree-wide: s/widht/width/g typo in comments
  ...

Fix trivial conflict in Documentation/laptops/00-INDEX
2010-03-12 16:04:50 -08:00
Jiri Kosina
318ae2edc3 Merge branch 'for-next' into for-linus
Conflicts:
	Documentation/filesystems/proc.txt
	arch/arm/mach-u300/include/mach/debug-macro.S
	drivers/net/qlge/qlge_ethtool.c
	drivers/net/qlge/qlge_main.c
	drivers/net/typhoon.c
2010-03-08 16:55:37 +01:00
Emese Revfy
52cf25d0ab Driver core: Constify struct sysfs_ops in struct kobj_type
Constify struct sysfs_ops.

This is part of the ops structure constification
effort started by Arjan van de Ven et al.

Benefits of this constification:

 * prevents modification of data that is shared
   (referenced) by many other structure instances
   at runtime

 * detects/prevents accidental (but not intentional)
   modification attempts on archs that enforce
   read-only kernel data at runtime

 * potentially better optimized code as the compiler
   can assume that the const data cannot be changed

 * the compiler/linker move const data into .rodata
   and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-03-07 17:04:49 -08:00
Linus Torvalds
e213e26ab3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
  quota: stop using QUOTA_OK / NO_QUOTA
  dquot: cleanup dquot initialize routine
  dquot: move dquot initialization responsibility into the filesystem
  dquot: cleanup dquot drop routine
  dquot: move dquot drop responsibility into the filesystem
  dquot: cleanup dquot transfer routine
  dquot: move dquot transfer responsibility into the filesystem
  dquot: cleanup inode allocation / freeing routines
  dquot: cleanup space allocation / freeing routines
  ext3: add writepage sanity checks
  ext3: Truncate allocated blocks if direct IO write fails to update i_size
  quota: Properly invalidate caches even for filesystems with blocksize < pagesize
  quota: generalize quota transfer interface
  quota: sb_quota state flags cleanup
  jbd: Delay discarding buffers in journal_unmap_buffer
  ext3: quota_write cross block boundary behaviour
  quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
  quota: split out compat_sys_quotactl support from quota.c
  quota: split out netlink notification support from quota.c
  quota: remove invalid optimization from quota_sync_all
  ...

Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
2010-03-05 13:20:53 -08:00
Christoph Hellwig
871a293155 dquot: cleanup dquot initialize routine
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
9f75475802 dquot: cleanup dquot drop routine
Get rid of the drop dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_drop helper to __dquot_drop
and vfs_dq_drop to dquot_drop to have a consistent namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:30 +01:00
Christoph Hellwig
257ba15ced dquot: move dquot drop responsibility into the filesystem
Currently clear_inode calls vfs_dq_drop directly.  This means
we tie the quota code into the VFS.  Get rid of that and make the
filesystem responsible for the drop inside the ->clear_inode
superblock operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
b43fa8284d dquot: cleanup dquot transfer routine
Get rid of the transfer dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.

Rename the now static low-level dquot_transfer helper to __dquot_transfer
and vfs_dq_transfer to dquot_transfer to have a consistent namespace,
and make the new dquot_transfer return a normal negative errno value
which all callers expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:29 +01:00
Christoph Hellwig
63936ddaa1 dquot: cleanup inode allocation / freeing routines
Get rid of the alloc_inode and free_inode dquot operations - they are
always called from the filesystem and if a filesystem really needs
their own (which none currently does) it can just call into it's
own routine directly.

Also get rid of the vfs_dq_alloc/vfs_dq_free wrappers and always
call the lowlevel dquot_alloc_inode / dqout_free_inode routines
directly, which now lose the number argument which is always 1.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Christoph Hellwig
5dd4056db8 dquot: cleanup space allocation / freeing routines
Get rid of the alloc_space, free_space, reserve_space, claim_space and
release_rsv dquot operations - they are always called from the filesystem
and if a filesystem really needs their own (which none currently does)
it can just call into it's own routine directly.

Move shared logic into the common __dquot_alloc_space,
dquot_claim_space_nodirty and __dquot_free_space low-level methods,
and rationalize the wrappers around it to move as much as possible
code into the common block for CONFIG_QUOTA vs not.  Also rename
all these helpers to be named dquot_* instead of vfs_dq_*.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-03-05 00:20:28 +01:00
Dmitry Monakhov
67eeb5685d ext4: Fix ext4_quota_write cross block boundary behaviour
We always assume what dquot update result in changes in one data block
But ext4_quota_write() function may handle cross block boundary writes
In fact if this ever happen it will result in incorrect journal
credits reservation, and later a BUG_ON.  As soon this never happen
the boundary cross loop is NOOP.  In order to make things straight
let's remove this loop and assert cross boundary condition.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 08:08:51 -05:00
Frank Mayhar
273df556b6 ext4: Convert BUG_ON checks to use ext4_error() instead
Convert a bunch of BUG_ONs to emit a ext4_error() message and return
EIO.  This is a first pass and most notably does _not_ cover
mballoc.c, which is a morass of void functions.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 11:46:09 -05:00
Jiaying Zhang
744692dc05 ext4: use ext4_get_block_write in buffer write
Allocate uninitialized extent before ext4 buffer write and
convert the extent to initialized after io completes.
The purpose is to make sure an extent can only be marked
initialized after it has been written with new data so
we can safely drop the i_mutex lock in ext4 DIO read without
exposing stale data. This helps to improve multi-thread DIO
read performance on high-speed disks.

Skip the nobh and data=journal mount cases to make things simple for now.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 16:14:02 -05:00
Jiaying Zhang
c7064ef13b ext4: mechanical rename some of the direct I/O get_block's identifiers
This commit renames some of the direct I/O's block allocation flags,
variables, and functions introduced in Mingming's "Direct IO for holes
and fallocate" patches so that they can be used by ext4's buffered
write path as well.  Also changed the related function comments
accordingly to cover both direct write and buffered write cases.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-02 13:28:44 -05:00
Dmitry Monakhov
437ca0fda3 ext4: deprecate obsoleted mount options
Declare following list of mount options as deprecated:
 - bsddf, miniddf
 - grpid, bsdgroups, nogrpid, sysvgroups

Declare following list of default mount options as deprecated:
 - bsdgroups

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 22:29:21 -05:00
Dmitry Monakhov
56c50f11f4 ext4: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:28:41 -05:00
Dmitry Monakhov
482a74258f ext4: mount flags manipulation cleanup
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 11:35:32 -05:00
Eric Sandeen
12062dddda ext4: move __func__ into a macro for ext4_warning, ext4_error
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 14:19:27 -05:00
Thadeu Lima de Souza Cascardo
d6b198bc8a fix ext3/ext4 comment typo compain -> complain
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-02-05 12:22:36 +01:00
Theodore Ts'o
9d0be50230 ext4: Calculate metadata requirements more accurately
In the past, ext4_calc_metadata_amount(), and its sub-functions
ext4_ext_calc_metadata_amount() and ext4_indirect_calc_metadata_amount()
badly over-estimated the number of metadata blocks that might be
required for delayed allocation blocks.  This didn't matter as much
when functions which managed the reserved metadata blocks were more
aggressive about dropping reserved metadata blocks as delayed
allocation blocks were written, but unfortunately they were too
aggressive.  This was fixed in commit 0637c6f, but as a result the
over-estimation by ext4_calc_metadata_amount() would lead to reserving
2-3 times the number of pending delayed allocation blocks as
potentially required metadata blocks.  So if there are 1 megabytes of
blocks which have been not yet been allocation, up to 3 megabytes of
space would get reserved out of the user's quota and from the file
system free space pool until all of the inode's data blocks have been
allocated.

This commit addresses this problem by much more accurately estimating
the number of metadata blocks that will be required.  It will still
somewhat over-estimate the number of blocks needed, since it must make
a worst case estimate not knowing which physical blocks will be
needed, but it is much more accurate than before.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-01 02:41:30 -05:00
Andrew Morton
a6b43e3825 ext4: fix unsigned long long printk warning in super.c
sparc64 allmodconfig:

fs/ext4/super.c: In function `lifetime_write_kbytes_show':
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)
fs/ext4/super.c:2174: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-23 07:48:08 -05:00
Theodore Ts'o
51b7e3c9fb ext4: add module aliases for ext2 and ext3
Add module aliases for ext2 and ext3 when CONFIG_EXT4_USE_FOR_EXT23 is
set.  This makes the existing user-space stuff like mkinitrd working
as is.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-21 10:56:09 -05:00
Dmitry Monakhov
a9e7f44720 ext4: Convert to generic reserved quota's space management.
This patch also fixes write vs chown race condition.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-23 13:33:55 +01:00
André Goddard Rosa
e7d2860b69 tree-wide: convert open calls to remove spaces to skip_spaces() lib function
Makes use of skip_spaces() defined in lib/string.c for removing leading
spaces from strings all over the tree.

It decreases lib.a code size by 47 bytes and reuses the function tree-wide:
   text    data     bss     dec     hex filename
  64688     584     592   65864   10148 (TOTALS-BEFORE)
  64641     584     592   65817   10119 (TOTALS-AFTER)

Also, while at it, if we see (*str && isspace(*str)), we can be sure to
remove the first condition (*str) as the second one (isspace(*str)) also
evaluates to 0 whenever *str == 0, making it redundant. In other words,
"a char equals zero is never a space".

Julia Lawall tried the semantic patch (http://coccinelle.lip6.fr) below,
and found occurrences of this pattern on 3 more files:
    drivers/leds/led-class.c
    drivers/leds/ledtrig-timer.c
    drivers/video/output.c

@@
expression str;
@@

( // ignore skip_spaces cases
while (*str &&  isspace(*str)) { \(str++;\|++str;\) }
|
- *str &&
isspace(*str)
)

Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com>
Cc: Julia Lawall <julia@diku.dk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:32 -08:00
Linus Torvalds
3126c136bc Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (21 commits)
  ext3: PTR_ERR return of wrong pointer in setup_new_group_blocks()
  ext3: Fix data / filesystem corruption when write fails to copy data
  ext4: Support for 64-bit quota format
  ext3: Support for vfsv1 quota format
  quota: Implement quota format with 64-bit space and inode limits
  quota: Move definition of QFMT_OCFS2 to linux/quota.h
  ext2: fix comment in ext2_find_entry about return values
  ext3: Unify log messages in ext3
  ext2: clear uptodate flag on super block I/O error
  ext2: Unify log messages in ext2
  ext3: make "norecovery" an alias for "noload"
  ext3: Don't update the superblock in ext3_statfs()
  ext3: journal all modifications in ext3_xattr_set_handle
  ext2: Explicitly assign values to on-disk enum of filetypes
  quota: Fix WARN_ON in lookup_one_len
  const: struct quota_format_ops
  ubifs: remove manual O_SYNC handling
  afs: remove manual O_SYNC handling
  kill wait_on_page_writeback_range
  vfs: Implement proper O_SYNC semantics
  ...
2009-12-11 15:31:13 -08:00
Jan Kara
5a20bdfcdc ext4: Support for 64-bit quota format
Add support for new 64-bit quota format. It is enough to add proper
mount options handling. The rest is done by the generic code.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-12-10 15:02:54 +01:00
Theodore Ts'o
a214238d3b ext4: Do not override ext2 or ext3 if built they are built as modules
The CONFIG_EXT4_USE_FOR_EXT23 option must not try to take over the
ext2 or ext3 file systems if the those file system drivers are
configured to be built as mdoules.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-09 21:09:58 -05:00
Eric Sandeen
15121c18a2 ext4: Fix optional-arg mount options
We have 2 mount options, "barrier" and "auto_da_alloc" which may or
may not take a 1/0 argument.  This causes the ext4 superblock mount
code to subtract uninitialized pointers and pass the result to
kmalloc, which results in very noisy failures.

Per Ted's suggestion, initialize the args struct so that
we know whether match_token() found an argument for the
option, and skip match_int() if not.

Also, return error (0) from parse_options if we thought
we found an argument, but match_int() Fails.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 20:17:55 -05:00
Jan Kara
b436b9bef8 ext4: Wait for proper transaction commit on fsync
We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. What we do is that we track which transaction has last changed
the inode and which transaction last changed allocation and force it to
disk on fsync.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 23:51:10 -05:00
Josef Bacik
d4edac314e ext4: wait for log to commit when umounting
There is a potential race when a transaction is committing right when
the file system is being umounting.  This could reduce in a race
because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
before the commit code calls a callback so the mballoc code can
release freed blocks in the transaction, resulting in a panic trying
to access the freed s_group_info.

The fix is to wait for the transaction to finish committing before we
shutdown the multiblock allocator.  

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-08 21:48:58 -05:00
Theodore Ts'o
24b584240a ext4: Use ext4 file system driver for ext2/ext3 file system mounts
Add a new config option, CONFIG_EXT4_USE_FOR_EXT23 which if enabled,
will cause ext4 to be used for either ext2 or ext3 file system mounts
when ext2 or ext3 is not enabled in the configuration.

This allows minimalist kernel fanatics to drop to file system drivers
from their compiled kernel with out losing functionality.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-12-07 14:08:51 -05:00
Eric Sandeen
e3bb52ae2b ext4: make "norecovery" an alias for "noload"
Users on the linux-ext4 list recently complained about differences
across filesystems w.r.t. how to mount without a journal replay.

In the discussion it was noted that xfs's "norecovery" option is
perhaps more descriptively accurate than "noload," so let's make
that an alias for ext4.

Also show this status in /proc/mounts

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:28:50 -05:00
Eric Sandeen
5328e63531 ext4: make trim/discard optional (and off by default)
It is anticipated that when sb_issue_discard starts doing
real work on trim-capable devices, we may see issues.  Make
this mount-time optional, and default it to off until we know
that things are working out OK.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-19 14:25:42 -05:00
Theodore Ts'o
3f8fb9490e ext4: don't update the superblock in ext4_statfs()
commit a71ce8c6c9 updated ext4_statfs()
to update the on-disk superblock counters, but modified this buffer
directly without any journaling of the change.  This is one of the
accesses that was causing the crc errors in journal replay as seen in
kernel.org bugzilla #14354.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:52 -05:00
Theodore Ts'o
cf40db137c ext4: remove failed journal checksum check
Now that we are checking for failed journal checksums in the jbd2
layer, we don't need to check in the ext4 mount path --- since a
checksum fail will result in ext4_load_journal() returning an error,
causing the file system to refuse to be mounted until e2fsck can deal
with the problem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-11-22 21:00:01 -05:00
Theodore Ts'o
503358ae01 ext4: avoid divide by zero when trying to mount a corrupted file system
If s_log_groups_per_flex is greater than 31, then groups_per_flex will
will overflow and cause a divide by zero error.  This can cause kernel
BUG if such a file system is mounted.

Thanks to Nageswara R Sastry for analyzing the failure and providing
an initial patch.

http://bugzilla.kernel.org/show_bug.cgi?id=14287

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-11-23 07:24:46 -05:00
Linus Torvalds
d4da6c9ccf Revert "ext4: Remove journal_checksum mount option and enable it by default"
This reverts commit d0646f7b63, as
requested by Eric Sandeen.

It can basically cause an ext4 filesystem to miss recovery (and thus get
mounted with errors) if the journal checksum does not match.

Quoth Eric:

   "My hand-wavy hunch about what is happening is that we're finding a
    bad checksum on the last partially-written transaction, which is
    not surprising, but if we have a wrapped log and we're doing the
    initial scan for head/tail, and we abort scanning on that bad
    checksum, then we are essentially running an unrecovered filesystem.

    But that's hand-wavy and I need to go look at the code.

    We lived without journal checksums on by default until now, and at
    this point they're doing more harm than good, so we should revert
    the default-changing commit until we can fix it and do some good
    power-fail testing with the fixes in place."

See

	http://bugzilla.kernel.org/show_bug.cgi?id=14354

for all the gory details.

Requested-by: Eric Sandeen <sandeen@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Alexey Fisher <bug-track@fisher-privat.net>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mathias Burén <mathias.buren@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-02 10:15:27 -08:00
Eric Sandeen
f0e2dfa7f3 ext4: drop ext4dev compat
Kconfig & super.c promised it'd be gone by 2.6.31, so it's
about time to drop it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-10-01 02:21:07 -04:00
Theodore Ts'o
296c355cd6 ext4: Use tracepoints for mb_history trace file
The /proc/fs/ext4/<dev>/mb_history was maintained manually, and had a
number of problems: it required a largish amount of memory to be
allocated for each ext4 filesystem, and the s_mb_history_lock
introduced a CPU contention problem.  

By ripping out the mb_history code and replacing it with ftrace
tracepoints, and we get more functionality: timestamps, event
filtering, the ability to correlate mballoc history with other ext4
tracepoints, etc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-30 00:32:42 -04:00
Theodore Ts'o
90576c0b9a ext4, jbd2: Drop unneeded printks at mount and unmount time
There are a number of kernel printk's which are printed when an ext4
filesystem is mounted and unmounted.  Disable them to economize space
in the system logs.  In addition, disabling the mballoc stats by
default saves a number of unneeded atomic operations for every block
allocation or deallocation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 15:51:30 -04:00
Curt Wohlgemuth
d3d1faf6a7 ext4: Handle nested ext4_journal_start/stop calls without a journal
This patch fixes a problem with handling nested calls to
ext4_journal_start/ext4_journal_stop, when there is no journal present.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 11:01:03 -04:00
Mingming Cao
8d5d02e6b1 ext4: async direct IO for holes and fallocate support
For async direct IO that covers holes or fallocate, the end_io
callback function now queued the convertion work on workqueue but
don't flush the work rightaway as it might take too long to afford.

But when fsync is called after all the data is completed, user expects
the metadata also being updated before fsync returns.

Thus we need to flush the conversion work when fsync() is called.
This patch keep track of a listed of completed async direct io that
has a work queued on workqueue.  When fsync() is called, it will go
through the list and do the conversion.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2009-09-28 15:48:29 -04:00
Mingming Cao
4c0425ff68 ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O
Currently the DIO VFS code passes create = 0 when writing to the
middle of file.  It does this to avoid block allocation for holes, so
as not to expose stale data out when there is a parallel buffered read
(which does not hold the i_mutex lock).  Direct I/O writes into holes
falls back to buffered IO for this reason.

Since preallocated extents are treated as holes when doing a
get_block() look up (buffer is not mapped), direct IO over fallocate
also falls back to buffered IO.  Thus ext4 actually silently falls
back to buffered IO in above two cases, which is undesirable.

To fix this, this patch creates unitialized extents when a direct I/O
write into holes in sparse files, and registering an end_io callback which
converts the uninitialized extent to an initialized extent after the
I/O is completed.

Singed-Off-By: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-28 15:48:41 -04:00
Theodore Ts'o
55138e0bc2 ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks
Work around problems in the writeback code to force out writebacks in
larger chunks than just 4mb, which is just too small.  This also works
around limitations in the ext4 block allocator, which can't allocate
more than 2048 blocks at a time.  So we need to defeat the round-robin
characteristics of the writeback code and try to write out as many
blocks in one inode before allowing the writeback code to move on to
another inode.  We add a a new per-filesystem tunable,
max_writeback_mb_bump, which caps this to a default of 128mb per
inode.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-29 13:31:31 -04:00
Alexey Dobriyan
0d54b217a2 const: make struct super_block::s_qcop const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan
61e225dc34 const: make struct super_block::dq_op const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Eric Sandeen
fb0a387dcd ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.

This should address RH Bug 519471, ext4 bitmap allocator 
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes.  Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:45:10 -04:00
Theodore Ts'o
3661d28615 ext4: Fix include/trace/events/ext4.h to work with Systemtap
Using relative pathnames in #include statements interacts badly with
SystemTap, since the fs/ext4/*.h header files are not packaged up as
part of a distribution kernel's header files.  Since systemtap doesn't
use TP_fast_assign(), we can use a blind structure definition and then
make sure the needed header files are defined before the ext4 source
files #include the trace/events/ext4.h header file.

https://bugzilla.redhat.com/show_bug.cgi?id=512478

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-14 22:59:50 -04:00
Theodore Ts'o
7ad9bb651f ext4: Fix initalization of s_flex_groups
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg.  By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.  

The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.

Thanks to Damien Guibouret for pointing out this problem!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-11 16:51:28 -04:00
Theodore Ts'o
71290b368a ext4: Don't update superblock write time when filesystem is read-only
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-10 17:31:04 -04:00
Theodore Ts'o
d0646f7b63 ext4: Remove journal_checksum mount option and enable it by default
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-05 12:50:43 -04:00
Eric Sandeen
a13fb1a453 ext4: Add feature set check helper for mount & remount paths
A user reported that although his root ext4 filesystem was mounting
fine, other filesystems would not mount, with the:

"Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF"

error on his 32-bit box built without CONFIG_LBDAF.  This is because
the test at mount time for this situation was not being re-checked
on remount, and the normal boot process makes an ro->rw transition,
so this was being missed.

Refactor to make a common helper function to test the filesystem
features against the type of mount request (RO vs. RW) so that we 
stay consistent.

Addresses Red-Hat-Bugzilla: #517650

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-18 00:20:23 -04:00
Eric Sandeen
bf43d84b18 ext4: reject too-large filesystems on 32-bit kernels
ext4 will happily mount a > 16T filesystem on a 32-bit box, but
this is not safe; writes to the block device will wrap past 16T
and the page cache can't index past 16T (232 index * 4k pages).

Adding another test to the existing "too many sectors" test
should do the trick.

Add a comment, a relevant return value, and fix the reference
to the CONFIG_LBD(AF) option as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-08-17 23:48:51 -04:00
Theodore Ts'o
78f1ddbb49 ext4: Avoid null pointer dereference when decoding EROFS w/o a journal
We need to check to make sure a journal is present before checking the
journal flags in ext4_decode_error().

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 23:09:47 -04:00
Al Viro
d4bfe2f76d switch ext4 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Linus Torvalds
31583d6acf Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix kernel-doc parameter name typo in blk-settings.c:
  block: rename CONFIG_LBD to CONFIG_LBDAF
  block: Fix bounce_pfn setting
  hd: stop defining MAJOR_NR
2009-06-19 17:43:04 -07:00
Bartlomiej Zolnierkiewicz
90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Andreas Dilger
11013911da ext4: teach the inode allocator to use a goal inode number
Enhance the inode allocator to take a goal inode number as a
paremeter; if it is specified, it takes precedence over Orlov or
parent directory inode allocation algorithms.

The extents migration function uses the goal inode number so that the
extent trees allocated the migration function use the correct flex_bg.
In the future, the goal inode functionality will also be used to
allocate an adjacent inode for the extended attributes.

Also, for testing purposes the goal inode number can be specified via
/sys/fs/{dev}/inode_goal.  This can be useful for testing inode
allocation beyond 2^32 blocks on very large filesystems.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:45:35 -04:00
Theodore Ts'o
4ab2f15b7f ext4: move the abort flag from s_mount_opts to s_mount_flags
We're running out of space in the mount options word, and
EXT4_MOUNT_ABORT isn't really a mount option, but a run-time flag.  So
move it to become EXT4_MF_FS_ABORTED in s_mount_flags.

Also remove bogus ext2_fs.h / ext4.h simultaneous #include protection,
which can never happen.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:36 -04:00
Theodore Ts'o
7f4520cc62 ext4: change s_mount_opt to be an unsigned int
We can only fit 32 options in s_mount_opt because an unsigned long is
32-bits on a x86 machine.  So use an unsigned int to save space on
64-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:41 -04:00
Alessio Igor Bogani
337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Christoph Hellwig
ebc1ac1645 ->write_super lock_super pushdown
Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

 * bfs, nilfs2 - no other uses of s_lock and have internal locks in
	->write_super
 * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
 * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
 	->write_super
 * xfs - no other uses of s_lock and uses internal lock (buffer lock on
	superblock buffer) to serialize ->write_super.  Also xfs_fs_write_super
	is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Al Viro
bbd6851a32 Push lock_super() into the ->remount_fs() of filesystems that care about it
Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own.  Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Christoph Hellwig
6cfd014842 push BKL down into ->put_super
Move BKL into ->put_super from the only caller.  A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment.  Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
a9e220f832 No need to do lock_super() for exclusion in generic_shutdown_super()
We can't run into contention on it.  All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held.  So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it.  The rest of filesystems do not care at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Christoph Hellwig
8c85e12512 remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.

Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.

Exceptions:

 - affs already has identical copy & pasted code at the beginning of
   affs_put_super so no need to do it twice.
 - xfs does the right thing without it and I have changes pending for
   the xfs tree touching this are so I don't really need conflicts
   here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Eric Sandeen
b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Andreas Dilger
0b8e58a140 ext4: super.c whitespace cleanup
Cleanup of whitespace and formatting.  Initially driven by confusing indents
for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd
cleanup some other 80-column wrapping and other indenting problems at the
same time.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03 17:59:28 -04:00
Theodore Ts'o
88b6edd17c ext4: Clean up calls to ext4_get_group_desc()
If the caller isn't planning on modifying the block group descriptors,
there's no need to pass in a pointer to a struct buffer_head.  Nuking
this saves a tiny amount of CPU time and stack space usage.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:50:39 -04:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Manish Katiyar
f68301656b ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:44 -04:00
Theodore Ts'o
6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Aneesh Kumar K.V
955ce5f5be ext4: Convert ext4_lock_group to use sb_bgl_lock
We have sb_bgl_lock() and ext4_group_info.bb_state
bit spinlock to protech group information. The later is only
used within mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoid
confusion with two locks protecting same info.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 20:35:09 -04:00
Curt Wohlgemuth
f40339031b ext4: Make the length of the mb_history file tunable
In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.

This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameter/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 20:27:20 -04:00
Theodore Ts'o
bb23c20a85 ext4: Move fs/ext4/group.h into ext4.h
Move the function prototypes in group.h into ext4.h so they are all
defined in one place.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 19:44:44 -04:00
Theodore Ts'o
596397b77c ext4: Move fs/ext4/namei.h into ext4.h
The fs/ext4/namei.h header file had only a single function
declaration, and should have never been a standalone file.  Move it
into ext4.h, where should have been from the beginning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:49:15 -04:00
Theodore Ts'o
9bffad1ed2 ext4: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:48:11 -04:00
Theodore Ts'o
32ed5058ce ext4: Replace lock/unlock_super() with an explicit lock for resizing
Use a separate lock to protect s_groups_count and the other block
group descriptors which get changed via an on-line resize operation,
so we can stop overloading the use of lock_super().
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:53:39 -04:00
Theodore Ts'o
3b9d4ed266 ext4: Replace lock/unlock_super() with an explicit lock for the orphan list
Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:54:04 -04:00
Theodore Ts'o
a63c9eb2ce ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
The function ext4_mark_recovery_complete() is called from two call
paths: either (a) while mounting the filesystem, in which case there's
no danger of any other CPU calling write_super() until the mount is
completed, and (b) while remounting the filesystem read-write, in
which case the fs core has already locked the superblock.  This also
allows us to take out a very vile unlock_super()/lock_super() pair in
ext4_remount().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 01:59:42 -04:00
Theodore Ts'o
114e9fc907 ext4: Remove outdated comment about lock_super()
ext4_fill_super() is no longer called by read_super(), and it is no
longer called with the superblock locked.  The
unlock_super()/lock_super() is no longer present, so this comment is
entirely superfluous.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 15:48:07 -04:00
Theodore Ts'o
8df9675f8b ext4: Avoid races caused by on-line resizing and SMP memory reordering
Ext4's on-line resizing adds a new block group and then, only at the
last step adjusts s_groups_count.  However, it's possible on SMP
systems that another CPU could see the updated the s_group_count and
not see the newly initialized data structures for the just-added block
group.  For this reason, it's important to insert a SMP read barrier
after reading s_groups_count and before reading any (for example) the
new block group descriptors allowed by the increased value of
s_groups_count.

Unfortunately, we rather blatently violate this locking protocol
documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use just-added block group before any
memory ordering issues resolve themselves.  So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 08:50:38 -04:00
Theodore Ts'o
9ca92389c5 ext4: Use separate super_operations structure for no_journal filesystems
By using a separate super_operations structure for filesystems that
have and don't have journals, we can simply ext4_write_super() ---
which is only needed when no journal is present --- and ext4_freeze(),
ext4_unfreeze(), and ext4_sync_fs(), which are only needed when the
journal is present.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 12:52:25 -04:00
Theodore Ts'o
7234ab2a55 ext4: Fix and simplify s_dirt handling
The s_dirt flag wasn't completely handled correctly, but it didn't
really matter when journalling was enabled.  It turns out that when
ext4 runs without a journal, we don't clear s_dirt in places where we
should have, with the result that the high-level write_super()
function was writing the superblock when it wasn't necessary.

So we fix this by making ext4_commit_super() clear the s_dirt flag,
and removing many of the other places where s_dirt is manipulated.
When journalling is enabled, the s_dirt flag might be left set more
often, but s_dirt really doesn't matter when journalling is enabled.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-30 21:24:04 -04:00
Theodore Ts'o
e2d670523c ext4: Simplify ext4_commit_super()'s function signature
The ext4_commit_super() function took both a struct super_block * and
a struct ext4_super_block *, but the struct ext4_super_block can be
derived from the struct super_block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 00:33:44 -04:00
Theodore Ts'o
f7c439504c ext4: Use is_power_of_2() for clarity
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 23:31:59 -04:00
Theodore Ts'o
c5ca7c7636 ext4: Fallback to vmalloc if kmalloc can't allocate s_flex_groups array
For very large filesystems, the s_flex_groups array can get quite big.
For example, a filesystem that can be resized up to 16TB will have
8192 flex groups (assuming the default flex_bg size of 16), so the
array is 96k, which is *very* marginal for kmalloc().  On the other
hand, a 160GB filesystem without the resize_inode feature will only
require 960 bytes.  So we try to allocate the array first using
kmalloc(), and if that fails, we'll try to use vmalloc() instead.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27 22:48:48 -04:00
From: Thiemo Nagel
0f2ddca66d ext4: check block device size on mount
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-07 14:07:47 -04:00
Theodore Ts'o
06705bff91 ext4: Regularize mount options
Add support for using the mount options "barrier" and "nobarrier", and
"auto_da_alloc" and "noauto_da_alloc", which is more consistent than
"barrier=<0|1>" or "auto_da_alloc=<0|1>".  Most other ext3/ext4 mount
options use the foo/nofoo naming convention.  We allow the old forms
of these mount options for backwards compatibility.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-28 10:59:57 -04:00
Theodore Ts'o
afd4672dc7 ext4: Add auto_da_alloc mount option
Add a mount option which allows the user to disable automatic
allocation of blocks whose allocation by delayed allocation when the
file was originally truncated or when the file is renamed over an
existing file.  This feature is intended to save users from the
effects of naive application writers, but it reduces the effectiveness
of the delayed allocation code.  This mount option disables this
safety feature, which may be desirable for prodcutions systems where
the risk of unclean shutdowns or unexpected system crashes is low.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-16 23:12:23 -04:00
Theodore Ts'o
7d39db14a4 ext4: Use struct flex_groups to calculate get_orlov_stats()
Instead of looping over all of the block groups in a flex group
summing their summary statistics, start tracking used_dirs in struct
flex_groups, and use struct flex_groups instead.  This should save a
bit of CPU for mkdir-heavy workloads.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:31:53 -05:00
Theodore Ts'o
9f24e4208f ext4: Use atomic_t's in struct flex_groups
Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's
to track the number of free blocks and inodes in each flex_group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-04 19:09:10 -05:00
Theodore Ts'o
b713a5ec55 ext4: remove /proc tuning knobs
Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been
replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:11:14 -04:00
Theodore Ts'o
3197ebdb13 ext4: Add sysfs support
Add basic sysfs support so that information about the mounted
filesystem and various tuning parameters can be accessed via
/sys/fs/ext4/<dev>/*.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-03-31 09:10:09 -04:00
Theodore Ts'o
afc32f7ee9 ext4: Track lifetime disk writes
Add a new superblock value which tracks the lifetime amount of writes
to the filesystem.  This is useful in estimating the amount of wear on
solid state drives (SSD's) caused by writes to the filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 19:39:58 -05:00
Pekka Enberg
705895b611 ext4: allocate ->s_blockgroup_lock separately
As spotted by kmemtrace, struct ext4_sb_info is 17664 bytes on 64-bit
which makes it a very bad fit for SLAB allocators.  The culprit of the
wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when
NR_CPUS >= 32.

To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2
page in the worst case, separately.  This shinks down struct ext4_sb_info
enough to fit a 2 KB slab cache so now we allocate 16 KB + 2 KB instead of
32 KB saving 14 KB of memory.

Acked-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-15 18:07:52 -05:00
Jan Kara
a269eb1829 ext4: Use lowercase names of quota functions
Use lowercase names of quota functions instead of old uppercase ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
2009-03-26 02:18:36 +01:00
Mingming Cao
60e58e0f30 ext4: quota reservation for delayed allocation
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Jan Kara
edf7245362 ext4: Remove unnecessary quota functions
ext4_dquot_initialize() and ext4_dquot_drop() is no longer
needed because of modified quota locking.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-03-26 02:18:34 +01:00
Theodore Ts'o
8b1a8ff8b3 ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()
Commit c4be0c1d added error checking to ext4_freeze() when calling
ext4_commit_super().  Unfortunately the patch failed to remove the
original call to ext4_commit_super(), with the net result that when
freezing the filesystem, the superblock gets written twice, the first
time without error checking.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-02-28 00:08:53 -05:00
Jan Kara
9eddacf9e9 Revert "ext4: wait on all pending commits in ext4_sync_fs()"
This undoes commit 14ce0cb411.

Since jbd2_journal_start_commit() is now fixed to return 1 when we
started a transaction commit, there's some transaction waiting to be
committed or there's a transaction already committing, we don't
need to call ext4_force_commit() in ext4_sync_fs(). Furthermore
ext4_force_commit() can unnecessarily create sync transaction which is
expensive so it's worthwhile to remove it when we can.

http://bugzilla.kernel.org/show_bug.cgi?id=12224

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: linux-ext4@vger.kernel.org
2009-02-10 06:46:05 -05:00
Takashi Sato
c4be0c1dc4 filesystem freeze: add error handling of write_super_lockfs/unlockfs
Currently, ext3 in mainline Linux doesn't have the freeze feature which
suspends write requests.  So, we cannot take a backup which keeps the
filesystem's consistency with the storage device's features (snapshot and
replication) while it is mounted.

In many case, a commercial filesystem (e.g.  VxFS) has the freeze feature
and it would be used to get the consistent backup.

If Linux's standard filesystem ext3 has the freeze feature, we can do it
without a commercial filesystem.

So I have implemented the ioctls of the freeze feature.
I think we can take the consistent backup with the following steps.
1. Freeze the filesystem with the freeze ioctl.
2. Separate the replication volume or create the snapshot
   with the storage device's feature.
3. Unfreeze the filesystem with the unfreeze ioctl.
4. Take the backup from the separated replication volume
   or the snapshot.

This patch:

VFS:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they can return an error.
Rename write_super_lockfs and unlockfs of the super block operation
freeze_fs and unfreeze_fs to avoid a confusion.

ext3, ext4, xfs, gfs2, jfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that write_super_lockfs returns an error if needed,
and unlockfs always returns 0.

reiserfs:
Changed the type of write_super_lockfs and unlockfs from "void"
to "int" so that they always return 0 (success) to keep a current behavior.

Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Masayuki Hamaguchi <m-hamaguchi@ys.jp.nec.com>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-09 16:54:42 -08:00
Linus Torvalds
2150edc6c5 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
  jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
  ext4: Remove "extents" mount option
  block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
  ext4: Make printk's consistently prefixed with "EXT4-fs: "
  ext4: Add sanity checks for the superblock before mounting the filesystem
  ext4: Add mount option to set kjournald's I/O priority
  jbd2: Submit writes to the journal using WRITE_SYNC
  jbd2: Add pid and journal device name to the "kjournald2 starting" message
  ext4: Add markers for better debuggability
  ext4: Remove code to create the journal inode
  ext4: provide function to release metadata pages under memory pressure
  ext3: provide function to release metadata pages under memory pressure
  add releasepage hooks to block devices which can be used by file systems
  ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
  ext4: Init the complete page while building buddy cache
  ext4: Don't allow new groups to be added during block allocation
  ext4: mark the blocks/inode bitmap beyond end of group as used
  ext4: Use new buffer_head flag to check uninit group bitmaps initialization
  ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
  ext4: code cleanup
  ...
2009-01-08 17:14:59 -08:00
Theodore Ts'o
83982b6f47 ext4: Remove "extents" mount option
This mount option is largely superfluous, and in fact the way it was
implemented was buggy; if a filesystem which did not have the extents
feature flag was mounted -o extents, the filesystem would attempt to
create and use extents-based file even though the extents feature flag
was not eabled.  The simplest thing to do is to nuke the mount option
entirely.  It's not all that useful to force the non-creation of new
extent-based files if the filesystem can support it.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 14:53:16 -05:00
Theodore Ts'o
abda141892 ext4: Make printk's consistently prefixed with "EXT4-fs: "
Previously, some were "ext4: ", and some were "EXT4: "; change them to
be consistent with most ext4 printk's, which is to use "EXT4-fs: ".

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 00:20:32 -05:00
Theodore Ts'o
4ec1102813 ext4: Add sanity checks for the superblock before mounting the filesystem
This avoids insane superblock configurations that could lead to kernel
oops due to null pointer derefences.

http://bugzilla.kernel.org/show_bug.cgi?id=12371

Thanks to David Maciejak at Fortinet's FortiGuard Global Security
Research Team who discovered this bug independently (but at
approximately the same time) as Thiemo Nagel, who submitted the patch.

Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-06 14:53:26 -05:00
Theodore Ts'o
b3881f74b3 ext4: Add mount option to set kjournald's I/O priority
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jens Axboe <jens.axboe@oracle.com>
2009-01-05 22:46:26 -05:00
Jan Kara
a5b5ee3201 ext4: Add default allocation routines for quota structures
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:26 -08:00
Jan Kara
17bd13b31c ext4: Use sb_any_quota_loaded() instead of sb_any_quota_enabled()
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:36:56 -08:00
Theodore Ts'o
c319106723 ext4: Remove code to create the journal inode
This code has been obsolete in quite some time, since the supported
method for adding a journal inode is to use tune2fs (or to creating
new filesystem with a journal via mke2fs or mkfs.ext4).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-06 11:14:25 -05:00
Toshiyuki Okajima
c39a7f84d7 ext4: provide function to release metadata pages under memory pressure
Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops.  However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which calls
jbd2_journal_try_to_free_buffers() function to free the metadata, and
which is called by the block device's blkdev_releasepage() function.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
2009-01-05 22:38:48 -05:00
Aneesh Kumar K.V
560671a0d3 ext4: Use high 16 bits of the block group descriptor's free counts fields
Rename the lower bits with suffix _lo and add helper
to access the values. Also rename bg_itable_unused_hi
to bg_pad as in e2fsprogs.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:20:24 -05:00
Aneesh Kumar K.V
5d1b1b3f49 ext4: fix BUG when calling ext4_error with locked block group
The mballoc code likes to call ext4_error while it is holding locked
block groups.  This can causes a scheduling in atomic context BUG.  We
can't just unlock the block group and relock it after/if ext4_error
returns since that might result in race conditions in the case where
the filesystem is set to continue after finding errors.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:19:52 -05:00
Jens Axboe
b3a6ffe16b Get rid of CONFIG_LSF
We have two seperate config entries for large devices/files. One
is CONFIG_LBD that guards just the devices, the other is CONFIG_LSF
that handles large files. This doesn't make a lot of sense, you typically
want both or none. So get rid of CONFIG_LSF and change CONFIG_LBD wording
to indicate that it covers both.

Acked-by: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-12-29 08:29:51 +01:00
Aneesh Kumar K.V
3a06d778df ext4: sparse fixes
* Change EXT4_HAS_*_FEATURE to return a boolean
* Add a function prototype for ext4_fiemap() in ext4.h
* Make ext4_ext_fiemap_cb() and ext4_xattr_fiemap() be static functions
* Add lock annotations to mb_free_blocks()

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-22 15:04:59 -05:00
Theodore Ts'o
a9df9a4910 ext4: Make ext4_group_t be an unsigned int
Nearly all places in the ext3/4 code which uses "unsigned long" is
probably a bug, since on 32-bit systems a ulong a 32-bits, which means
we are wasting stack space on 64-bit systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:18:16 -05:00
Theodore Ts'o
30773840c1 ext4: add fsync batch tuning knobs
Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched with a synchronous write transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-03 20:27:38 -05:00
Theodore Ts'o
fde4d95ad8 ext4: remove extraneous newlines from calls to ext4_error() and ext4_warning()
This removes annoying blank syslog entries emitted by ext4_error() or
ext4_warning(), since these functions add their own newline.

Signed-off-by: Nick Warne <nick@ukfsn.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-05 22:17:35 -05:00
Frank Mayhar
0390131ba8 ext4: Allow ext4 to run without a journal
A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-01-07 00:06:22 -05:00
Roel Kluin
23475e264c ext4: Use simple_strtol() instead of simple_strtoul() in ext4_ui_proc_open
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-26 02:23:19 -05:00
Aneesh Kumar K.V
565a9617b2 ext4: avoid ext4_error when mounting a fs with a single bg
Remove some completely unneeded code which which caused an ext4_error
to be generated when mounting a file system with only a single block
group.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-01-05 21:51:07 -05:00
Theodore Ts'o
14ce0cb411 ext4: wait on all pending commits in ext4_sync_fs()
In ext4_sync_fs, we only wait for a commit to finish if we started it,
but there may be one already in progress which will not be synced.

In the case of a data=ordered umount with pending long symlinks which
are delayed due to a long list of other I/O on the backing block
device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of
fsync_super.  Then, before they can be dirtied again, kjournald exits,
seeing the UMOUNT flag and the dirty pages are never written to the
backing block device, causing long symlink corruption and exposing new
or previously freed block data to userspace.

To ensure all commits are synced, we flush all journal commits now
when sync_fs'ing ext4.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
2008-11-03 18:10:55 -05:00
Aneesh Kumar K.V
d94e99a64c ext4: Convert to host order before using the values.
Use le16_to_cpu to read the s_reserved_gdt_blocks values
from super block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-11-04 09:11:26 -05:00
Theodore Ts'o
f99b25897a ext4: Add support for non-native signed/unsigned htree hash algorithms
The original ext3 hash algorithms assumed that variables of type char
were signed, as God and K&R intended.  Unfortunately, this assumption
is not true on some architectures.  Userspace support for marking
filesystems with non-native signed/unsigned chars was added two years
ago, but the kernel-side support was never added (until now).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-28 13:21:44 -04:00
Hidehiro Kawai
ef2cabf7c6 ext4: fix a bug accessing freed memory in ext4_abort
Vegard Nossum reported a bug which accesses freed memory (found via
kmemcheck).  When journal has been aborted, ext4_put_super() calls
ext4_abort() after freeing the journal_t object, and then ext4_abort()
accesses it.  This patch fix it.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-27 22:53:05 -04:00
Linus Torvalds
2248485640 Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev
* git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits)
  [PATCH] kill the rest of struct file propagation in block ioctls
  [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET
  [PATCH] get rid of blkdev_locked_ioctl()
  [PATCH] get rid of blkdev_driver_ioctl()
  [PATCH] sanitize blkdev_get() and friends
  [PATCH] remember mode of reiserfs journal
  [PATCH] propagate mode through swsusp_close()
  [PATCH] propagate mode through open_bdev_excl/close_bdev_excl
  [PATCH] pass fmode_t to blkdev_put()
  [PATCH] kill the unused bsize on the send side of /dev/loop
  [PATCH] trim file propagation in block/compat_ioctl.c
  [PATCH] end of methods switch: remove the old ones
  [PATCH] switch sr
  [PATCH] switch sd
  [PATCH] switch ide-scsi
  [PATCH] switch tape_block
  [PATCH] switch dcssblk
  [PATCH] switch dasd
  [PATCH] switch mtd_blkdevs
  [PATCH] switch mmc
  ...
2008-10-23 10:23:07 -07:00
Al Viro
8264613def [PATCH] switch quota_on-related stuff to kern_path()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-23 05:12:44 -04:00
Al Viro
9a1c354276 [PATCH] pass fmode_t to blkdev_put()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-10-21 07:48:58 -04:00
Theodore Ts'o
f287a1a561 ext4: Remove automatic enabling of the HUGE_FILE feature flag
If the HUGE_FILE feature flag is not set, don't allow the creation of
large files, instead of automatically enabling the feature flag.
Recent versions of mke2fs will set the HUGE_FILE flag automatically
anyway for ext4 filesystems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-16 22:50:48 -04:00
Theodore Ts'o
01436ef2e4 ext4: Remove unused mount options: nomballoc, mballoc, nocheck
These mount options don't actually do anything any more, so remove
them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 07:22:35 -04:00
Eric Sesterhenn
5128273a32 ext4: Add missing newlines to printk messages
There are some newlines missing in ext4_check_descriptors, which
cause the printk level to be printed out when the next printk call
is made:

[  778.847265] EXT4-fs: ext4_check_descriptors: Block bitmap for group 0
not in group (block 1509949442)!<3>EXT4-fs: group descriptors corrupted!
[  802.646630] EXT4-fs: ext4_check_descriptors: Inode bitmap for group 0
not in group (block 9043971)!<3>EXT4-fs: group descriptors corrupted!

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-17 09:16:19 -04:00
Aneesh Kumar K.V
c2774d84fd ext4: Do mballoc init before doing filesystem recovery
During filesystem recovery we may be doing a truncate
which expects some of the mballoc data structures to
be initialized. So do ext4_mb_init before recovery.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:07:20 -04:00
Steven Whitehouse
a447c09324 vfs: Use const for kernel parser table
This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.

This was posted for review some time ago and I believe its been in -mm
since then.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Alexander Viro <aviro@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-13 10:10:37 -07:00
Alexander Beregalov
3244fcb1ae ext4: fix build failure without procfs
fs/ext4/super.c: In function 'ext4_fill_super':
fs/ext4/super.c:2226: error: 'ext4_ui_proc_fops' undeclared (first use
in this function)
fs/ext4/super.c:2226: error: (Each undeclared identifier is reported
only once
fs/ext4/super.c:2226: error: for each function it appears in.)

Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-12 17:27:49 -04:00
Hidehiro Kawai
5bf5683a33 ext4: add an option to control error handling on file data
If the journal doesn't abort when it gets an IO error in file data
blocks, the file data corruption will spread silently.  Because
most of applications and commands do buffered writes without fsync(),
they don't notice the IO error.  It's scary for mission critical
systems.  On the other hand, if the journal aborts whenever it gets
an IO error in file data blocks, the system will easily become
inoperable.  So this patch introduces a filesystem option to
determine whether it aborts the journal or just call printk() when
it gets an IO error in file data.

If you mount an ext4 fs with data_err=abort option, it aborts on file
data write error.  If you mount it with data_err=ignore, it doesn't
abort, just call printk().  data_err=ignore is the default.

Here is the corresponding patch of the ext3 version:
http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 22:12:43 -04:00
Hidehiro Kawai
7ffe1ea894 ext4: add checks for errors from jbd2
If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space.  Otherwise, the
filesystem will lose uncheckpointed metadata completely and
become inconsistent.  To avoid this, we need to keep needs_recovery
flag if checkpoint has failed.

With this patch, ext4_put_super() detects a checkpointing failure
from the return value of journal_destroy(), then it invokes
ext4_abort() to make the filesystem read only and keep
needs_recovery flag.  Errors from jbd2_journal_flush() are also
handled by this patch in some places.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-10-10 20:29:21 -04:00
Theodore Ts'o
03010a3350 ext4: Rename ext4dev to ext4
The ext4 filesystem is getting stable enough that it's time to drop
the "dev" prefix.  Also remove the requirement for the TEST_FILESYS
flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 20:02:48 -04:00
Andi Kleen
39d80c33a0 ext4: Avoid double dirtying of super block in ext4_put_super()
While reading code I noticed that ext4_put_super() dirties the 
superblock bh twice. It is always done in ext4_commit_super()
too. Remove the redundant dirty operation.
Should be a nop semantically.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
2008-10-06 21:37:44 -04:00
Theodore Ts'o
ede86cc473 ext4: Add debugging markers that can be used by systemtap
This debugging markers are designed to debug problems such as the
random filesystem latency problems reported by Arjan.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-05 20:50:06 -04:00
Theodore Ts'o
c2ea3fde61 ext4: Remove old legacy block allocator
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 09:40:52 -04:00
Theodore Ts'o
240799cdf2 ext4: Use readahead when reading an inode from the inode table
With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request readahead for adjacent inode table
blocks to reduce the time it takes when iterating over directories
(especially when doing this in htree sort order) in a cold cache case.
With this patch, the time it takes to run "git status" on a kernel
tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-09 23:53:47 -04:00
Theodore Ts'o
5e8814f2f7 ext4: Combine proc file handling into a single set of functions
Previously mballoc created a separate set of functions for each proc
file.  This combines the tunables into a single set of functions which
gets used for all of the per-superblock proc files, saving
approximately 2k of compiled object code.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-23 18:07:35 -04:00
Theodore Ts'o
9f6200bbfc ext4: move /proc setup and teardown out of mballoc.c
...and into the core setup/teardown code in fs/ext4/super.c so that
other parts of ext4 can define tuning parameters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-23 09:18:24 -04:00
Theodore Ts'o
914258bf2c ext4/jbd2: Avoid WARN() messages when failing to write to the superblock
This fixes some very common warnings reported by kerneloops.org

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-06 21:35:40 -04:00
Li Zefan
7ee1ec4ca3 ext4: add missing unlock in ext4_check_descriptors() on error path
If there group descriptors are corrupted we need unlock the block
group lock before returning from the function; else we will oops when
freeing a spinlock which is still being held.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:47:19 -04:00
Theodore Ts'o
05496769e5 jbd2: clean up how the journal device name is printed
Calculate the journal device name once and stash it away in the
journal_s structure.  This avoids needing to call bdevname()
everywhere and reduces stack usage by not needing to allocate an
on-stack buffer.  In addition, we eliminate the '/' that can appear in
device names (e.g. "cciss/c0d0p9" --- see kernel bugzilla #11321) that
can cause problems when creating proc directory names, and include the
inode number to support ocfs2 which creates multiple journals with
different inode numbers.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-16 14:36:17 -04:00
Frederic Bohe
c62a11fd95 Update flex_bg free blocks and free inodes counters when resizing.
This fixes a bug which prevented the newly created inodes after a
resize from being used on filesystems with flex_bg.

Signed-off-by: Frederic Bohe <frederic.bohe@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 10:20:24 -04:00
Aneesh Kumar K.V
6bc6e63fcd ext4: Add percpu dirty block accounting.
This patch adds dirty block accounting using percpu_counters.  Delayed
allocation block reservation is now done by updating dirty block
counter.  In a later patch we switch to non delalloc mode if the
filesystem free blocks is greater than 150% of total filesystem dirty
blocks

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao<cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-10-10 09:39:00 -04:00
Theodore Ts'o
af5bc92dde ext4: Fix whitespace checkpatch warnings/errors
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 22:25:24 -04:00
Theodore Ts'o
e5f8eab885 ext4: Fix long long checkpatch warnings
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 22:25:04 -04:00
Theodore Ts'o
4776004f54 ext4: Add printk priority levels to clean up checkpatch warnings
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-09-08 23:00:52 -04:00
Aneesh Kumar K.V
91246c0090 ext4: Initialize writeback_index to 0 when allocating a new inode
The write_cache_pages() function uses the mapping->writeback_index as
the starting index to write out when range_cyclic is set.  Properly
initialize writeback_index so that we start the writeout at index 0.

This was found when debugging the small file fragmentation on ext4.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-08-19 21:14:52 -04:00
Linus Torvalds
8f616cd524 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove write-only variables from ext4_ordered_write_end
  ext4: unexport jbd2_journal_update_superblock
  ext4: Cleanup whitespace and other miscellaneous style issues
  ext4: improve ext4_fill_flex_info() a bit
  ext4: Cleanup the block reservation code path
  ext4: don't assume extents can't cross block groups when truncating
  ext4: Fix lack of credits BUG() when deleting a badly fragmented inode
  ext4: Fix ext4_ext_journal_restart()
  ext4: fix ext4_da_write_begin error path
  jbd2: don't abort if flushing file data failed
  ext4: don't read inode block if the buffer has a write error
  ext4: Don't allow lg prealloc list to be grow large.
  ext4: Convert the usage of NR_CPUS to nr_cpu_ids.
  ext4: Improve error handling in mballoc
  ext4: lock block groups when initializing
  ext4: sync up block and inode bitmap reading functions
  ext4: Allow read/only mounts with corrupted block group checksums
  ext4: Fix data corruption when writing to prealloc area
2008-08-03 10:50:44 -07:00
Al Viro
77e69dac3c [PATCH] fix races and leaks in vfs_quota_on() users
* new helper: vfs_quota_on_path(); equivalent of vfs_quota_on() sans the
  pathname resolution.
* callers of vfs_quota_on() that do their own pathname resolution and
  checks based on it are switched to vfs_quota_on_path(); that way we
  avoid the races.
* reiserfs leaked dentry/vfsmount references on several failure exits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-08-01 11:25:25 -04:00
Theodore Ts'o
2b2d6d0197 ext4: Cleanup whitespace and other miscellaneous style issues
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26 16:15:44 -04:00
Alexey Dobriyan
51cc50685a SL*B: drop kmem cache argument from constructor
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres.  Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
	arch/powerpc/mm/init_64.c
	arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26 12:00:07 -07:00
Li Zefan
ec05e868ac ext4: improve ext4_fill_flex_info() a bit
- use kzalloc() instead of kmalloc() + memset()
- improve a printk info

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-24 12:49:59 -04:00
Eric Sandeen
b5f10eed81 ext4: lock block groups when initializing
I noticed when filling a 1T filesystem with 4 threads using the
fs_mark benchmark:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0

that I occasionally got checksum mismatch errors:

EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 6935

etc.  I'd reliably get 4-5 of them during the run.

It appears that the problem is likely a race to init the bg's
when the uninit_bg feature is enabled.

With the patch below, which adds sb_bgl_locking around initialization,
I was able to complete several runs with no errors or warnings.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-08-02 21:21:08 -04:00
Theodore Ts'o
8a266467b8 ext4: Allow read/only mounts with corrupted block group checksums
If the block group checksums are corrupted, still allow the mount to
succeed, so e2fsck can have a chance to try to fix things up.  Add
code in the remount r/w path to make sure the block group checksums
are valid before allowing the filesystem to be remounted read/write.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-26 14:34:21 -04:00
Eric Sandeen
e4079a11f5 ext4: do not set extents feature from the kernel
We've talked for a while about getting rid of any feature-
setting from the kernel; this gets rid of the code which would
set the INCOMPAT_EXTENTS flag on the first file write when mounted
as ext4[dev].

With this patch, if the extents feature is not already set on disk,
then mounting as ext4 will fall back to noextents with a warning,
and if -o extents is explicitly requested, the mount will fail,
also with warning.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
c07651b556 ext4: Don't allow nonextenst mount option for large filesystem
The block mapped inode format can address only blocks within 2**32. This
causes a number of issues, the biggest of which is that the block
allocator needs to be taught that certain inodes can not utilize block
numbers > 2**32.  So until this is fixed, it is simplest to fail
mounting of file systems with more than 2**32 blocks if the -o noextents
option is given.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
dd919b9822 ext4: Enable delalloc by default.
Enable delalloc by default to ensure it gets sufficient testing and
because it makes the filesystem much more efficient.  Add a nodealalloc
option to disable delayed allocation, and update ext4_show_options to
show delayed allocation off if it is disabled.

If the data=journal mount option is used, disable delayed allocation
since the delalloc code doesn't support data=journal yet.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-07-11 19:27:31 -04:00
Mingming Cao
d2a1763791 ext4: delayed allocation ENOSPC handling
This patch does block reservation for delayed
allocation, to avoid ENOSPC later at page flush time.

Blocks(data and metadata) are reserved at da_write_begin()
time, the freeblocks counter is updated by then, and the number of
reserved blocks is store in per inode counter.
        
At the writepage time, the unused reserved meta blocks are returned
back. At unlink/truncate time, reserved blocks are properly released.

Updated fix from  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix the oldallocator block reservation accounting with delalloc, added
lock to guard the counters and also fix the reservation for meta blocks.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2008-07-14 17:52:37 -04:00
Alex Tomas
64769240bd ext4: Add delayed allocation support in data=writeback mode
Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and
release the page from page cache if the delalloc write_begin failed, and
properly handle preallocated blocks.  Also added a fix to clear
buffer_delay in block_write_full_page() after allocating a delayed
buffer.

Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to update i_disksize properly and to add bmap support for delayed
allocation.

Updated with a fix from Valerie Clement <valerie.clement@bull.net> to
avoid filesystem corruption when the filesystem is mounted with the
delalloc option and blocksize < pagesize.

Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by:  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-07-11 19:27:31 -04:00
Jan Kara
678aaf4814 ext4: Use new framework for data=ordered mode in JBD2
This patch makes ext4 use inode-based implementation of data=ordered mode
in JBD2. It allows us to unify some data=ordered and data=writeback paths
(especially writepage since we don't have to start a transaction anymore)
and remove some buffer walking.

Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
to fix file system hang due to corrupt jinode values.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
47b4a50beb ext4: Set journal pointer to NULL when journal is released
Set sbi->s_journal to NULL after we call journal_destroy(). This
will be later needed because after journal_destroy() is called,
ext4_clear_inode() can still be called for some inodes (e.g. root
inode) and we'll need to detect there that journal doesn't exists
anymore.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V
7477827f66 ext4: Fix sparse warning
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Li Zefan
f9a8ac99fd ext4: remove redundant code in ext4_fill_super()
The previous sb_min_blocksize() has already set the block size.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jose R. Santos
772cb7c83b ext4: New inode allocation for FLEX_BG meta-data groups.
This patch mostly controls the way inode are allocated in order to
make ialloc aware of flex_bg block group grouping.  It achieves this
by bypassing the Orlov allocator when block group meta-data are packed
toghether through mke2fs.  Since the impact on the block allocator is
minimal, this patch should have little or no effect on other block
allocation algorithms. By controlling the inode allocation, it can
basically control where the initial search for new block begins and
thus indirectly manipulate the block allocator.

This allocator favors data and meta-data locality so the disk will
gradually be filled from block group zero upward.  This helps improve
performance by reducing seek time.  Since the group of inode tables
within one flex_bg are treated as one giant inode table, uninitialized
block groups would not need to partially initialize as many inode
table as with Orlov which would help fsck time as the filesystem usage
goes up.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Valerie Clement <valerie.clement@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Theodore Ts'o
7ad72ca60b ext4: Remove unused variable from ext4_show_options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-07-11 19:27:31 -04:00
Jan Kara
4d04e4fbf8 ext4: add missing unlock to an error path in ext4_quota_write()
When write in ext4_quota_write() fails, we have to properly release
i_mutex.  One error path has been missing the unlock...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:05 -07:00
Eric Sandeen
571640cad3 ext4: enable barriers by default
I can't think of any valid reason for ext4 to not use barriers when
they are available;  I believe this is necessary for filesystem
integrity in the face of a volatile write cache on storage.

An administrator who trusts that the cache is sufficiently battery-
backed (and power supplies are sufficiently redundant, etc...)
can always turn it back off again.

SuSE has carried such a patch for ext3 for quite some time now.

Also document the mount option while we're at it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 12:29:46 -04:00
Theodore Ts'o
cd0b6a39a1 ext4: Display the journal_async_commit mount option in /proc/mounts
Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-26 10:28:28 -04:00
Theodore Ts'o
624080eded jbd2: If a journal checksum error is detected, propagate the error to ext4
If a journal checksum error is detected, the ext4 filesystem will call
ext4_error(), and the mount will either continue, become a read-only
mount, or cause a kernel panic based on the superblock flags
indicating the user's preference of what to do in case of filesystem
corruption being detected.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-06-06 17:50:40 -04:00
Jan Kara
2c8be6b222 ext4: fix typos in messages and comments (journalled -> journaled)
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 21:27:55 -04:00
Jan Kara
0623543b33 ext4: fix synchronization of quota files in journal=data mode
In journal=data mode, it is not enough to do write_inode_now as done in
vfs_quota_on() to write all data to their final location (which is
needed for quota_read to work correctly).  Calling journal_flush() does
its job.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
cd59e7b978 ext4: Fix mount messages when quota disabled
When quota is disabled, we should not print 'journaled quota not
supported' when user tried to mount non-journaled quota. Also fix typo
in the message.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Jan Kara
dfc5d03f12 ext4: correct mount option parsing to detect when quota options can be changed
We should not allow user to change quota mount options when quota is
just suspended.  It would make mount options and internal quota state
inconsistent.  Also we should not allow user to change quota format when
quota is turned on.  On the other hand we can just silently ignore when
some option is set to the value it already has (mount does this on
remount).

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-05-13 19:11:51 -04:00
Josef Bacik
c19204b0ae ext4: don't use ext4_error in ext4_check_descriptors
Because ext4_check_descriptors is called at mount time you can't use ext4_error
as it calls ext4_commit_sb, which since the sb isn't all the way initialized
causes bad things to happen (ie a panic).  This patch changes the ext4_error's
to printk's to keep this problem from happening.  Thanks much,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:00:28 -04:00
Christoph Hellwig
3dcf54515a ext4: move headers out of include/linux
Move ext4 headers out of include/linux.  This is just the trivial move,
there's some more thing that could be done later. 

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 18:13:32 -04:00
Jan Kara
2887df139c ext4: Fix hang on umount with quotas when journal is aborted
Call dquot_drop() from ext4_dquot_drop() even if we fail to start a
transaction. Otherwise we never get to dropping references to quota structures
from the inode and umount will hang indefinitely.  Thanks to Payphone LIOU for
spotting the problem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
CC: Payphone LIOU <lioupayphone@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:02:07 -04:00
Harvey Harrison
46e665e9d2 ext4: replace remaining __FUNCTION__ occurrences
__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-17 10:38:59 -04:00
Marcin Slusarz
216c34b2b8 ext4: convert byte order of constant instead of variable
Convert byte order of constant instead of variable which can be done at
compile time (vs run time).

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-17 10:38:59 -04:00
Marcin Slusarz
e8546d0615 ext4: le*_add_cpu conversion
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
					expression_in_cpu_byteorder);
with:
	leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
generated with semantic patch

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Cc: sct@redhat.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: adilger@clusterfs.com
Cc: Mingming Cao <cmm@us.ibm.com>
2008-04-17 10:38:59 -04:00
Josef Bacik
f3f12faa74 ext4: fix mount option parsing
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:05:28 -04:00
Josef Bacik
97bd42b9c8 ext4: check return of ext4_orphan_get properly
This patch fix a panic while running fsfuzzer. 
We are improperly checking the return of ext4_orphan_get.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-04-29 22:04:56 -04:00
Jan Kara
6f28e08794 quota: ext4: make ext4 handle quotaon on remount
Update ext4 to handle quotaon on remount RW.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:33 -07:00
Jan Blunck
1d957f9bf8 Introduce path_put()
* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck
4ac9137858 Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.

Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
  <dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
  struct path in every place where the stack can be traversed
- it reduces the overall code size:

without patch series:
   text    data     bss     dec     hex filename
5321639  858418  715768 6895825  6938d1 vmlinux

with patch series:
   text    data     bss     dec     hex filename
5320026  858418  715768 6894212  693284 vmlinux

This patch:

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Theodore Tso
469108ff3d ext4: Add new "development flag" to the ext4 filesystem
This flag is simply a generic "this is a crash/burn test filesystem"
marker.  If it is set, then filesystem code which is "in development"
will be allowed to mount the filesystem.  Filesystem code which is not
considered ready for prime-time will check for this flag, and if it is
not set, it will refuse to touch the filesystem.

As we start rolling ext4 out to distro's like Fedora, et. al, this makes
it less likely that a user might accidentally start using ext4 on a
production filesystem; a bad thing, since that will essentially make it
be unfsckable until e2fsprogs catches up.

Signed-off-by: Theodore Tso <tytso@MIT.EDU>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-02-10 01:11:44 -05:00
David Howells
1d1fe1ee02 iget: stop EXT4 from using iget() and read_inode()
Stop the EXT4 filesystem from using iget() and read_inode().  Replace
ext4_read_inode() with ext4_iget(), and call that instead of iget().
ext4_iget() then uses iget_locked() directly and returns a proper error code
instead of an inode in the event of an error.

ext4_fill_super() returns any error incurred when getting the root inode
instead of EINVAL.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:27 -08:00
Akinobu Mita
197cd65acc ext[234]: use ext[234]_get_group_desc()
Use ext[234]_get_group_desc() to get group descriptor from group number.

Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:21 -08:00
Aneesh Kumar K.V
ce40733ce9 ext4: Check for return value from sb_set_blocksize
sb_set_blocksize validates whether the specfied block size can be used by
the file system. Make sure we fail mounting the file system if the
blocksize specfied cannot be used.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Miklos Szeredi
cb45bbe44b ext4: Add stripe= option to /proc/mounts
Add stripe= option to /proc/mounts for ext4 filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
3dbd0ede4d ext4: Enable the multiblock allocator by default
Enable the multiblock allocator by default.

Fix ext4_show_options() so if it is not enabled, the nomballoc option
included in /proc/mounts.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-01-28 23:58:26 -05:00
Alex Tomas
c9de560ded ext4: Add multi block allocator for ext4
Signed-off-by: Alex Tomas <alex@clusterfs.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2008-01-29 00:19:52 -05:00
Aneesh Kumar K.V
aa22df2cc8 ext4: Fix ext4_show_options to show the correct mount options.
We need to look at the default value and make sure
the mount options are not set via default value
before showing them via ext4_show_options

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Jean Noel Cordenner
25ec56b518 ext4: Add inode version support in ext4
This patch adds 64-bit inode version support to ext4. The lower 32 bits
are stored in the osd1.linux1.l_i_version field while the high 32 bits
are stored in the i_version_hi field newly created in the ext4_inode.
This field is incremented in case the ext4_inode is large enough. A
i_version mount option has been added to enable the feature.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Jean Noel Cordenner <jean-noel.cordenner@bull.net>
2008-01-28 23:58:27 -05:00
Girish Shilamkar
818d276ceb ext4: Add the journal checksum feature
The journal checksum feature adds two new flags i.e
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT and JBD2_FEATURE_COMPAT_CHECKSUM.

JBD2_FEATURE_CHECKSUM flag indicates that the commit block contains the
checksum for the blocks described by the descriptor blocks.
Due to checksums, writing of the commit record no longer needs to be
synchronous. Now commit record can be sent to disk without waiting for
descriptor blocks to be written to disk. This behavior is controlled
using JBD2_FEATURE_ASYNC_COMMIT flag. Older kernels/e2fsck should not be
able to recover the journal with _ASYNC_COMMIT hence it is made
incompat.
The commit header has been extended to hold the checksum along with the
type of the checksum.

For recovery in pass scan checksums are verified to ensure the sanity
and completeness(in case of _ASYNC_COMMIT) of every transaction.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Girish Shilamkar <girish@clusterfs.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
0e855ac8b1 ext4: Convert truncate_mutex to read write semaphore.
We are currently taking the truncate_mutex for every read. This would have
performance impact on large CPU configuration. Convert the lock to read write
semaphore and take read lock when we are trying to read the file.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:26 -05:00
Aneesh Kumar K.V
bb4f397a1a ext4: Change the default behaviour on error
ext4 file system was by default ignoring errors and continuing. This
is not a good default as continuing on error could lead to file system
corruption. Change the default to mark the file system
readonly. Debian and ubuntu already does this as the default in their
fstab.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:26 -05:00
Eric Sandeen
e7c9559300 ext4: fix oops on corrupted ext4 mount
When mounting an ext4 filesystem with corrupted s_first_data_block, things
can go very wrong and oops.

Because blocks_count in ext4_fill_super is a u64, and we must use do_div,
the calculation of db_count is done differently than on ext4.  If
first_data_block is corrupted such that it is larger than ext4_blocks_count,
for example, then the intermediate blocks_count value may go negative,
but sign-extend to a very large value:

        blocks_count = (ext4_blocks_count(es) -
                        le32_to_cpu(es->s_first_data_block) +
                        EXT4_BLOCKS_PER_GROUP(sb) - 1);

This is then assigned to s_groups_count which is an unsigned long:

        sbi->s_groups_count = blocks_count;

This may result in a value of 0xFFFFFFFF which is then used to compute
db_count:

        db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
                   EXT4_DESC_PER_BLOCK(sb);

and in this case db_count will wind up as 0 because the addition overflows
32 bits.  This in turn causes the kmalloc for group_desc to be of 0 size:

        sbi->s_group_desc = kmalloc(db_count * sizeof (struct buffer_head *),
                                    GFP_KERNEL);

and eventually in ext4_check_descriptors, dereferencing
sbi->s_group_desc[desc_block] will result in a NULL pointer dereference.

The simplest test seems to be to sanity check s_first_data_block,
EXT4_BLOCKS_PER_GROUP, and ext4_blocks_count values to be sure
their combination won't result in a bad intermediate value for
blocks_count.  We could just check for db_count == 0, but
catching it at the root cause seems like it provides more info.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Adrian Bunk
07620f69ef ext4/super.c: fix #ifdef's (CONFIG_EXT4_* -> CONFIG_EXT4DEV_*)
Based on a report by Robert P. J. Day.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-01-28 23:58:27 -05:00
Eric Sandeen
e2b4657453 ext4: store maxbytes for bitmapped files and return EFBIG as appropriate
Calculate & store the max offset for bitmapped files, and
catch too-large seeks, truncates, and writes in ext4, shortening
or rejecting as appropriate.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Eric Sandeen
cd2291a463 ext4: different maxbytes functions for bitmap & extent files
use 2 different maxbytes functions for bitmapped & extent-based
files.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
8180a5627d ext4: Support large files
This patch converts ext4_inode i_blocks to represent total
blocks occupied by the inode in file system block size.
Earlier the variable used to represent this in 512 byte
block size. This actually limited the total size of the file.

The feature is enabled transparently when we write an inode
whose i_blocks cannot be represnted as 512 byte units in a
48 bit variable.

inode flag  EXT4_HUGE_FILE_FL

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
0fc1b45147 ext4: Add support for 48 bit inode i_blocks.
Use the __le16 l_i_reserved1 field of the linux2 struct of ext4_inode
to represet the higher 16 bits for i_blocks. With this change max_file
size becomes (2**48 -1 )* 512 bytes.

We add a RO_COMPAT feature to the super block to indicate that inode
have i_blocks represented as a split 48 bits. Super block with this
feature set cannot be mounted read write on a kernel with CONFIG_LSF
disabled.

Super block flag EXT4_FEATURE_RO_COMPAT_HUGE_FILE

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:26 -05:00
Aneesh Kumar K.V
1d03ec984c ext4: Fix sparse warnings.
Fix sparse warnings related to static functions
and local variables.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
99e6f829a8 ext4: Introduce ext4_update_*_feature
Introduce ext4_update_*_feature and use them instead
of opencoding.


Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2008-01-28 23:58:27 -05:00
Avantika Mathur
fd2d42912f ext4: add ext4_group_t, and change all group variables to this type.
In many places variables for block group are of type int, which limits the
maximum number of block groups to 2^31.  Each block group can have up to
2^15 blocks, with a 4K block size,  and the max filesystem size is limited to
2^31 * (2^15 * 2^12) = 2^58  -- or 256 PB

This patch introduces a new type ext4_group_t, of type unsigned long, to
represent block group numbers in ext4.
All occurrences of block group variables are converted to type ext4_group_t.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
2008-01-28 23:58:27 -05:00
Aneesh Kumar K.V
725d26d3f0 ext4: Introduce ext4_lblk_t
This patch adds a new data type ext4_lblk_t to represent
the logical file blocks.

This is the preparatory patch to support large files in ext4
The follow up patch with convert the ext4_inode i_blocks to
represent the number of blocks in file system block size. This
changes makes it possible to have a block number 2**32 -1 which
will result in overflow if the block number is represented by
signed long. This patch convert all the block number to type
ext4_lblk_t which is typedef to __u32

Also remove dead code ext4_ext_walk_space

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2008-01-28 23:58:27 -05:00
Takashi Sato
afc7cbca5b ext4: Support large blocksize up to PAGESIZE
This patch set supports large block size(>4k, <=64k) in ext4,
just enlarging the block size limit. But it is NOT possible to have 64kB
blocksize on ext4 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is treat 64k rec_len
with a an impossible value like rec_len = 0xffff to handle this.

The Patch-set consists of the following 2 patches.
  [1/2]  ext4: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/2]  ext4: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

Now on 64k page ppc64 box runs with this patch set we could create a 64k
block size ext4dev, and able to handle empty directory block.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
2008-01-28 23:58:27 -05:00
Andries E. Brouwer
b47b6f38e5 ext3, ext4: avoid divide by zero
As it turns out, the kernel divides by EXT3_INODES_PER_GROUP(s) when
mounting an ext3 filesystem.  If that number is zero, a crash follows.
Below a patch.

This crash was reported by Joeri de Ruiter, Carst Tankink and Pim Vullers.

Cc: <linux-ext4@vger.kernel.org>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:16 -08:00
Christoph Hellwig
3965516440 exportfs: make struct export_operations const
Now that nfsd has stopped writing to the find_exported_dentry member we an
mark the export_operations const

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: David Chinner <dgc@sgi.com>
Cc: Timothy Shimmin <tes@sgi.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Chris Mason <mason@suse.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: "Vladimir V. Saveliev" <vs@namesys.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22 08:13:21 -07:00
Christoph Hellwig
1b961ac05a ext4: new export ops
Trivial switch over to the new generic helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22 08:13:20 -07:00
Aneesh Kumar K.V
308ba3ece7 ext4: Convert s_r_blocks_count and s_free_blocks_count
Convert s_r_blocks_count and s_free_blocks_count to
s_r_blocks_count_lo and s_free_blocks_count_lo

This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
6bc9feff14 ext4: Convert s_blocks_count to s_blocks_count_lo
Convert s_blocks_count to s_blocks_count_lo
This helps in finding BUGs due to direct partial access of
these split 64 bit values

Also fix direct partial access in ext4 code

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
5272f83727 ext4: Convert bg_inode_bitmap and bg_inode_table
Convert bg_inode_bitmap and bg_inode_table to bg_inode_bitmap_lo
and bg_inode_table_lo.  This helps in finding BUGs due to
direct partial access of these split 64 bit values

Also fix one direct partial access

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:02 -04:00
Aneesh Kumar K.V
3a14589cce ext4: Convert bg_block_bitmap to bg_block_bitmap_lo
Convert bg_block_bitmap to bg_block_bitmap_lo
This helps in catching some BUGS due to direct
partial access of these split fields.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:01 -04:00
Jose R. Santos
ce42158179 ext4: FLEX_BG Kernel support v2.
This feature relaxes check restrictions on where each block groups meta
data is located within the storage media.  This allows for the allocation
of bitmaps or inode tables outside the block group boundaries in cases
where bad blocks forces us to look for new blocks which the owning block
group can not satisfy.  This will also allow for new meta-data allocation
schemes to improve performance and scalability.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-10-17 18:50:01 -04:00
Andreas Dilger
717d50e497 Ext4: Uninitialized Block Groups
In pass1 of e2fsck, every inode table in the fileystem is scanned and checked,
regardless of whether it is in use.  This is this the most time consuming part
of the filesystem check.  The unintialized block group feature can greatly
reduce e2fsck time by eliminating checking of uninitialized inodes.

With this feature, there is a a high water mark of used inodes for each block
group.  Block and inode bitmaps can be uninitialized on disk via a flag in the
group descriptor to avoid reading or scanning them at e2fsck time.  A checksum
of each group descriptor is used to ensure that corruption in the group
descriptor's bit flags does not cause incorrect operation.

The feature is enabled through a mkfs option

	mke2fs /dev/ -O uninit_groups

A patch adding support for uninitialized block groups to e2fsprogs tools has
been posted to the linux-ext4 mailing list.

The patches have been stress tested with fsstress and fsx.  In performance
tests testing e2fsck time, we have seen that e2fsck time on ext3 grows
linearly with the total number of inodes in the filesytem.  In ext4 with the
uninitialized block groups feature, the e2fsck time is constant, based
solely on the number of used inodes rather than the total inode count.
Since typical ext4 filesystems only use 1-10% of their inodes, this feature can
greatly reduce e2fsck time for users.  With performance improvement of 2-20
times, depending on how full the filesystem is.

The attached graph shows the major improvements in e2fsck times in filesystems
with a large total inode count, but few inodes in use.

In each group descriptor if we have

EXT4_BG_INODE_UNINIT set in bg_flags:
        Inode table is not initialized/used in this group. So we can skip
        the consistency check during fsck.
EXT4_BG_BLOCK_UNINIT set in bg_flags:
        No block in the group is used. So we can skip the block bitmap
        verification for this group.

We also add two new fields to group descriptor as a part of
uninitialized group patch.

        __le16  bg_itable_unused;       /* Unused inodes count */
        __le16  bg_checksum;            /* crc16(sb_uuid+group+desc) */

bg_itable_unused:

If we have EXT4_BG_INODE_UNINIT not set in bg_flags
then bg_itable_unused will give the offset within
the inode table till the inodes are used. This can be
used by fsck to skip list of inodes that are marked unused.

bg_checksum:
Now that we depend on bg_flags and bg_itable_unused to determine
the block and inode usage, we need to make sure group descriptor
is not corrupt. We add checksum to group descriptor to
detect corruption. If the descriptor is found to be corrupt, we
mark all the blocks and inodes in the group used.

Signed-off-by: Avantika Mathur <mathur@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2007-10-17 18:50:00 -04:00
Coly Li
f077d0d7ea ext4: Remove (partial, never completed) fragment support
Fragment support in ext2/3/4 was never implemented, and it probably will
never be implemented.   So remove it from ext4.

Signed-off-by: Coly Li <coyli@suse.de>
Acked-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:59 -04:00
Mingming Cao
cd02ff0b14 jbd2: JBD_XXX to JBD2_XXX naming cleanup
change JBD_XXX macros to JBD2_XXX in JBD2/Ext4

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-10-17 18:49:58 -04:00
vignesh babu
d8ea6cf899 ext2/4: use is_power_of_2()
Replace n & (n - 1) with is_power_of_2(n)

Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Miklos Szeredi
d9c9bef134 ext4: show all mount options
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Fengguang Wu
e57aa839ce convert ill defined log2() to ilog2()
It's *wrong* to have
			#define log2(n) ffz(~(n))
It should be *reversed*:
			#define log2(n) flz(~(n))
or
			#define log2(n) fls(n)
or just use
			ilog2(n) defined in linux/log2.h.

This patch follows the last solution, recommended by Andrew Morton.

Cc: <linux-ext4@vger.kernel.org>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Chris Ahna <christopher.j.ahna@intel.com>
Cc: David Mosberger-Tang <davidm@hpl.hp.com>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Christoph Lameter
4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra
833f4077bf lib: percpu_counter_init error handling
alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Peter Zijlstra
52d9f3b409 lib: percpu_counter_sum_positive
s/percpu_counter_sum/&_positive/

Because its consitent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:44 -07:00
Jan Kara
9c3013e9b9 quota: fix infinite loop
If we fail to start a transaction when releasing dquot, we have to call
dquot_release() anyway to mark dquot structure as inactive.  Otherwise we
end in an infinite loop inside dqput().

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: xb <xavier.bru@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:19 -07:00
Eric Sandeen
780dcdb211 fix inode_table test in ext234_check_descriptors
ext[234]_check_descriptors sanity checks block group descriptor geometry at
mount time, testing whether the block bitmap, inode bitmap, and inode table
reside wholly within the blockgroup.  However, the inode table test is off
by one so that if the last block in the inode table resides on the last
block of the block group, the test incorrectly fails.  This is because it
tests the last block as (start + length) rather than (start + length - 1).

This can be seen by trying to mount a filesystem made such as:

 mkfs.ext2 -F -b 1024 -m 0 -g 256 -N 3744 fsfile 1024

which yields:

 EXT2-fs error (device loop0): ext2_check_descriptors: Inode table for group 0 not in group (block 101)!
 EXT2-fs: group descriptors corrupted!

There is a similar bug in e2fsprogs, patch already sent for that.

(I wonder if inside(), outside(), and/or in_range() should someday be
used in this and other tests throughout the ext filesystems...)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26 11:35:17 -07:00
Paul Mundt
20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Vignesh Babu
1330593eb2 ext4: Use is_power_of_2()
Replace (n & (n-1)) in the context of power of 2 checks with
is_power_of_2()

Signed-off-by: Vignesh Babu <vignesh.babu@wipro.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:11:02 -04:00
Kalpak Shah
ef7f38359e ext4: Add nanosecond timestamps
This patch adds nanosecond timestamps for ext4. This involves adding
*time_extra fields to the ext4_inode to extend the timestamps to
64-bits.  Creation time is also added by this patch.

These extended fields will fit into an inode if the filesystem was
formatted with large inodes (-I 256 or larger) and there are currently
no EAs consuming all of the available space. For new inodes we always
reserve enough space for the kernel's known extended fields, but for
inodes created with an old kernel this might not have been the case. So
this patch also adds the EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature
flag(ro-compat so that older kernels can't create inodes with a smaller
extra_isize). which indicates if the fields fitting inside
s_min_extra_isize are available or not.  If the expansion of inodes if
unsuccessful then this feature will be disabled.  This feature is only
enabled if requested by the sysadmin.

None of the extended inode fields is critical for correct filesystem
operation.

Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Kalpak Shah <kalpak@clusterfs.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:15:20 -04:00
Jose R. Santos
eb40a09c67 ext4: Set the journal JBD2_FEATURE_INCOMPAT_64BIT on large devices
Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
than 32bit block sizes during mount time.  This ensure proper record
lenth when writing to the journal.

Signed-off-by: Jose R. Santos <jrs@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 08:37:25 -04:00
Mingming Cao
1e2462f93e ext4: Enable extents by default
Turn on extents feature by default in ext4 filesystem, to get wider
testing of extents feature in ext4dev.  This can be disabled using 
-o noextents.  

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-07-18 09:00:55 -04:00
Christoph Hellwig
a569425512 knfsd: exportfs: add exportfs.h header
currently the export_operation structure and helpers related to it are in
fs.h.  fs.h is already far too large and there are very few places needing the
export bits, so split them off into a separate header.

[akpm@linux-foundation.org: fix cifs build]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:06 -07:00
Badari Pulavarty
5e70030d4c ext4: statfs speed up
This is a patch that speeds up statfs.  It is very simple - the "overhead"
calculation, which takes a huge amount of time for large filesystems, never
changes unless the size of the filesystem itself changes.  That means we can
store it in memory and only recalculate if the filesystem has been resized
(almost never).

It also fixes a minor problem that we never update the on-disk superblock free
blocks/inodes counts until the filesystem is unmounted.  While not fatal, we
may as well update that on disk when we have the information, and it makes
things like debugfs and dumpe2fs report a bit more accurate info.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andreas Dilger <adilger@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:52 -07:00
Borislav Petkov
6c675bd43c ext4: fix error handling in ext4_create_journal
Fix error handling in ext4_create_journal according to kernel conventions.

Signed-off-by: Borislav Petkov <bbpetkov@yahoo.de>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:51 -07:00
Jan Kara
32c3773011 ext4: fix deadlock in ext4_remount() and orphan list handling
ext4_orphan_add() and ext4_orphan_del() functions lock sb->s_lock with a
transaction started with ext4_mark_recovery_complete() waits for a transaction
holding sb->s_lock, thus leading to a possible deadlock.  At the moment we
call ext4_mark_recovery_complete() from ext4_remount() we have done all the
work needed for remounting and thus we are safe to drop sb->s_lock before we
wait for transactions to commit.  Note that at this moment we are still
guarded by s_umount lock against other remounts/umounts.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Sandeen <sandeen@sandeen.net>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:48 -07:00
Vasily Averin
9f7dd93de0 ext3/ext4: orphan list check on destroy_inode
Customers claims to ext3-related errors, investigation showed that ext3
orphan list has been corrupted and have the reference to non-ext3 inode.
The following debug helps to understand the reasons of this issue.

[akpm@linux-foundation.org: update for print_hex_dump() changes]
Signed-off-by: Vasily Averin <vvs@sw.ru>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:46 -07:00
Dave Kleikamp
8c55e20411 EXT4: Fix whitespace
Replace a lot of spaces with tabs

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2007-05-31 16:20:14 -04:00
Christoph Lameter
a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter
50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Peter Zijlstra
f98393a64c mm: remove destroy_dirty_buffers from invalidate_bdev()
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).

find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
	quilt add $file;
	sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
done

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Josef 'Jeff' Sipek
ee9b6d61a2 [PATCH] Mark struct super_operations const
This patch is inspired by Arjan's "Patch series to mark struct
file_operations and struct inode_operations const".

Compile tested with gcc & sparse.

Signed-off-by: Josef 'Jeff' Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-12 09:48:47 -08:00
Hugh Dickins
2e7842b887 [PATCH] fix umask when noACL kernel meets extN tuned for ACLs
Fix insecure default behaviour reported by Tigran Aivazian: if an ext2 or
ext3 or ext4 filesystem is tuned to mount with "acl", but mounted by a
kernel built without ACL support, then umask was ignored when creating
inodes - though root or user has umask 022, touch creates files as 0666,
and mkdir creates directories as 0777.

This appears to have worked right until 2.6.11, when a fix to the default
mode on symlinks (always 0777) assumed VFS applies umask: which it does,
unless the mount is marked for ACLs; but ext[234] set MS_POSIXACL in
s_flags according to s_mount_opt set according to def_mount_opts.

We could revert to the 2.6.10 ext[234]_init_acl (adding an S_ISLNK test);
but other filesystems only set MS_POSIXACL when ACLs are configured.  We
could fix this at another level; but it seems most robust to avoid setting
the s_mount_opt flag in the first place (at the expense of more ifdefs).

Likewise don't set the XATTR_USER flag when built without XATTR support.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: <linux-ext4@vger.kernel.org>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen
ead6596b9e [PATCH] ext4: refuse ro to rw remount of fs with orphan inodes
In the rare case where we have skipped orphan inode processing due to a
readonly block device, and the block device subsequently changes back to
read-write, disallow a remount,rw transition of the filesystem when we have an
unprocessed orphan inodes as this would corrupt the list.

Ideally we should process the orphan inode list during the remount, but that's
trickier, and this plugs the hole for now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: "Stephen C. Tweedie" <sct@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11 10:51:34 -08:00
Eric Sandeen
a8f48a9561 [PATCH] ext3/4: don't do orphan processing on readonly devices
If you do something like:

  # touch foo
  # tail -f foo &
  # rm foo
  # <take snapshot>
  # <mount snapshot>

you'll panic, because ext3/4 tries to do orphan list processing on the
readonly snapshot device, and:

  kernel: journal commit I/O error
  kernel: Assertion failure in journal_flush_Rsmp_e2f189ce() at journal.c:1356: "!journal->j_checkpoint_transactions"
  kernel: Kernel panic: Fatal exception

for a truly readonly underlying device, it's reasonable and necessary
to just skip orphan list processing.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:44 -08:00
Pekka Enberg
960cc398a7 [PATCH] ext4: fsid for statvfs
Update ext4_statfs to return an FSID that is a 64 bit XOR of the 128 bit
filesystem UUID as suggested by Andreas Dilger.  See the following Bugzilla
entry for details:

  http://bugzilla.kernel.org/show_bug.cgi?id=136

Cc: Andreas Dilger <adilger@clusterfs.com>
Cc: Stephen Tweedie <sct@redhat.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:31 -08:00
Christoph Lameter
e18b890bb0 [PATCH] slab: remove kmem_cache_t
Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

	#!/bin/sh
	#
	# Replace one string by another in all the kernel sources.
	#

	set -e

	for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
		quilt add $file
		sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
		mv /tmp/$$ $file
		quilt refresh
	done

The script was run like this

	sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:25 -08:00
Christoph Lameter
e6b4f8da3a [PATCH] slab: remove SLAB_NOFS
SLAB_NOFS is an alias of GFP_NOFS.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07 08:39:23 -08:00