Commit Graph

647 Commits

Author SHA1 Message Date
Phillip Potter
a8867f4e38 ext4: fix memory leak in ext4_mb_init_backend on error path.
Fix a memory leak discovered by syzbot when a file system is corrupted
with an illegally large s_log_groups_per_flex.

Reported-by: syzbot+aa12d6106ea4ca1b6aae@syzkaller.appspotmail.com
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20210412073837.1686-1-phil@philpotter.co.uk
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-05-20 23:29:32 -04:00
Jack Qiu
666245d9a4 ext4: fix trailing whitespace
Made suggested modifications from checkpatch in reference to ERROR:
 trailing whitespace

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20210409042035.15516-1-jack.qiu@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 23:24:52 -04:00
Harshad Shirwadkar
f68f406385 ext4: add proc files to monitor new structures
This patch adds a new file "mb_structs_summary" which allows us to see
the summary of the new allocator structures added in this
series. Here's the sample output of file:

optimize_scan: 1
max_free_order_lists:
        list_order_0_groups: 0
        list_order_1_groups: 0
        list_order_2_groups: 0
        list_order_3_groups: 0
        list_order_4_groups: 0
        list_order_5_groups: 0
        list_order_6_groups: 0
        list_order_7_groups: 0
        list_order_8_groups: 0
        list_order_9_groups: 0
        list_order_10_groups: 0
        list_order_11_groups: 0
        list_order_12_groups: 0
        list_order_13_groups: 40
fragment_size_tree:
        tree_min: 16384
        tree_max: 32768
        tree_nodes: 40

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
196e402adf ext4: improve cr 0 / cr 1 group scanning
Instead of traversing through groups linearly, scan groups in specific
orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
largest free order >= the order of the request. So, with this patch,
we maintain lists for each possible order and insert each group into a
list based on the largest free order in its buddy bitmap. During cr 0
allocation, we traverse these lists in the increasing order of largest
free orders. This allows us to find a group with the best available cr
0 match in constant time. If nothing can be found, we fallback to cr 1
immediately.

At CR1, the story is slightly different. We want to traverse in the
order of increasing average fragment size. For CR1, we maintain a rb
tree of groupinfos which is sorted by average fragment size. Instead
of traversing linearly, at CR1, we traverse in the order of increasing
average fragment size, starting at the most optimal group. This brings
down cr 1 search complexity to log(num groups).

For cr >= 2, we just perform the linear search as before. Also, in
case of lock contention, we intermittently fallback to linear search
even in CR 0 and CR 1 cases. This allows us to proceed during the
allocation path even in case of high contention.

There is an opportunity to do optimization at CR2 too. That's because
at CR2 we only consider groups where bb_free counter (number of free
blocks) is greater than the request extent size. That's left as future
work.

All the changes introduced in this patch are protected under a new
mount option "mb_optimize_scan".

With this patchset, following experiment was performed:

Created a highly fragmented disk of size 65TB. The disk had no
contiguous 2M regions. Following command was run consecutively for 3
times:

time dd if=/dev/urandom of=file bs=2M count=10

Here are the results with and without cr 0/1 optimizations introduced
in this patch:

|---------+------------------------------+---------------------------|
|         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
|---------+------------------------------+---------------------------|
| 1st run | 5m1.871s                     | 2m47.642s                 |
| 2nd run | 2m28.390s                    | 0m0.611s                  |
| 3rd run | 2m26.530s                    | 0m1.255s                  |
|---------+------------------------------+---------------------------|

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
4b68f6df10 ext4: add MB_NUM_ORDERS macro
A few arrays in mballoc.c use the total number of valid orders as
their size. Currently, this value is set as "sb->s_blocksize_bits +
2". This makes code harder to read. So, instead add a new macro
MB_NUM_ORDERS(sb) to make the code more readable.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
a6c75eaf11 ext4: add mballoc stats proc file
Add new stats for measuring the performance of mballoc. This patch is
forked from Artem Blagodarenko's work that can be found here:

https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch

This patch reorganizes the stats by cr level. This is how the output
looks like:

mballoc:
	reqs: 0
	success: 0
	groups_scanned: 0
	cr0_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr1_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
		bad_suggestions: 0
	cr2_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	cr3_stats:
		hits: 0
		groups_considered: 0
		useless_loops: 0
	extents_scanned: 0
		goal_hits: 0
		2^n_hits: 0
		breaks: 0
		lost: 0
	buddies_generated: 0/40
	buddies_time_used: 0
	preallocated: 0
	discarded: 0

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:59 -04:00
Harshad Shirwadkar
67d2518604 ext4: drop s_mb_bal_lock and convert protected fields to atomic
s_mb_buddies_generated gets used later in this patch series to
determine if the cr 0 and cr 1 optimziations should be performed or
not. Currently, s_mb_buddies_generated is protected under a
spin_lock. In the allocation path, it is better if we don't depend on
the lock and instead read the value atomically. In order to do that,
we drop s_bal_lock altogether and we convert the only two protected
fields by it s_mb_buddies_generated and s_mb_generation_time to atomic
type.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-04-09 11:34:58 -04:00
Sabyrzhan Tasbolatov
f91436d55a fs/ext4: fix integer overflow in s_log_groups_per_flex
syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
where sbi->s_mb_prefetch is unsigned integer type.

32 is the maximum allowed power of s_log_groups_per_flex. Following if
check will also trigger UBSAN shift-out-of-bound:

if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {

So I'm checking it against the raw number, perhaps there is another way
to calculate UINT_MAX max power. Also use min_t as to make sure it's
uint type.

[1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
shift exponent 60 is too large for 32-bit type 'int'
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x137/0x1be lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
 ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983

Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-06 11:56:11 -05:00
Chunguang Xu
82ef1370b0 ext4: avoid s_mb_prefetch to be zero in individual scenarios
Commit cfd7323772 ("ext4: add prefetching for block allocation
bitmaps") introduced block bitmap prefetch, and expects to read block
bitmaps of flex_bg through an IO.  However, it seems to ignore the
value range of s_log_groups_per_flex.  In the scenario where the value
of s_log_groups_per_flex is greater than 27, s_mb_prefetch or
s_mb_prefetch_limit will overflow, cause a divide zero exception.

In addition, the logic of calculating nr is also flawed, because the
size of flexbg is fixed during a single mount, but s_mb_prefetch can
be modified, which causes nr to fail to meet the value condition of
[1, flexbg_size].

To solve this problem, we need to set the upper limit of
s_mb_prefetch.  Since we expect to load block bitmaps of a flex_bg
through an IO, we can consider determining a reasonable upper limit
among the IO limit parameters.  After consideration, we chose
BLK_MAX_SEGMENT_SIZE.  This is a good choice to solve divide zero
problem and avoiding performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22 13:08:45 -05:00
Chunguang Xu
cca4155372 ext4: fix a memory leak of ext4_free_data
When freeing metadata, we will create an ext4_free_data and
insert it into the pending free list.  After the current
transaction is committed, the object will be freed.

ext4_mb_free_metadata() will check whether the area to be freed
overlaps with the pending free list. If true, return directly. At this
time, ext4_free_data is leaked.  Fortunately, the probability of this
problem is small, since it only occurs if the file system is corrupted
such that a block is claimed by more one inode and those inodes are
deleted within a single jbd2 transaction.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1604764698-4269-8-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-12-17 13:30:09 -05:00
Chunguang Xu
ce3cca3374 ext4: simplify the code of mb_find_order_for_block
The code of mb_find_order_for_block is a bit obscure, but we can
simplify it with mb_find_buddy(), make the code more concise.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-3-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:42:41 -05:00
Chunguang Xu
6bd97bf273 ext4: remove redundant mb_regenerate_buddy()
After this patch (163a203), if an abnormal bitmap is detected, we
will mark the group as corrupt, and we will not use this group in
the future. Therefore, it should be meaningless to regenerate the
buddy bitmap of this group, It might be better to delete it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1604764698-4269-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03 09:42:39 -05:00
Harshad Shirwadkar
9b5f6c9b83 ext4: make s_mount_flags modifications atomic
Fast commit file system states are recorded in
sbi->s_mount_flags. Fast commit expects these bit manipulations to be
atomic. This patch adds helpers to make those modifications atomic.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06 23:01:05 -05:00
Dan Carpenter
e121bd48b9 ext4: silence an uninitialized variable warning
Smatch complains that "i" can be uninitialized if we don't enter the
loop.  I don't know if it's possible but we may as well silence this
warning.

[ Initialize i to sb->s_blocksize instead of 0.  The only way the for
  loop could be skipped entirely is the in-memory data structures, in
  particular the bh->b_data for the on-disk superblock has gotten
  corrupted enough that calculated value of group is >= to
  ext4_get_groups_count(sb).  In that case, we want to exit
  immediately without allocating a block.  -- TYT ]

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20201030114620.GB3251003@mwanda
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-11-06 22:52:36 -05:00
Harshad Shirwadkar
8016e29f43 ext4: fast commit recovery path
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21 23:22:38 -04:00
Chunguang Xu
addd752cff ext4: make mb_check_counter per group
Make bb_check_counter per group, so each group has the same chance
to be checked, which can expose errors more easily.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-2-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Chunguang Xu
9d1f9b2770 ext4: delete invalid comments near mb_buddy_adjust_border
The comment near mb_buddy_adjust_border seems meaningless, just
clear it.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/1601292995-32205-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:37:13 -04:00
Randy Dunlap
b483bb7719 ext4: delete duplicated words + other fixes
Delete repeated words in fs/ext4/.
{the, this, of, we, after}

Also change spelling of "xttr" in inline.c to "xattr" in 2 places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:13 -04:00
Jan Kara
5b3dc19dda ext4: discard preallocations before releasing group lock
ext4_mb_discard_group_preallocations() can be releasing group lock with
preallocations accumulated on its local list. Thus although
discard_pa_seq was incremented and concurrent allocating processes will
be retrying allocations, it can happen that premature ENOSPC error is
returned because blocks used for preallocations are not available for
reuse yet. Make sure we always free locally accumulated preallocations
before releasing group lock.

Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200924150959.4335-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
Ye Bin
70022da804 ext4: fix dead loop in ext4_mb_new_blocks
As we test disk offline/online with running fsstress, we find fsstress
process is keeping running state.
kworker/u32:3-262   [004] ...1   140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114
....
kworker/u32:3-262   [004] ...1   140.787471: ext4_mb_discard_preallocations: dev 8,32 needed 114

ext4_mb_new_blocks
repeat:
        ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)
                freed = ext4_mb_discard_preallocations
                        ext4_mb_discard_group_preallocations
                                this_cpu_inc(discard_pa_seq);
                ---> freed == 0
                seq_retry = ext4_get_discard_pa_seq_sum
                        for_each_possible_cpu(__cpu)
                                __seq += per_cpu(discard_pa_seq, __cpu);
                if (seq_retry != *seq) {
                        *seq = seq_retry;
                        ret = true;
                }

As we see seq_retry is sum of discard_pa_seq every cpu, if
ext4_mb_discard_group_preallocations return zero discard_pa_seq in this
cpu maybe increase one, so condition "seq_retry != *seq" have always
been met.
Ritesh Harjani suggest to in ext4_mb_discard_group_preallocations function we
only increase discard_pa_seq when there is some PA to free.

Fixes: 07b5b8e1ac ("ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200916113859.1556397-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18 10:36:12 -04:00
brookxu
27bc446e2d ext4: limit the length of per-inode prealloc list
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.

After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:

Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m2.051s
user    0m0.008s
sys     0m2.026s

Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real    0m0.471s
user    0m0.004s
sys     0m0.395s

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19 12:04:36 -04:00
brookxu
66d5e0277e ext4: reorganize if statement of ext4_mb_release_context()
Reorganize the if statement of ext4_mb_release_context(), make it
easier to read.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19 12:04:36 -04:00
brookxu
c55ee7d202 ext4: add mb_debug logging when there are lost chunks
Lost chunks are when some other process raced with the current thread
to grab a particular block allocation.  Add mb_debug log for
developers who wants to see how often this is happening for a
particular workload.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-19 12:04:36 -04:00
Xu Wang
e0d438c72a mballoc: replace seq_printf with seq_puts
seq_puts is a lot cheaper than seq_printf, so use that to print
literal strings.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20200810022158.9167-1-vulab@iscas.ac.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-18 14:21:59 -04:00
brookxu
dddcd2f9eb ext4: optimize the implementation of ext4_mb_good_group()
It might be better to adjust the code in two places:
1. Determine whether grp is currupt or not should be placed first.
2. (cr<=2 && free <ac->ac_g_ex.fe_len)should may belong to the crx
   strategy, and it may be more appropriate to put it in the
   subsequent switch statement block. For cr1, cr2, the conditions
   in switch potentially realize the above judgment. For cr0, we
   should add (free <ac->ac_g_ex.fe_len) judgment, and then delete
   (free / fragments) >= ac->ac_g_ex.fe_len), because cr0 returns
   true by default.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/e20b2d8f-1154-adb7-3831-a9e11ba842e9@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-18 14:18:36 -04:00
brookxu
051e2ce8cb ext4: delete invalid comments near ext4_mb_check_limits()
These comments do not seem to be related to ext4_mb_check_limits(),
it may be invalid.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c49faf0c-d5d5-9c51-6911-9e0ff57c6bfa@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-18 14:15:54 -04:00
brookxu
e9a3cd48d6 ext4: fix typos in ext4_mb_regular_allocator() comment
Fix typos in ext4_mb_regular_allocator() comment

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/d6514145-73b3-808b-ec5a-a8be27c51f9c@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-18 14:14:16 -04:00
Jan Kara
ce9f24cccd ext4: check journal inode extents more carefully
Currently, system zones just track ranges of block, that are "important"
fs metadata (bitmaps, group descriptors, journal blocks, etc.). This
however complicates how extent tree (or indirect blocks) can be checked
for inodes that actually track such metadata - currently the journal
inode but arguably we should be treating quota files or resize inode
similarly. We cannot run __ext4_ext_check() on such metadata inodes when
loading their extents as that would immediately trigger the validity
checks and so we just hack around that and special-case the journal
inode. This however leads to a situation that a journal inode which has
extent tree of depth at least one can have invalid extent tree that gets
unnoticed until ext4_cache_extents() crashes.

To overcome this limitation, track inode number each system zone belongs
to (0 is used for zones not belonging to any inode). We can then verify
inode number matches the expected one when verifying extent tree and
thus avoid the false errors. With this there's no need to to
special-case journal inode during extent tree checking anymore so remove
it.

Fixes: 0a944e8a6c ("ext4: don't perform block validity checks on the journal inode")
Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:36 -04:00
brookxu
9375ac770c ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
Delete the invalid BUGON in ext4_mb_load_buddy_gfp(), the previous
code has already judged whether page is NULL.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/ad68e8a2-5ec3-5beb-537f-f3e53f55367a@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07 14:12:36 -04:00
Theodore Ts'o
3d392b2676 ext4: add prefetch_block_bitmaps mount option
For file systems where we can afford to keep the buddy bitmaps cached,
we can speed up initial writes to large file systems by starting to
load the block allocation bitmaps as soon as the file system is
mounted.  This won't work well for _super_ large file systems, or
memory constrained systems, so we only enable this when it is
requested via a mount option.

Addresses-Google-Bug: 159488342
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2020-08-07 14:12:35 -04:00
Alex Zhuravlev
c1d2c7d47e ext4: skip non-loaded groups at cr=0/1 when scanning for good groups
cr=0 is supposed to be an optimization to save CPU cycles, but if
buddy data (in memory) is not initialized then all this makes no sense
as we have to do sync IO taking a lot of cycles.  Also, at cr=0
mballoc doesn't choose any available chunk.  cr=1 also skips groups
using heuristic based on avg. fragment size.  It's more useful to skip
such groups and switch to cr=2 where groups will be scanned for
available chunks.  However, we always read the first block group in a
flex_bg so metadata blocks will get read into the first flex_bg if
possible.

Using sparse image and dm-slow virtual device of 120TB was
simulated, then the image was formatted and filled using debugfs to
mark ~85% of available space as busy.  mount process w/o the patch
couldn't complete in half an hour (according to vmstat it would take
~10-11 hours).  With the patch applied mount took ~20 seconds.

Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <azhuravlev@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Artem Blagodarenko <artem.blagodarenko@gmail.com>
2020-08-06 01:44:55 -04:00
Alex Zhuravlev
cfd7323772 ext4: add prefetching for block allocation bitmaps
This should significantly improve bitmap loading, especially for flex
groups as it tries to load all bitmaps within a flex.group instead of
one by one synchronously.

Prefetching is done in 8 * flex_bg groups, so it should be 8
read-ahead reads for a single allocating thread. At the end of
allocation the thread waits for read-ahead completion and initializes
buddy information so that read-aheads are not lost in case of memory
pressure.

At cr=0 the number of prefetching IOs is limited per allocation
context to prevent a situation when mballoc loads thousands of bitmaps
looking for a perfect group and ignoring groups with good chunks.

Together with the patch "ext4: limit scanning of uninitialized groups"
the mount time (which includes few tiny allocations) of a 1PB
filesystem is reduced significantly:

               0% full    50%-full unpatched    patched
  mount time       33s                9279s       563s

[ Restructured by tytso; removed the state flags in the allocation
  context, so it can be used to lazily prefetch the allocation bitmaps
  immediately after the file system is mounted.  Skip prefetching
  block groups which are uninitialized.  Finally pass in the
  REQ_RAHEAD flag to the block layer while prefetching. ]

Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06 01:44:48 -04:00
brookxu
3cb77bd241 ext4: fix spelling typos in ext4_mb_initialize_context
Fix spelling typos in ext4_mb_initialize_context.

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/883b523c-58ec-7f38-0bb8-cd2ea4393684@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06 01:38:01 -04:00
Ritesh Harjani
8119853653 ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
Simplify reading a seq variable by directly using this_cpu_read API
instead of doing this_cpu_ptr and then dereferencing it.

This also avoid the below kernel BUG: which happens when
CONFIG_DEBUG_PREEMPT is enabled

BUG: using smp_processor_id() in preemptible [00000000] code: syz-fuzzer/6927
caller is ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
CPU: 1 PID: 6927 Comm: syz-fuzzer Not tainted 5.7.0-next-20200602-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 check_preemption_disabled+0x20d/0x220 lib/smp_processor_id.c:48
 ext4_mb_new_blocks+0xa4d/0x3b70 fs/ext4/mballoc.c:4711
 ext4_ext_map_blocks+0x201b/0x33e0 fs/ext4/extents.c:4244
 ext4_map_blocks+0x4cb/0x1640 fs/ext4/inode.c:626
 ext4_getblk+0xad/0x520 fs/ext4/inode.c:833
 ext4_bread+0x7c/0x380 fs/ext4/inode.c:883
 ext4_append+0x153/0x360 fs/ext4/namei.c:67
 ext4_init_new_dir fs/ext4/namei.c:2757 [inline]
 ext4_mkdir+0x5e0/0xdf0 fs/ext4/namei.c:2802
 vfs_mkdir+0x419/0x690 fs/namei.c:3632
 do_mkdirat+0x21e/0x280 fs/namei.c:3655
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 42f56b7a4a7d ("ext4: mballoc: introduce pcpu seqcnt for freeing PA
to improve ENOSPC handling")
Suggested-by: Borislav Petkov <bp@alien8.de>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: syzbot+82f324bb69744c5f6969@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/534f275016296996f54ecf65168bb3392b6f653d.1591699601.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-11 11:03:26 -04:00
Ritesh Harjani
993778306e ext4: mballoc: use lock for checking free blocks while retrying
Currently while doing block allocation grp->bb_free may be getting
modified if discard is happening in parallel.
For e.g. consider a case where there are lot of threads who have
preallocated lot of blocks and there is a thread which is trying
to discard all of this group's PA. Now it could happen that
we see all of those group's bb_free is zero and fail the allocation
while there is sufficient space if we free up all the PA.

So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set
if we are unable to allocate any blocks in the first try (since we may
not have considered blocks about to be discarded from PA lists).
So during retry attempt to allocate blocks we will use ext4_lock_group()
for checking if the group is good or not.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
8ef123fe02 ext4: mballoc: refactor ext4_mb_good_group()
ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
if in case the EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so should not be called
under a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held
and again called after loading the bitmap, only this time with
ext4_lock_group() held.
But still this whole thing is confusing.

So this patch refactors out ext4_mb_good_group_nolock() which should be
called when without holding ext4_lock_group().
Also in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involves grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/d9f7d031a5fbe1c943fae6bf1ff5cdf0604ae722.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
07b5b8e1ac ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
There could be a race in function ext4_mb_discard_group_preallocations()
where the 1st thread may iterate through group's bb_prealloc_list and
remove all the PAs and add to function's local list head.
Now if the 2nd thread comes in to discard the group preallocations,
it will see that the group->bb_prealloc_list is empty and will return 0.

Consider for a case where we have less number of groups
(for e.g. just group 0),
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(where we call for ext4_mb_discard_group_preallocations()).
But that is wrong, since 2nd thread should have waited for 1st thread
to release all the PAs and should have retried for allocation.
Since 1st thread was anyway going to discard the PAs.

The algorithm using this percpu seq counter goes below:
1. We sample the percpu discard_pa_seq counter before trying for block
   allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
   or free these blocks i.e. while marking those blocks as used/free in
   mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
   that the bb_prealloc_list is not empty and hence proceed for discarding
   of those PAs inside ext4_mb_discard_group_preallocations().

Now to make sure that the regular fast path of block allocation is not
affected, as a small optimization we only sample the percpu seq counter
on that cpu. Only when the block allocation fails and when freed blocks
found were 0, that is when we sample percpu seq counter for all cpus using
below function ext4_get_discard_pa_seq_sum(). This happens after making
sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.

It can be well argued that why don't just check for grp->bb_free to
see if there are any free blocks to be allocated. So here are the two
concerns which were discussed:-

1. If for some reason the blocks available in the group are not
   appropriate for allocation logic (say for e.g.
   EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
   the retry logic may result into infinte looping since grp->bb_free is
   non-zero.

2. Also before preallocation was clubbed with block allocation with the
   same ext4_lock_group() held, there were lot of races where grp->bb_free
   could not be reliably relied upon.
Due to above, this patch considers discard_pa_seq logic to determine if
we should retry for block allocation. Say if there are are n threads
trying for block allocation and none of those could allocate or discard
any of the blocks, then all of those n threads will fail the block
allocation and return -ENOSPC error. (Since the seq counter for all of
those will match as no block allocation/discard was done during that
duration).

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7f254686903b87c419d798742fd9a1be34f0657b.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
cf5e2ca6c9 ext4: mballoc: refactor ext4_mb_discard_preallocations()
Implement ext4_mb_discard_preallocations_should_retry()
which we will need in later patches to add more logic
like check for sequence number match to see if we should
retry for block allocation or not.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/1cfae0098d2aa9afbeb59331401258182868c8f2.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
53f86b170d ext4: mballoc: add blocks to PA list under same spinlock after allocating blocks
ext4_mb_discard_preallocations() only checks for grp->bb_prealloc_list
of every group to discard the group's PA to free up the space if
allocation request fails. Consider below race:-

Process A  				Process B

1. allocate blocks
					1. Fails block allocation from
					     ext4_mb_regular_allocator()
   ext4_lock_group()
	allocated blocks
	more than ac_o_ex.fe_len
   ext4_unlock_group()
					2. Scans the
					   grp->bb_prealloc_list (under
					   ext4_lock_group()) and
					   find nothing and thus return
					   -ENOSPC.

2. Add the additional blocks to PA list

   ext4_lock_group()
   	add blocks to grp->bb_prealloc_list
   ext4_unlock_group()

Above race could be avoided if we add those additional blocks to
grp->bb_prealloc_list at the same time with block allocation when
ext4_lock_group() was still held.
With this discard-PA will know if there are actually any blocks which
could be freed from the PA

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/a2217dd782585b42328981832e6d396abaaccb80.1589955723.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:53 -04:00
Ritesh Harjani
d3df14535f ext4: mballoc: make mb_debug() implementation to use pr_debug()
mb_debug() msg had only 1 control level for all type of msgs.
And if we enable mballoc_debug then all of those msgs would be enabled.
Instead of adding multiple debug levels for mb_debug() msgs, use
pr_debug() with which we could have finer control to print msgs at all
of different levels (i.e. at file, func, line no.).

Also add process name/pid, superblk id, and other info in mb_debug()
msg. This also kills the mballoc_debug module parameter, since it is
not needed any more.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/f0c660cbde9e2edbe95c67942ca9ad80dd2231eb.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
eb2b8ebb86 ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK
Make sure to check for e4b->bd_info->bb_bitmap == NULL, in
mb_cmp_bitmaps() and return if NULL, to avoid possible NULL ptr
dereference. Similar to how we do this in other ifdef DOUBLE_CHECK
functions.

Also remove the BUG_ON() logic if kmalloc() or ext4_read_block_bitmap()
fails. We should simply mark grp->bb_bitmap as NULL if above happens.
In fact ext4_read_block_bitmap() may even return an error in case of resize
ioctl. Hence remove this BUG_ON logic (fstests ext4/032 may trigger
this).

Link: https://lore.kernel.org/r/9a54f8a696ff17c057cd571be3d15ac3ec1407f1.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
a345021553 ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function
This patch implemets mb_group_bb_bitmap_alloc() and
mb_group_bb_bitmap_free() function to remove #ifdef DOUBLE_CHECK macro
and it's related code from inside
ext4_mb_add_groupinfo()/ext4_mb_release().

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8c2095d74b779f0254a19b24982490dc6f07c4f9.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:52 -04:00
Ritesh Harjani
4fca8f0779 ext4: mballoc: make ext4_mb_use_preallocated() return type as bool
Change return type of function ext4_mb_use_preallocated() to bool to
better reflect what this function can return.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/7880cb6ef911465beafefcd7e9c3ea214688744b.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
f283529aba ext4: mballoc: simplify error handling in ext4_init_mballoc()
This patch simplifies error handling logic in ext4_init_mballoc(),
by adding all the cleanups at one place at the end of that function.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8621a7bc68f7107a9ac4292afeb784515333bd25.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
004379d0b0 ext4: mballoc: fix few other format specifier in mb_debug()
Fix few other format specifiers in mb_debug() msgs.
As such no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/574fa7f833abf2dbf3b53a2fea3195e71f6cdbd8.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
36bad4233c ext4: mballoc: correct the mb_debug() format specifier for pa_len var
pa->pa_len is an integer. Fix all of the format specifier used in
mb_debug() for pa_len to %d instead of %u.

As such no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/af4987f643c586f62bcc9961e43f0a67151d5551.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
bbc4ec77e9 ext4: mballoc: add more mb_debug() msgs
This patch adds some more debugging mb_debug() msgs to help improve
mballoc code debugging.
Other than adding more mb_debug() msgs at few more places,
there should be no other functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/5fc8e7788b924e211fcfa4a4c1d2f8503511661a.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
e68cf40c0d ext4: mballoc: refactor ext4_mb_show_ac()
This factors out ext4_mb_show_pa() function to show all the group's
preallocation info. This could be useful info to be added in later
patches.

There should be no functionality change in this patch.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/8f07d890b0038dcc935e9c10e6043ec9f3792721.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Ritesh Harjani
212da3ec6f ext4: mballoc: print bb_free info even when it is 0
Improve the debugging msg by also printing even if bb_free is 0.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/c894f1d1d30f86ae38f4e3a861949665b6dc61cd.1589086800.git.riteshh@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03 23:16:51 -04:00
Theodore Ts'o
907ea529fc ext4: convert BUG_ON's to WARN_ON's in mballoc.c
If the in-core buddy bitmap gets corrupted (or out of sync with the
block bitmap), issue a WARN_ON and try to recover.  In most cases this
involves skipping trying to allocate out of a particular block group.
We can end up declaring the file system corrupted, which is fair,
since the file system probably should be checked before we proceed any
further.

Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu
Google-Bug-Id: 34811296
Google-Bug-Id: 34639169
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-15 23:58:49 -04:00
Theodore Ts'o
54d3adbc29 ext4: save all error info in save_error_info() and drop ext4_set_errno()
Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero.  It's also less racy to set all of
the error information all at once.  (Also, as a bonus, it shrinks code
size slightly.)

Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes: 878520ac45 ("ext4: save the error code which triggered...")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-04-01 17:29:06 -04:00
Dmitry Monakhov
eb5760863f ext4: mark block bitmap corrupted when found instead of BUGON
We already has similar code in ext4_mb_complex_scan_group(), but
ext4_mb_simple_scan_group() still affected.

Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com>
Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-14 14:43:13 -04:00
Madhuparna Bhowmik
92e9c58c56 ext4: use built-in RCU list checking in mballoc
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200213152558.7070-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-03-05 20:24:50 -05:00
Suraj Jitindar Singh
7c990728b9 ext4: fix potential race between s_flex_groups online resizing and access
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.

The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2020-02-21 19:31:46 -05:00
Suraj Jitindar Singh
df3da4ea5a ext4: fix potential race between s_group_info online resizing and access
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
2020-02-21 00:38:12 -05:00
Theodore Ts'o
878520ac45 ext4: save the error code which triggered an ext4_error() in the superblock
This allows the cause of an ext4_error() report to be categorized
based on whether it was triggered due to an I/O error, or an memory
allocation error, or other possible causes.  Most errors are caused by
a detected file system inconsistency, so the default code stored in
the superblock will be EXT4_ERR_EFSCORRUPTED.

Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26 11:28:23 -05:00
Theodore Ts'o
c60990b361 ext4: clean up kerneldoc warnigns when building with W=1
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-06-19 16:30:03 -04:00
Khazhismel Kumykov
4b99faa23c ext4: cond_resched in work-heavy group loops
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2019-04-25 12:58:01 -04:00
Jan Kara
31562b954b ext4: make sanity check in mballoc more strict
The sanity check in mb_find_extent() only checked that returned extent
does not extend past blocksize * 8, however it should not extend past
EXT4_CLUSTERS_PER_GROUP(sb). This can happen when clusters_per_group <
blocksize * 8 and the tail of the bitmap is not properly filled by 1s
which happened e.g. when ancient kernels have grown the filesystem.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2019-04-06 18:33:06 -04:00
Nikolay Borisov
82dd124c40 ext4: replace opencoded i_writecount usage with inode_is_open_for_write()
There is a function which clearly conveys the objective of checking
i_writecount. Additionally the usage in ext4_mb_initialize_context was
wrong, since a node would have wrongfully been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2019-02-10 23:04:16 -05:00
Eric Whitney
9fe671496b ext4: adjust reserved cluster count when removing extents
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree.  Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster.  To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed.  This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:25:08 -04:00
zhong jiang
863c37fcb1 ext4: remove unneeded variable "err" in ext4_mb_release_inode_pa()
The err is not used after initalization. So just remove the variable.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-08-04 17:34:07 -04:00
Jeremy Cline
1a5d5e5d51 ext4: fix spectre gadget in ext4_mb_regular_allocator()
'ac->ac_g_ex.fe_len' is a user-controlled value which is used in the
derivation of 'ac->ac_2order'. 'ac->ac_2order', in turn, is used to
index arrays which makes it a potential spectre gadget. Fix this by
sanitizing the value assigned to 'ac->ac2_order'.  This covers the
following accesses found with the help of smatch:

* fs/ext4/mballoc.c:1896 ext4_mb_simple_scan_group() warn: potential
  spectre issue 'grp->bb_counters' [w] (local cap)

* fs/ext4/mballoc.c:445 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_offsets' [r] (local cap)

* fs/ext4/mballoc.c:446 mb_find_buddy() warn: potential spectre issue
  'EXT4_SB(e4b->bd_sb)->s_mb_maxs' [r] (local cap)

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-08-02 00:03:40 -04:00
Linus Torvalds
70a2dc6abc Bug fixes for ext4; most of which relate to vulnerabilities where a
maliciously crafted file system image can result in a kernel OOPS or
 hang.  At least one fix addresses an inline data bug could be
 triggered by userspace without the need of a crafted file system
 (although it does require that the inline data feature be enabled).
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAltBmcYACgkQ8vlZVpUN
 gaPDJgf/cEa9QuiYTbNOmcOMorK9LEk5XO8qsiJdUVNQtLsHZfl0QowbkF9/F/W5
 andTJzNpFvXeLADMTTjpsDnQ90i8LKD11Kol3dPJcMhJhELtQsjxUBguxpQBP86R
 dvHuCl2/AaqX7rr6Co80yYSinRCquqkzJNhdM5/MLNGziSpkQL3dPSs93rmV+YbU
 8DkUwmhDhoiToLBTLaldrAsAzKvor3uyjNPJ3qhxeE2kXrnuI1V4XfstBGjhVKFB
 /5aYWexDZkL5qiCo+lZnqdITqUnPx3uAkUdBn0dj7V+nDow+/R/8nApvlvJu6usF
 OfMoKr098/pmPAjE5aZ8QpBNVtLFpg==
 =njzR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes for ext4; most of which relate to vulnerabilities where a
  maliciously crafted file system image can result in a kernel OOPS or
  hang.

  At least one fix addresses an inline data bug could be triggered by
  userspace without the need of a crafted file system (although it does
  require that the inline data feature be enabled)"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: check superblock mapped prior to committing
  ext4: add more mount time checks of the superblock
  ext4: add more inode number paranoia checks
  ext4: avoid running out of journal credits when appending to an inline file
  jbd2: don't mark block as modified if the handle is out of credits
  ext4: never move the system.data xattr out of the inode body
  ext4: clear i_data in ext4_inode_info when removing inline data
  ext4: include the illegal physical block in the bad map ext4_error msg
  ext4: verify the depth of extent tree in ext4_find_extent()
  ext4: only look at the bg_flags field if it is valid
  ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
  ext4: always check block group bounds in ext4_init_block_bitmap()
  ext4: always verify the magic number in xattr blocks
  ext4: add corruption check in ext4_xattr_set_entry()
  ext4: add warn_on_error mount option
2018-07-08 11:10:30 -07:00
Theodore Ts'o
8844618d8a ext4: only look at the bg_flags field if it is valid
The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled.  We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up.  Check for these conditions and mark the
file system as corrupted if they are detected.

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-06-14 00:58:00 -04:00
Linus Torvalds
1434763ca5 A lot of cleanups and bug fixes, especially dealing with corrupted
file systems.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlsWmXcACgkQ8vlZVpUN
 gaMGmAf+JGK4XysAvlJuj9tJfFPHHwgXSBBe/GAgyjhW9XhtNRHprUM2SpvwpIdI
 Isl5O8Ec+FywBJB0I4AGy6yds6DE6jn38FFRFEhVmkp4EoROJiIr8+a7spfVuC3m
 cWrHBgc7FwK4qYlyuGtH2+6NYva+KNFr+wwbvvUusvldyZAWMzflfrcdHM6D+/JE
 +Sd5I7aniqnP5fICq0b4xrP2zWO4XJEKMbZO2dJ9yRtMmUnbaSj6G+bTGDRyfrNk
 L3wJhqIu93o7zjDaEC0UfXSLAXzoDGWHeq7fBssaJiXj/hNtAvAGPaRMbgFR9a3h
 uHmhvf84iyJuM+8GgG25UqeGwCuWiA==
 =b0VQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A lot of cleanups and bug fixes, especially dealing with corrupted
  file systems"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: fix fencepost error in check for inode count overflow during resize
  ext4: correctly handle a zero-length xattr with a non-zero e_value_offs
  ext4: bubble errors from ext4_find_inline_data_nolock() up to ext4_iget()
  ext4: do not allow external inodes for inline data
  ext4: report delalloc reserve as non-free in statfs for project quota
  ext4: remove NULL check before calling kmem_cache_destroy()
  jbd2: remove NULL check before calling kmem_cache_destroy()
  jbd2: remove bunch of empty lines with jbd2 debug
  ext4: handle errors on ext4_commit_super
  ext4: do not update s_last_mounted of a frozen fs
  ext4: factor out helper ext4_sample_last_mounted()
  vfs: add the sb_start_intwrite_trylock() helper
  ext4: update mtime in ext4_punch_hole even if no blocks are released
  ext4: add verifier check for symlink with append/immutable flags
  fs: ext4: add new return type vm_fault_t
  ext4: fix hole length detection in ext4_ind_map_blocks()
  ext4: mark block bitmap corrupted when found
  ext4: mark inode bitmap corrupted when found
  ext4: add new ext4_mark_group_bitmap_corrupted() helper
  ext4: fix wrong return value in ext4_read_inode_bitmap()
  ...
2018-06-05 12:49:17 -07:00
Sean Fu
21c580d88e ext4: remove NULL check before calling kmem_cache_destroy()
Signed-off-by: Sean Fu <fxinrong@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-05-20 22:44:13 -04:00
Christoph Hellwig
247dbed8c9 ext4: simplify procfs code
Use remove_proc_subtree to remove the whole subtree on cleanup, and
unwind the registration loop into individual calls.  Switch to use
proc_create_seq where applicable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Wang Shilong
736dedbb1a ext4: mark block bitmap corrupted when found
There are still some cases that we missed to set
block bitmaps corrupted bit properly:

1) block bitmap number is wrong.
2) failed to read block bitmap due to disk errors.
3) double free block bitmaps..
4) some mismatch check with bitmaps vs buddy information.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-05-12 12:37:58 -04:00
Wang Shilong
db79e6d1fb ext4: add new ext4_mark_group_bitmap_corrupted() helper
Since there are many places to set inode/block bitmap
corrupt bit, add a new helper for it, which will make
codes more clear.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2018-05-12 11:39:40 -04:00
Jun Piao
49598e04b5 ext4: use 'sbi' instead of 'EXT4_SB(sb)'
We could use 'sbi' instead of 'EXT4_SB(sb)' to make code more elegant.

Signed-off-by: Jun Piao <piaojun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-01-11 13:17:49 -05:00
Theodore Ts'o
f516676857 ext4: fix up remaining files with SPDX cleanups
A number of ext4 source files were skipped due because their copyright
permission statements didn't match the expected text used by the
automated conversion utilities.  I've added SPDX tags for the rest.

While looking at some of these files, I've noticed that we have quite
a bit of variation on the licenses that were used --- in particular
some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
and we have some files that have a LGPL-2.1 license (which was quite
surprising).

I've not attempted to do any license changes.  Even if it is perfectly
legal to relicense to GPL 2.0-only for consistency's sake, that should
be done with ext4 developer community discussion.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-12-17 22:00:59 -05:00
harshads
d77147ff44 ext4: add support for online resizing with bigalloc
This patch adds support for online resizing on bigalloc file system by
implementing EXT4_IOC_RESIZE_FS ioctl. Old resize interfaces (add
block groups and extend last block group) are left untouched. Tests
performed with cluster sizes of 1, 2, 4 and 8 blocks (of size 4k) per
cluster. I will add these tests to xfstests.

Signed-off-by: Harshad Shirwadkar <harshads@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-10-29 09:38:46 -04:00
Theodore Ts'o
b80b32b6d5 ext4: fix clang build regression
Arnd Bergmann <arnd@arndb.de>

As Stefan pointed out, I misremembered what clang can do specifically,
and it turns out that the variable-length array at the end of the
structure did not work (a flexible array would have worked here
but not solved the problem):

fs/ext4/mballoc.c:2303:17: error: fields must have a constant size:
'variable length array in structure' extension will never be supported
                ext4_grpblk_t counters[blocksize_bits + 2];

This reverts part of my previous patch, using a fixed-size array
again, but keeping the check for the array overflow.

Fixes: 2df2c3402f ("ext4: fix warning about stack corruption")
Reported-by: Stefan Agner <stefan@agner.ch>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-14 08:29:18 -04:00
Arnd Bergmann
2df2c3402f ext4: fix warning about stack corruption
After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"),
we get a warning about possible stack overflow from a memcpy that
was not strictly bounded to the size of the local variable:

    inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2:
include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=]

We actually had a bug here that would have been found by the warning,
but it was already fixed last year in commit 30a9d7afe7 ("ext4: fix
stack memory corruption with 64k block size").

This replaces the fixed-length structure on the stack with a variable-length
structure, using the correct upper bound that tells the compiler that
everything is really fine here. I also change the loop count to check
for the same upper bound for consistency, but the existing code is
already correct here.

Note that while clang won't allow certain kinds of variable-length arrays
in structures, this particular instance is fine, as the array is at the
end of the structure, and the size is strictly bounded.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-08-05 21:57:46 -04:00
Daeho Jeong
e45105772d ext4: release discard bio after sending discard commands
We've changed the discard command handling into parallel manner.
But, in this change, I forgot decreasing the usage count of the bio
which was used to send discard request. I'm sorry about that.

Fixes: a015434480 ("ext4: send parallel discards on commit completions")
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-08-05 13:11:57 -04:00
Colin Ian King
ff95015648 ext4: fix spelling mistake: "prellocated" -> "preallocated"
Trivial fix to spelling mistake in mb_debug debug message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-07-06 15:28:45 -04:00
Daeho Jeong
a015434480 ext4: send parallel discards on commit completions
Now, when we mount ext4 filesystem with '-o discard' option, we have to
issue all the discard commands for the blocks to be deallocated and
wait for the completion of the commands on the commit complete phase.
Because this procedure might involve a lot of sequential combinations of
issuing discard commands and waiting for that, the delay of this
procedure might be too much long, even to 17.0s in our test,
and it results in long commit delay and fsync() performance degradation.

To reduce this kind of delay, instead of adding callback for each
extent and handling all of them in a sequential manner on commit phase,
we instead add a separate list of extents to free to the superblock and
then process this list at once after transaction commits so that
we can issue all the discard commands in a parallel manner like XFS
filesystem.

Finally, we could enhance the discard command handling performance.
The result was such that 17.0s delay of a single commit in the worst
case has been enhanced to 4.8s.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Tested-by: Kitae Lee <kitae87.lee@samsung.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-06-22 23:54:33 -04:00
Tahsin Erdogan
02749a4c20 ext4: add ext4_is_quota_file()
IS_NOQUOTA() indicates whether quota is disabled for an inode. Ext4
also uses it to check whether an inode is for a quota file. The
distinction currently doesn't matter because quota is disabled only
for the quota files. When we start disabling quota for other inodes
in the future, we will want to make the distinction clear.

Replace IS_NOQUOTA() call with ext4_is_quota_file() at places where
we are checking for quota files.

Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-22 11:31:25 -04:00
Konstantin Khlebnikov
9651e6b2e2 ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
I've got another report about breaking ext4 by ENOMEM error returned from
ext4_mb_load_buddy() caused by memory shortage in memory cgroup.
This time inside ext4_discard_preallocations().

This patch replaces ext4_error() with ext4_warning() where errors returned
from ext4_mb_load_buddy() are not fatal and handled by caller:
* ext4_mb_discard_group_preallocations() - called before generating ENOSPC,
  we'll try to discard other group or return ENOSPC into user-space.
* ext4_trim_all_free() - just stop trimming and return ENOMEM from ioctl.

Some callers cannot handle errors, thus __GFP_NOFAIL is used for them:
* ext4_discard_preallocations()
* ext4_mb_discard_lg_preallocations()

Fixes: adb7ef600c ("ext4: use __GFP_NOFAIL in ext4_free_blocks()")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:35:23 -04:00
Linus Torvalds
bf5f89463f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - various misc things

 - procfs updates

 - lib/ updates

 - checkpatch updates

 - kdump/kexec updates

 - add kvmalloc helpers, use them

 - time helper updates for Y2038 issues. We're almost ready to remove
   current_fs_time() but that awaits a btrfs merge.

 - add tracepoints to DAX

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
  drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
  selftests/vm: add a test for virtual address range mapping
  dax: add tracepoint to dax_insert_mapping()
  dax: add tracepoint to dax_writeback_one()
  dax: add tracepoints to dax_writeback_mapping_range()
  dax: add tracepoints to dax_load_hole()
  dax: add tracepoints to dax_pfn_mkwrite()
  dax: add tracepoints to dax_iomap_pte_fault()
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mm: introduce memalloc_noreclaim_{save,restore}
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
  mm/huge_memory.c: use zap_deposited_table() more
  time: delete CURRENT_TIME_SEC and CURRENT_TIME
  gfs2: replace CURRENT_TIME with current_time
  apparmorfs: replace CURRENT_TIME with current_time()
  lustre: replace CURRENT_TIME macro
  fs: ubifs: replace CURRENT_TIME_SEC with current_time
  fs: ufs: use ktime_get_real_ts64() for birthtime
  ...
2017-05-08 18:17:56 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Darrick J. Wong
0c9ec4beec ext4: support GETFSMAP ioctls
Support the GETFSMAP ioctls so that we can use the xfs free space
management tools to probe ext4 as well.  Note that this is a partial
implementation -- we only report fixed-location metadata and free space;
everything else is reported as "unknown".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 00:36:53 -04:00
Eric Biggers
d600618673 ext4: constify static data that is never modified
Constify static data in ext4 that is never (intentionally) modified so
that it is placed in .rodata and benefits from memory protection.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-29 23:47:50 -04:00
Fabian Frederick
93407472a2 fs: add i_blocksize()
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.

This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'

Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.

[geliangtang@gmail.com: truncate: use i_blocksize()]
  Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Jan Kara
d9b22cf9f5 ext4: fix stripe-unaligned allocations
When a filesystem is created using:

	mkfs.ext4 -b 4096 -E stride=512 <dev>

and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().

Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-10 00:50:56 -05:00
Jan Kara
cd648b8a8f ext4: trim allocation requests to group size
If filesystem groups are artifically small (using parameter -g to
mkfs.ext4), ext4_mb_normalize_request() can result in a request that is
larger than a block group. Trim the request size to not confuse
allocation code.

Reported-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-01-27 14:34:30 -05:00
Theodore Ts'o
43c73221b3 ext4: replace BUG_ON with WARN_ON in mb_find_extent()
The last BUG_ON in mb_find_extent() is apparently triggering in some
rare cases.  Most of the time it indicates a bug in the buddy bitmap
algorithms, but there are some weird cases where it can trigger when
buddy bitmap is still in memory, but the block bitmap has to be read
from disk, and there is disk or memory corruption such that the block
bitmap and the buddy bitmap are out of sync.

Google-Bug-Id: #33702157

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-22 19:35:52 -05:00
Chandan Rajendra
30a9d7afe7 ext4: fix stack memory corruption with 64k block size
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2016-11-14 21:26:26 -05:00
Chandan Rajendra
69e43e8cc9 ext4: fix mballoc breakage with 64k block size
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.

Fixes: c9de560ded
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2016-11-14 21:04:37 -05:00
Linus Torvalds
6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Vegard Nossum
554a5ccc4e ext4: fix reference counting bug on block allocation error
If we hit this error when mounted with errors=continue or
errors=remount-ro:

    EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata

then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
continue. However, ext4_mb_release_context() is the wrong thing to call
here since we are still actually using the allocation context.

Instead, just error out. We could retry the allocation, but there is a
possibility of getting stuck in an infinite loop instead, so this seems
safer.

[ Fixed up so we don't return EAGAIN to userspace. --tytso ]

Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation")
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
2016-07-14 23:02:47 -04:00
Theodore Ts'o
d08854f5bc ext4: optimize ext4_should_retry_alloc() to improve ENOSPC performance
If there are no pending blocks to be released after a commit, forcing
a journal commit has no hope of helping.  It's possible that a commit
had just completed, so if there are now free blocks available for
allocation, it's worth retrying the commit.

Reported-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-06-26 18:24:01 -04:00
Al Viro
84c60b1388 drop redundant ->owner initializations
it's not needed for file_operations of inodes located on fs defined
in the hosting module and for file_operations that go into procfs.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-29 19:08:00 -04:00
Nicolai Stange
935244cd54 ext4: silence UBSAN in ext4_mb_init()
Currently, in ext4_mb_init(), there's a loop like the following:

  do {
    ...
    offset += 1 << (sb->s_blocksize_bits - i);
    i++;
  } while (i <= sb->s_blocksize_bits + 1);

Note that the updated offset is used in the loop's next iteration only.

However, at the last iteration, that is at i == sb->s_blocksize_bits + 1,
the shift count becomes equal to (unsigned)-1 > 31 (c.f. C99 6.5.7(3))
and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:2621:15
  shift exponent 4294967295 is too large for 32-bit type 'int'
  [...]
  Call Trace:
   [<ffffffff818c4d25>] dump_stack+0xbc/0x117
   [<ffffffff818c4c69>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411ab>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cac>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ab1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff814b6dc1>] ? kmem_cache_alloc+0x101/0x390
   [<ffffffff816fc13b>] ? ext4_mb_init+0x13b/0xfd0
   [<ffffffff814293c7>] ? create_cache+0x57/0x1f0
   [<ffffffff8142948a>] ? create_cache+0x11a/0x1f0
   [<ffffffff821c2168>] ? mutex_lock+0x38/0x60
   [<ffffffff821c23ab>] ? mutex_unlock+0x1b/0x50
   [<ffffffff814c26ab>] ? put_online_mems+0x5b/0xc0
   [<ffffffff81429677>] ? kmem_cache_create+0x117/0x2c0
   [<ffffffff816fcc49>] ext4_mb_init+0xc49/0xfd0
   [...]

Observe that the mentioned shift exponent, 4294967295, equals (unsigned)-1.

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of offset is never used again.

Silence UBSAN by introducing another variable, offset_incr, holding the
next increment to apply to offset and adjust that one by right shifting it
by one position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 19:46:19 -04:00
Nicolai Stange
b5cb316cdf ext4: address UBSAN warning in mb_find_order_for_block()
Currently, in mb_find_order_for_block(), there's a loop like the following:

  while (order <= e4b->bd_blkbits + 1) {
    ...
    bb += 1 << (e4b->bd_blkbits - order);
  }

Note that the updated bb is used in the loop's next iteration only.

However, at the last iteration, that is at order == e4b->bd_blkbits + 1,
the shift count becomes negative (c.f. C99 6.5.7(3)) and UBSAN reports

  UBSAN: Undefined behaviour in fs/ext4/mballoc.c:1281:11
  shift exponent -1 is negative
  [...]
  Call Trace:
   [<ffffffff818c4d35>] dump_stack+0xbc/0x117
   [<ffffffff818c4c79>] ? _atomic_dec_and_lock+0x169/0x169
   [<ffffffff819411bb>] ubsan_epilogue+0xd/0x4e
   [<ffffffff81941cbc>] __ubsan_handle_shift_out_of_bounds+0x1fb/0x254
   [<ffffffff81941ac1>] ? __ubsan_handle_load_invalid_value+0x158/0x158
   [<ffffffff816e93a0>] ? ext4_mb_generate_from_pa+0x590/0x590
   [<ffffffff816502c8>] ? ext4_read_block_bitmap_nowait+0x598/0xe80
   [<ffffffff816e7b7e>] mb_find_order_for_block+0x1ce/0x240
   [...]

Unless compilers start to do some fancy transformations (which at least
GCC 6.0.0 doesn't currently do), the issue is of cosmetic nature only: the
such calculated value of bb is never used again.

Silence UBSAN by introducing another variable, bb_incr, holding the next
increment to apply to bb and adjust that one by right shifting it by one
position per loop iteration.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=114701
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112161

Cc: stable@vger.kernel.org
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-05 17:38:03 -04:00
Jakub Wilk
8d2ae1cbe8 ext4: remove trailing \n from ext4_warning/ext4_error calls
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-27 01:11:21 -04:00
Kirill A. Shutemov
ea1754a084 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing
outdated comments.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Konstantin Khlebnikov
adb7ef600c ext4: use __GFP_NOFAIL in ext4_free_blocks()
This might be unexpected but pages allocated for sbi->s_buddy_cache are
charged to current memory cgroup. So, GFP_NOFS allocation could fail if
current task has been killed by OOM or if current memory cgroup has no
free memory left. Block allocator cannot handle such failures here yet.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-13 17:29:06 -04:00
Adam Buchbinder
b8a07463c8 ext4: fix misspellings in comments.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:49:05 -05:00
Daeho Jeong
f96c450dab ext4: make sure to revoke all the freeable blocks in ext4_free_blocks
Now, ext4_free_blocks() doesn't revoke data blocks of per-file data
journalled inode and it can cause file data inconsistency problems.
Even though data blocks of per-file data journalled inode are already
forgotten by jbd2_journal_invalidatepage() in advance of invoking
ext4_free_blocks(), we still need to revoke the data blocks here.
Moreover some of the metadata blocks, which are not found by
sb_find_get_block(), are still needed to be revoked, but this is also
missing here.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-02-21 18:31:41 -05:00
Huaitong Han
802cf1f9f5 ext4: add a line break for proc mb_groups display
This patch adds a line break for proc mb_groups display.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-02-12 00:17:16 -05:00
Andrew Morton
79211c8ed1 remove abs64()
Switch everything to the new and more capable implementation of abs().
Mainly to give the new abs() a bit of a workout.

Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-09 15:11:24 -08:00
John Stultz
16175039e6 ext4: fix abs() usage in ext4_mb_check_group_pa
The ext4_fsblk_t type is a long long, which should not be used
with abs(), as is done in ext4_mb_check_group_pa().

This patch modifies ext4_mb_check_group_pa() to use abs64()
instead.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-19 00:01:05 -04:00
Daeho Jeong
9c02ac9798 ext4: fix xfstest generic/269 double revoked buffer bug with bigalloc
When you repeatly execute xfstest generic/269 with bigalloc_1k option
enabled using the below command:

"./kvm-xfstests -c bigalloc_1k -m nodelalloc -C 1000 generic/269"

you can easily see the below bug message.

"JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);"

This means that an already revoked buffer is erroneously revoked again
and it is caused by doing revoke for the buffer at the wrong position
in ext4_free_blocks(). We need to re-position the buffer revoke
procedure for an unspecified buffer after checking the cluster boundary
for bigalloc option. If not, some part of the cluster can be doubly
revoked.

Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
2015-10-17 22:28:21 -04:00
Darrick J. Wong
9008a58e5d ext4: make the bitmap read routines return real error codes
Make the bitmap reaading routines return real error codes (EIO,
EFSCORRUPTED, EFSBADCRC) which can then be reflected back to
userspace for more precise diagnosis work.

In particular, this means that mballoc no longer claims that we're out
of memory if the block bitmaps become corrupt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-10-17 21:33:24 -04:00
Theodore Ts'o
ebd173beb8 ext4: move procfs registration code to fs/ext4/sysfs.c
This allows us to refactor the procfs code, which saves a bit of
compiled space.  More importantly it isolates most of the procfs
support code into a single file, so it's easier to #ifdef it out if
the proc file system has been disabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-09-23 12:46:17 -04:00
Linus Torvalds
1c4c7159ed Bug fixes (all for stable kernels) for ext4:
* address corner cases for indirect blocks->extent migration
   * fix reserved block accounting invalidate_page when
 	page_size != block_size (i.e., ppc or 1k block size file systems)
   * fix deadlocks when a memcg is under heavy memory pressure
   * fix fencepost error in lazytime optimization
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJVmW27AAoJEPL5WVaVDYGjmEkIAJsGHVIKur1Kp//FhejSB/wI
 B0d+UuQt5kdAE3lNxC7lHO1NqIhvnS7eBho+52LG8V4JDRrzTbE1GdbsBhAIk6FW
 CcsQvsHAI99QJMdqOCachu/+nhCwIINGkxmbumhNaZoJPn6wmGQzCA3Cn5qmnGnK
 Ctbk6li1HuMXyzbbvxCLfaD/xCUs1NCdufEnRU44i0U4OfaYNpiAhddeGIQ8WMEQ
 G14l2JvhIfye6fG8lnCzfacFvnT9zvvSGfRO3ZQjC4Az1EogIUbhCPLvq0ebDbPp
 i4eRfrSRdXmMojqmW/knET8skXQVZVnD7LWuvkue+n47UbTH2c0roTbp4l76W+U=
 =x8Cc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Bug fixes (all for stable kernels) for ext4:

   - address corner cases for indirect blocks->extent migration

   - fix reserved block accounting invalidate_page when
     page_size != block_size (i.e., ppc or 1k block size file systems)

   - fix deadlocks when a memcg is under heavy memory pressure

   - fix fencepost error in lazytime optimization"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: replace open coded nofail allocation in ext4_free_blocks()
  ext4: correctly migrate a file with a hole at the beginning
  ext4: be more strict when migrating to non-extent based file
  ext4: fix reservation release on invalidatepage for delalloc fs
  ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp
  bufferhead: Add _gfp version for sb_getblk()
  ext4: fix fencepost error in lazytime optimization
2015-07-05 16:24:54 -07:00
Michal Hocko
7444a072c3 ext4: replace open coded nofail allocation in ext4_free_blocks()
ext4_free_blocks is looping around the allocation request and mimics
__GFP_NOFAIL behavior without any allocation fallback strategy. Let's
remove the open coded loop and replace it with __GFP_NOFAIL. Without the
flag the allocator has no way to find out never-fail requirement and
cannot help in any way.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2015-07-05 12:33:44 -04:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Rasmus Villemoes
97b4af2f76 ext4: mballoc: avoid 20-argument function call
Making a function call with 20 arguments is rather expensive in both
stack and .text. In this case, doing the formatting manually doesn't
make it any less readable, so we might as well save 155 bytes of .text
and 112 bytes of stack.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
2015-06-15 00:32:58 -04:00
Lukas Czerner
42ac1848ea ext4: return error code from ext4_mb_good_group()
Currently ext4_mb_good_group() only returns 0 or 1 depending on whether
the allocation group is suitable for use or not. However we might get
various errors and fail while initializing new group including -EIO
which would never get propagated up the call chain. This might lead to
an endless loop at writeback when we're trying to find a good group to
allocate from and we fail to initialize new group (read error for
example).

Fix this by returning proper error code from ext4_mb_good_group() and
using it in ext4_mb_regular_allocator(). In ext4_mb_regular_allocator()
we will always return only the first occurred error from
ext4_mb_good_group() and we only propagate it back  to the caller if we
do not get any other errors and we fail to allocate any blocks.

Note that with other modes than errors=continue, we will fail
immediately in ext4_mb_good_group() in case of error, however with
errors=continue we should try to continue using the file system, that's
why we're not going to fail immediately when we see an error from
ext4_mb_good_group(), but rather when we fail to find a suitable block
group to allocate from due to an problem in group initialization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2015-06-08 11:40:40 -04:00
Lukas Czerner
bbdc322f2c ext4: try to initialize all groups we can in case of failure on ppc64
Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.

Fix this by reading all the bitmaps instead of error out on the first
problem and simply skip the bitmaps which were either not read properly,
or are not valid.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-06-08 11:38:37 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Markus Elfring
bfcba2d035 ext4: Remove an unnecessary check for NULL before iput()
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:01:37 -05:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Al Viro
b93b41d4c7 ext4: kill ext4_kvfree()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-20 12:19:11 -05:00
Linus Torvalds
c2661b8060 A large number of cleanups and bug fixes, with some (minor) journal
optimizations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUPlLCAAoJENNvdpvBGATwpN8P/jnbDL1RqM9ZEAWfbDhvYumR
 Fi59b3IDzSJHuuJeP0nTblVbbWclpO9ljCd18ttsHr8gBXA0ViaEU0XvWbpHIwPN
 1fr1/Ovd0wvBdIVdLlaLXTR9skH4lbkiXxv/tkfjVCOSpzqiKID98Z72e/gUjB7Z
 8xjAn/mTCnXKnhqMGzi8RC2MP1wgY//ErR21bj6so/8RC8zu4P6JuVj/hI6s0y5i
 IPtAmjhdM7nxnS0wJwj7dLT0yNDftDh69qE6CgIwyK+Xn/SZFgYwE6+l02dj3DET
 ZcAzTT9ToTMJdWtMu+5Y4LY8ObJ5xqMPbMoUclQ3DWe6nZicvtcBVCjfG/J8pFlY
 IFD0nfh/OpX9cQMwJ+5Y8P4TrMiqM+FfuLfu+X83gLyrAyIazwoaZls2lxlEyC0w
 M25oAqeKGUeVakVlmDZlVyBf05cu5m62x1rRvpcwMXMNhJl8/xwsSdhdYGeJfbO0
 0MfL1n6GmvHvouMXKNsXlat/w3QVaQWVRzqdF9x7Q730fSHC/zxVGO+Po3jz2fBd
 fBdfE14BIIU7nkyBVy0CZG5SDmQW4YACocOv/ATmII9j76F9eZQ3zsA8J1x+dLmJ
 dP1Uxvsn1C3HW8Ua239j0XUJncglb06iEId0ywdkmWcc1rbzsyZ/NzXN/QBdZmqB
 9g4GKAXAyh15PeBTJ5K/
 =vWic
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "A large number of cleanups and bug fixes, with some (minor) journal
  optimizations"

[ This got sent to me before -rc1, but was stuck in my spam folder.   - Linus ]

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (67 commits)
  ext4: check s_chksum_driver when looking for bg csum presence
  ext4: move error report out of atomic context in ext4_init_block_bitmap()
  ext4: Replace open coded mdata csum feature to helper function
  ext4: delete useless comments about ext4_move_extents
  ext4: fix reservation overflow in ext4_da_write_begin
  ext4: add ext4_iget_normal() which is to be used for dir tree lookups
  ext4: don't orphan or truncate the boot loader inode
  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
  ext4: optimize block allocation on grow indepth
  ext4: get rid of code duplication
  ext4: fix over-defensive complaint after journal abort
  ext4: fix return value of ext4_do_update_inode
  ext4: fix mmap data corruption when blocksize < pagesize
  vfs: fix data corruption when blocksize < pagesize for mmaped data
  ext4: fold ext4_nojournal_sops into ext4_sops
  ext4: support freezing ext2 (nojournal) file systems
  ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
  ext4: don't check quota format when there are no quota files
  jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
  jbd2: avoid pointless scanning of checkpoint lists
  ...
2014-10-20 09:50:11 -07:00
Linus Torvalds
0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Dmitry Monakhov
dfe076c106 ext4: get rid of code duplication
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:26:17 -04:00
Theodore Ts'o
754cfed6bb ext4: drop the EXT4_STATE_DELALLOC_RESERVED flag
Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:08:22 -04:00
Theodore Ts'o
e3cf5d5d9a ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().

Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to.

This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything.  If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:07:25 -04:00
Christoph Lameter
a0b6bc63a2 block: Replace __this_cpu_ptr with raw_cpu_ptr
__this_cpu_ptr is being phased out use raw_cpu_ptr instead which was
introduced in 3.15-rc1.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:45 -04:00
Theodore Ts'o
c99d1e6e83 ext4: fix BUG_ON in mb_free_blocks()
If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():

	BUG_ON(last >= (sb->s_blocksize << 3));

Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
is zero.

Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().

Google-Bug-Id: 16844242

Fixes: 86f0afd463
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-23 17:47:28 -04:00
Theodore Ts'o
86f0afd463 ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
If there is a failure while allocating the preallocation structure, a
number of blocks can end up getting marked in the in-memory buddy
bitmap, and then not getting released.  This can result in the
following corruption getting reported by the kernel:

EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126,
12793 clusters in bitmap, 12729 in gd

In that case, we need to release the blocks using mb_free_blocks().

Tested: fs smoke test; also demonstrated that with injected errors,
	the file system is no longer getting corrupted

Google-Bug-Id: 16657874

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-30 22:17:17 -04:00
Xiaoguang Wang
b27b1535ac ext4: fix wrong size computation in ext4_mb_normalize_request()
As the member fe_len defined in struct ext4_free_extent is expressed as
number of clusters, the variable "size" computation is wrong, we need to
first translate fe_len to block number, then to bytes.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-27 22:26:36 -04:00
Theodore Ts'o
71d4f7d032 ext4: remove metadata reservation checks
Commit 27dd438542 ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed.  Given that, tracking the reservation of metadata blocks is
no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:02:38 -04:00
Theodore Ts'o
94d4c066a4 ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
We are spending a lot of time explaining to users what this error
means.  Let's try to improve the message to avoid this problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:15:50 -04:00
Namjae Jeon
e43bb4e612 ext4: decrement free clusters/inodes counters when block group declared bad
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result.  User can get immediately ENOSPC error from
write begin without reaching for the writepages.

Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-26 10:11:53 -04:00
Linus Torvalds
f8409abdc5 Clean ups and miscellaneous bug fixes, in particular for the new
collapse_range and zero_range fallocate functions.  In addition,
 improve the scalability of adding and remove inodes from the orphan
 list.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTk9x7AAoJENNvdpvBGATwQQ4QAN85xkNWWiq0feLGZjUVTre/
 JUgRQWXZYVogAQckQoTDXqJt1qKYxO45A8oIoUMI4uzgcFJm7iJIZJAv3Hjd2ftz
 48RVwjWHblmBz6e+CdmETpzJUaJr3KXbnk3EDQzagWg3Q64dBU/yT0c4foBO8wfX
 FI1MNin70r5NGQv6Mp4xNUfMoU6liCrsMO2RWkyxY2rcmxy6tkpNO/NBAPwhmn0e
 vwKHvnnqKM08Frrt6Lz3MpXGAJ+rhTSvmL+qSRXQn9BcbphdGa4jy+i3HbviRX4N
 z77UZMgMbfK1V3YHm8KzmmbIHrmIARXUlCM7jp4HPSnb4qhyERrhVmGCJZ8civ6Q
 3Cm9WwA93PQDfRX6Kid3K1tR/ql+ryac55o9SM990osrWp4C0IH+P/CdlSN0GspN
 3pJTLHUVVcxF6gSnOD+q/JzM8Iudl87Rxb17wA+6eg3AJRaPoQSPJoqtwZ89ZwOz
 RiZGuugFp7gDOxqo32lJ53fivO/e1zxXxu0dVHHjOnHBVWX063hlcibTg8kvFWg1
 7bBvUkvgT5jR+UuDX81wPZ+c0kkmfk4gxT5sHg6RlMKeCYi3uuLmAYgla3AM4j9G
 GeNNdVTmilH7wMgYB2wxd0C5HofgKgM5YFLZWc0FVSXMeFs5ST2kbLMXAZqzrKPa
 szHFEJHIGZByXfkP/jix
 =C1ZV
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Clean ups and miscellaneous bug fixes, in particular for the new
  collapse_range and zero_range fallocate functions.  In addition,
  improve the scalability of adding and remove inodes from the orphan
  list"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
  ext4: handle symlink properly with inline_data
  ext4: fix wrong assert in ext4_mb_normalize_request()
  ext4: fix zeroing of page during writeback
  ext4: remove unused local variable "stored" from ext4_readdir(...)
  ext4: fix ZERO_RANGE test failure in data journalling
  ext4: reduce contention on s_orphan_lock
  ext4: use sbi in ext4_orphan_{add|del}()
  ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
  ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
  ext4: remove unnecessary double parentheses
  ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
  ext4: make local functions static
  ext4: fix block bitmap validation when bigalloc, ^flex_bg
  ext4: fix block bitmap initialization under sparse_super2
  ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
  ext4: avoid unneeded lookup when xattr name is invalid
  ext4: fix data integrity sync in ordered mode
  ext4: remove obsoleted check
  ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
  ext4: fix locking for O_APPEND writes
  ...
2014-06-08 13:03:35 -07:00
Mel Gorman
2457aec637 mm: non-atomically mark page accessed during page cache allocation where possible
aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after.  Once the page is
visible the atomic operations are necessary which is noticable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticable with fast storage.  The objective of the patch is to initialse
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page.  This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
called before the page is visible and can be done non-atomically.

The primary APIs of concern in this care are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail to the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behavior such as whether the page should be marked accessed or not.  Then
old API is preserved but is basically a thin wrapper around this core
function.

Each of the filesystems are then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job.  There is a slight snag in that the timing of the
mark_page_accessed() has now changed so in rare cases it's possible a page
gets to the end of the LRU as PageReferenced where as previously it might
have been repromoted.  This is expected to be rare but it's worth the
filesystem people thinking about it in case they see a problem with the
timing change.  It is also the case that some filesystems may be marking
pages accessed that previously did not but it makes sense that filesystems
have consistent behaviour in this regard.

The test case used to evaulate this is a simple dd of a large file done
multiple times with the file deleted on each iterations.  The size of the
file is 1/10th physical memory to avoid dirty page balancing.  In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO.  The sync results are expected to be
more stable.  The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts.  Throughput and wall times are presented for sync IO, only wall
times are shown for async as the granularity reported by dd and the
variability is unsuitable for comparison.  As async results were variable
do to writback timings, I'm only reporting the maximum figures.  The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit strange as it managed to avoid a worst case by
sheer luck but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

[akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:10 -07:00
Maurizio Lombardi
b5b6077855 ext4: fix wrong assert in ext4_mb_normalize_request()
The variable "size" is expressed as number of blocks and not as
number of clusters, this could trigger a kernel panic when using
ext4 with the size of a cluster different from the size of a block.

Cc: stable@vger.kernel.org
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-05-27 12:48:56 -04:00
liang xie
5d60125530 ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
Make them more consistently

Signed-off-by: xieliang <xieliang@xiaomi.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-05-12 22:06:43 -04:00
Andrey Tsyvarev
029b10c5a8 ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
Caches from 'ext4_groupinfo_caches' may be in use by other mounts,
which have already existed.  So, it is incorrect to destroy them when
newly requested mount fails.

Found by Linux File System Verification project (linuxtesting.org).

Signed-off-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-05-12 12:34:21 -04:00
jon ernst
e2cbd58741 ext4: silence sparse check warning for function ext4_trim_extent
This fixes the following sparse warning:

     CHECK   fs/ext4/mballoc.c
   fs/ext4/mballoc.c:5019:9: warning: context imbalance in
   'ext4_trim_extent' - unexpected unlock

Signed-off-by: "Jon Ernst" <jonernst07@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-12 23:01:28 -04:00
Younger Liu
c57ab39b96 ext4: return ENOMEM rather than EIO when find_###_page() fails
Return ENOMEM rather than EIO when find_get_page() fails in
ext4_mb_get_buddy_page_lock() and find_or_create_page() fails in
ext4_mb_load_buddy().

Signed-off-by: Younger Liu <younger.liucn@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-04-10 23:03:43 -04:00
Eric Sandeen
dc9ddd984d ext4: remove unused ac_ex_scanned
When looking at a bug report with:

> kernel: EXT4-fs: 0 scanned, 0 found

I thought wow, 0 scanned, that's odd?  But it's not odd; it's printing
a variable that is initialized to 0 and never touched again.

It's never been used since the original merge, so I don't really even
know what the original intent was, either.

If anyone knows how to hook it up, speak now via patch, otherwise just
yank it so it's not making a confusing situation more confusing in
kernel logs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 13:32:10 -05:00
Theodore Ts'o
ab0c00fccf ext4: make sure ex.fe_logical is initialized
The lowest levels of mballoc set all of the fields of struct
ext4_free_extent except for fe_logical, since they are just trying to
find the requested free set of blocks, and the logical block hasn't
been set yet.  This makes some static code checkers sad.  Set it to
various different debug values, which would be useful when
debugging mballoc if these values were to ever show up due to the
parts of mballoc triyng to use ac->ac_b_ex.fe_logical before it is
properly upper layers of mballoc failing to properly set, usually by
ext4_mb_use_best_found().

Addresses-Coverity-Id: #139697
Addresses-Coverity-Id: #139698
Addresses-Coverity-Id: #139699

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2014-02-20 00:36:41 -05:00
Theodore Ts'o
f5a44db5d2 ext4: add explicit casts when masking cluster sizes
The missing casts can cause the high 64-bits of the physical blocks to
be lost.  Set up new macros which allows us to make sure the right
thing happen, even if at some point we end up supporting larger
logical block numbers.

Thanks to the Emese Revfy and the PaX security team for reporting this
issue.

Reported-by: PaX Team <pageexec@freemail.hu>
Reported-by: Emese Revfy <re.emese@gmail.com>                                 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-20 09:29:35 -05:00
Junho Ryu
4e8d213980 ext4: fix use-after-free in ext4_mb_new_blocks
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.

* Free sequence
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa

* Use sequence
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_use_preallocated (4):		increase pa->pa_count
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_release_context:	access pa

* Use-after-free sequence
[initial status]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
[pa_count decremented]		<pa->pa_deleted = 0, pa_count = 0>
ext4_mb_use_preallocated (4):		increase pa->pa_count
[pa_count incremented]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
[race condition!]		<pa->pa_deleted = 1, pa_count = 1>
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa
ext4_mb_release_context:	access pa

AddressSanitizer has detected use-after-free in ext4_mb_new_blocks
Bug report: http://goo.gl/rG1On3

Signed-off-by: Junho Ryu <jayr@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 18:10:28 -05:00
Lukas Czerner
8f9ff18920 ext4: fix FITRIM in no journal mode
When using FITRIM ioctl on a file system without journal it will
only trim the block group once, no matter how many times you invoke
FITRIM ioctl and how many block you release from the block group.

It is because we only clear EXT4_GROUP_INFO_WAS_TRIMMED_BIT in journal
callback. Fix this by clearing the bit in no journal mode as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jorge Fábregas <jorge.fabregas@gmail.com>
2013-10-30 11:10:52 -04:00
Darrick J. Wong
163a203ddb ext4: mark block group as corrupt on block bitmap error
When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent
further block allocations/deallocations from it. Currently, we end up
generating one error message for every block in the bitmap. This
potentially could make the system unstable as noticed in some
bugs. With this patch, the error will be printed only the first time
and mark the entire block group as corrupted. This prevents future
access allocations/deallocations from it.

Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.

Google-Bug-Id: 7258357

[darrick.wong@oracle.com]
Further modifications (by Darrick) to make more obvious that this corruption
bit applies to blocks only.  Set the corruption flag if the block group bitmap
verification fails.

Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 17:35:51 -04:00
Jan Kara
7d7345322d ext4: fix warning in ext4_da_update_reserve_space()
reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():

EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)

The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.

Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-17 09:36:54 -04:00
Theodore Ts'o
e7676a704e ext4: don't allow ext4_free_blocks() to fail due to ENOMEM
The filesystem should not be marked inconsistent if ext4_free_blocks()
is not able to allocate memory.  Unfortunately some callers (most
notably ext4_truncate) don't have a way to reflect an error back up to
the VFS.  And even if we did, most userspace applications won't deal
with most system calls returning ENOMEM anyway.

Reported-by: Nagachandra P <nagachandra@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-07-13 00:40:35 -04:00
Alexey Khoroshilov
2c00ef3ee3 ext4: implement error handling of ext4_mb_new_preallocation()
If memory allocation in ext4_mb_new_group_pa() is failed,
it returns error code, ext4_mb_new_preallocation() propages it,
but ext4_mb_new_blocks() ignores it.

An observed result was:

- allocation fail means ext4_mb_new_group_pa() does not update
  ext4_allocation_context;

- ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
  ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
  of number of blocks requested (1);

- that activates update cycle in ext4_splice_branch():
    for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
      *(where->p + i) = cpu_to_le32(current_block++);

- it iterates 511 times and corrupts a chunk of memory including inode
  structure;

- page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();

- system hangs with 'scheduling while atomic' BUG.

The patch implements a check for ext4_mb_new_preallocation() error
code and handles its failure as if ext4_mb_regular_allocator() fails.

Found by Linux File System Verification project (linuxtesting.org).

[ Patch restructed by tytso to make the flow of control easier to follow. ]

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-07-01 08:12:36 -04:00
Theodore Ts'o
2ed5724d5a ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator()
For a file systems with a very large number of block groups, if all of
the block group bitmaps are in memory and the file system is
relatively badly fragmented, it's possible ext4_mb_regular_allocator()
to take a long time trying to find a good match.  This is especially
true if the tuning parameter mb_max_to_scan has been sent to a very
large number.  So add a cond_resched() to avoid soft lockup warnings
and to provide better system responsiveness.

For ext4_free_blocks(), if we are deleting a large range of blocks,
and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
the loop to call sb_find_get_block() and to call ext4_forget() can
take over 10-15 milliseocnds or more.  So it's better to add a
cond_resched() here a well.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 11:43:02 -04:00
Linus Torvalds
b973425cbb Fixed regressions (two stability regressions and a performance
regression) introduced during the 3.10-rc1 merge window.  Also
 included is a bug fix relating to allocating blocks after resizing an
 ext3 file system when using the ext4 file system driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJRkZBlAAoJENNvdpvBGATwLYQP/iWOBs2z93WG23cqkgqvL8o6
 ZyeJdgy9dkFCArVDX5SSnGkJXZ3iqIKi5HoTKTJKfytgMzgiDAZcLsIHVv6NczwR
 UGhjgS3HEdV5tJ46E6JnpB3NLSb+rAdc5kCdlsbzU46CP+JjFiYEhxVpK7ELuM/G
 yctChbIH9FY+1OwxHccacBOaJU2ELhnH6B/8Ry/6gM2H0vfKeTNOdocOHdxvbNqg
 ooGjytMfVopMQEfVG8aXtTfy341NFJH5fAYEahCcXxeO9ta6Unj9yOu5JV2wVrTt
 39+DBsquGX6AVQsc9IxJ6YAN6ldwWN7l3huE9/AI0o/alwGsfVi5M+M/d1MMjDqf
 Fgl2EzzBpZQeKKY9UXNi4LLgYdBiILMgKDOGoRKhRb8ynSSf/JX43+24FvidEi3o
 o//J4aR+oSZfaovGAeikqyF1cumayhoNN8MINRN8igIinBiC4GjBFEl/Kl/1eAY/
 lREGcsmYPXOkVPpM72waRYlP4GwNdOg4QSEY0SGljpwluO+dYtKQjHXcv/s/xL5v
 j3GemzYVyjx4zaq1g3PxGfuD6VKFHr0T6jvzd6cHu17lnPlw9fwznHbEm9BEcXDY
 gbGx9u+a2ZTqDwYVALbeoRpf9Zz6DUCse3ts4N3rbkXUQQiBYo7tybfVopIMAukb
 CexvidDE/ryJrJJFBwoK
 =6cRD
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Fixed regressions (two stability regressions and a performance
  regression) introduced during the 3.10-rc1 merge window.

  Also included is a bug fix relating to allocating blocks after
  resizing an ext3 file system when using the ext4 file system driver"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
  ext4: revert "ext4: use io_end for multiple bios"
  ext4: limit group search loop for non-extent files
  ext4: fix fio regression
2013-05-14 09:30:54 -07:00
Lachlan McIlroy
e6155736ad ext4: limit group search loop for non-extent files
In the case where we are allocating for a non-extent file,
we must limit the groups we allocate from to those below
2^32 blocks, and ext4_mb_regular_allocator() attempts to
do this initially by putting a cap on ngroups for the
subsequent search loop.

However, the initial target group comes in from the 
allocation context (ac), and it may already be beyond
the artificially limited ngroups.  In this case,
the limit

	if (group == ngroups)
		group = 0;

at the top of the loop is never true, and the loop will
run away.

Catch this case inside the loop and reset the search to
start at group 0.

[sandeen@redhat.com: add commit msg & comments]

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-05-05 23:10:00 -04:00
Linus Torvalds
20b4fb4852 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS updates from Al Viro,

Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).

7kloc removed.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
  don't bother with deferred freeing of fdtables
  proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
  proc: Make the PROC_I() and PDE() macros internal to procfs
  proc: Supply a function to remove a proc entry by PDE
  take cgroup_open() and cpuset_open() to fs/proc/base.c
  ppc: Clean up scanlog
  ppc: Clean up rtas_flash driver somewhat
  hostap: proc: Use remove_proc_subtree()
  drm: proc: Use remove_proc_subtree()
  drm: proc: Use minor->index to label things, not PDE->name
  drm: Constify drm_proc_list[]
  zoran: Don't print proc_dir_entry data in debug
  reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
  proc: Supply an accessor for getting the data from a PDE's parent
  airo: Use remove_proc_subtree()
  rtl8192u: Don't need to save device proc dir PDE
  rtl8187se: Use a dir under /proc/net/r8180/
  proc: Add proc_mkdir_data()
  proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
  proc: Move PDE_NET() to fs/proc/proc_net.c
  ...
2013-05-01 17:51:54 -07:00
Dmitri Monakho
8c8e0ca622 ext4: fix usless declarations
This patch should fix sparse complains about shadow declatations.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 22:48:36 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Andrey Sidorov
eabe0444df ext4: speed-up releasing blocks on commit
Improve mb_free_blocks speed by clearing entire range at once instead
of iterating over each bit. Freeing block-by-block also makes buddy
bitmap subtree flip twice making most of the work a no-op. Very few
bits in buddy bitmap require change, e.g. freeing entire group is a 1
bit flip only.  As a result, releasing blocks of 60G file now takes
5ms instead of 2.7s.  This is especially good for non-preemptive
kernels as there is no rescheduling during release.

Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-09 12:22:29 -04:00
Lukas Czerner
bd86298e60 ext4: introduce ext4_get_group_number()
Currently on many places in ext4 we're using
ext4_get_group_no_and_offset() even though we're only interested in
knowing the block group of the particular block, not the offset within
the block group so we can use more efficient way to compute block
group.

This patch introduces ext4_get_group_number() which computes block
group for a given block much more efficiently. Use this function
instead of ext4_get_group_no_and_offset() everywhere where we're only
interested in knowing the block group.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-04-03 23:32:34 -04:00
Dmitry Monakhov
5d3ee20855 ext4: fix journal callback list traversal
It is incorrect to use list_for_each_entry_safe() for journal callback
traversial because ->next may be removed by other task:
->ext4_mb_free_metadata()
  ->ext4_mb_free_metadata()
    ->ext4_journal_callback_del()

This results in the following issue:

WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
Hardware name:
list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
Call Trace:
 [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
 [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
 [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
 [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
 [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
 [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
 [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
 [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
 [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
 [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
 [<ffffffff810ac6be>] kthread+0x10e/0x120
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70

This patch fix the issue as follows:
- ext4_journal_commit_callback() make list truly traversial safe
  simply by always starting from list_head
- fix race between two ext4_journal_callback_del() and
  ext4_journal_callback_try_del()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.com
2013-04-03 22:08:52 -04:00
Theodore Ts'o
b10a44c369 ext4: add might_sleep() annotations
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2013-04-03 22:00:52 -04:00
Theodore Ts'o
90ba983f68 ext4: use atomic64_t for the per-flexbg free_clusters count
A user who was using a 8TB+ file system and with a very large flexbg
size (> 65536) could cause the atomic_t used in the struct flex_groups
to overflow.  This was detected by PaX security patchset:

http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551

This bug was introduced in commit 9f24e4208f, so it's been around
since 2.6.30.  :-(

Fix this by using an atomic64_t for struct orlav_stats's
free_clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2013-03-11 23:39:59 -04:00
Lukas Czerner
bb8b20ed94 ext4: do not use yield()
Using yield() is strongly discouraged (see sched/core.c) especially
since we can just use cond_resched().

Replace all use of yield() with cond_resched().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:28:09 -04:00
Lukas Czerner
e3d85c3660 ext4: remove unused variable in ext4_free_blocks()
Remove unused variable 'freed' in ext4_free_blocks().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-03-10 22:21:49 -04:00
Lukas Czerner
810da240f2 ext4: convert number of blocks to clusters properly
We're using macro EXT4_B2C() to convert number of blocks to number of
clusters for bigalloc file systems.  However, we should be using
EXT4_NUM_B2C().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-03-02 17:18:58 -05:00
Theodore Ts'o
a0b30c1229 ext4: use module parameters instead of debugfs for mballoc_debug
There are multiple reasons to move away from debugfs.  First of all,
we are only using it for a single parameter, and it is much more
complicated to set up (some 30 lines of code compared to 3), and one
more thing that might fail while loading the ext4 module.

Secondly, as a module paramter it can be specified as a boot option if
ext4 is built into the kernel, or as a parameter when the module is
loaded, and it can also be manipulated dynamically under
/sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.

Ultimately we want to move away from using mb_debug() towards
tracepoints, but for now this is still a useful simplification of the
code base.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-02-09 16:28:20 -05:00
Theodore Ts'o
40ae348762 ext4: optimize mballoc for large allocations
The ext4 block allocator only maintains buddy bitmaps for chunks which
are less than or equal to one quarter of a block group.  That is, for
a file aystem with a 1k blocksize, and where the number of blocks in a
block group is 8192 blocks, the largest chunk size tracked by buddy
bitmaps is 2048 blocks.

For a file system with a 4k blocksize, and where the number of blocks
in a block group is 32768 blocks, the largest chunk size tracked by
buddy bitmaps is 8192 blocks.

To work around this code, mballoc.c before this commit would truncate
allocation requests to the number of blocks in a block group minus 10.
Why 10?  Aside from being a completely arbitrary number, it avoids
block allocation to be a power of two larger than 25% of the block
group.  If you try to explicitly fallocate 50% of the block group
size, this will demonstrate the problem; the block allocation code
will scan the all of the blocks in the file system with cr==0 (since
the request is for a natural power of two), but then completely fail
for all blocks groups, since the buddy bitmaps don't track chunk sizes
of 50% of the block group.

To fix this, in these we use ext4_mb_complex_scan_group() instead of
ext4_mb_simple_scan_group().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger@dilger.ca>
2013-02-04 15:08:40 -05:00
Niu Yawei
f116700971 ext4: fix race in ext4_mb_add_n_trim()
In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when
changing the lg_prealloc_list.

Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-02-01 21:31:27 -05:00
Lukas Czerner
d71c1ae23a ext4: warn when discard request fails other than EOPNOTSUPP
We should warn user then the discard request fails. However we need to
exclude -EOPNOTSUPP case since parts of the device might not support it
while other parts can. So print the kernel warning when the error !=
-EOPNOTSUPP is returned from ext4_issue_discard().

We should also handle error cases in batched discard, again excluding
EOPNOTSUPP.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 14:04:52 -05:00
Alan Cox
d8ec0c3960 ext4: remove unused assignment
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 12:19:58 -05:00
Eric Sandeen
6d138ced75 ext4: fix awful goto in ext4_mb_new_blocks()
I think the whole function could be made prettier, but
that goto really took the cake for too-clever-by-half.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-11-08 11:11:59 -05:00
Linus Torvalds
e589db7a6a Various bug fixes for ext4. The most serious of them fixes a security
bug (CVE-2012-4508) which leads to stale data exposure when we have
 fallocate racing against writes to files undergoing delayed
 allocation.  We also have two fixes for the metadata checksum feature,
 the most serious of which can cause the superblock to have a invalid
 checksum after a power failure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQhcJVAAoJENNvdpvBGATwlc0QAJ5eRVSXoQ9DL/rpycZtWsiR
 1HofZCBbeVJq7JkazypYZPV+ncm2Nxljx61EBMpReDgx+hgJS8VD7BcjxblXT1gK
 cvIk7tYXS1E5++TWZzQd0v3GMDoMJsfzb0Ao6vefaVgqh07MKE9Zvx0L8JR4tsH1
 YlRs2/ZALFqqMficemXpDuWRRoBTEcYkvaW9PtUIpeuk9i71iSCDiHvi0mRy4dYe
 nLftjBOjcsIuK0I7DfUYrbZNQuYacFcFTM5foE6lhdT+tlL1/od2M00IpopSSjF8
 7RoqV351FqL74Stu71wDp+q9n8t8bR9gnvEuDisHXXH6PKIYo83vawvuDKtP05lt
 lF0l2nKy/QorQtUNRnrWiRshPNEplmKM1yfRXwzfq5CX4Mjox1PM9g1AfMT/Pzbq
 wNPMqtiaNnVzfcSP94MTExKMR5axFgeFsIwuCtPVNDAUEbEFuwKARIeFjCGxYYsr
 81rIKD4lgvJjaHChtE/NzslQysMmr6qiZa17s+NteCwNRJX7U4xN99SO2BXSW7lW
 xGb1ZjdESiBZGzsmuOXqAQw7KWIRS7bQ+s4dewEbqQJomPD3NQUKsRVo1wdWeUqI
 6SI1YsBYEDPiViiohsFyn1Zl3BHgIvypWMW/ChhKtYsmEJapTnJ3SR3a+eafFZ99
 HgCMCF1i0KN1tvMDrLRn
 =qL29
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Various bug fixes for ext4.  The most serious of them fixes a security
  bug (CVE-2012-4508) which leads to stale data exposure when we have
  fallocate racing against writes to files undergoing delayed
  allocation.  We also have two fixes for the metadata checksum feature,
  the most serious of which can cause the superblock to have a invalid
  checksum after a power failure."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Avoid underflow in ext4_trim_fs()
  ext4: Checksum the block bitmap properly with bigalloc enabled
  ext4: fix undefined bit shift result in ext4_fill_flex_info
  ext4: fix metadata checksum calculation for the superblock
  ext4: race-condition protection for ext4_convert_unwritten_extents_endio
  ext4: serialize fallocate with ext4_convert_unwritten_extents
2012-10-23 08:48:26 +03:00
Lukas Czerner
5de35e8d5c ext4: Avoid underflow in ext4_trim_fs()
Currently if len argument in ext4_trim_fs() is smaller than one block,
the 'end' variable underflow. Avoid that by returning EINVAL if len is
smaller than file system block.

Also remove useless unlikely().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-10-22 18:01:19 -04:00
Tao Ma
79f1ba4956 ext4: Checksum the block bitmap properly with bigalloc enabled
In mke2fs, we only checksum the whole bitmap block and it is right.
While in the kernel, we use EXT4_BLOCKS_PER_GROUP to indicate the
size of the checksumed bitmap which is wrong when we enable bigalloc.
The right size should be EXT4_CLUSTERS_PER_GROUP and this patch fixes
it.

Also as every caller of ext4_block_bitmap_csum_set and
ext4_block_bitmap_csum_verify pass in EXT4_BLOCKS_PER_GROUP(sb)/8,
we'd better removes this parameter and sets it in the function itself.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: stable@vger.kernel.org
2012-10-22 00:34:32 -04:00
Linus Torvalds
6432f21284 The big new feature added this time is supporting online resizing
using the meta_bg feature.  This allows us to resize file systems
 which are greater than 16TB.  In addition, the speed of online
 resizing has been improved in general.
 
 We also fix a number of races, some of which could lead to deadlocks,
 in ext4's Asynchronous I/O and online defrag support, thanks to good
 work by Dmitry Monakhov.
 
 There are also a large number of more minor bug fixes and cleanups
 from a number of other ext4 contributors, quite of few of which have
 submitted fixes for the first time.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQbxMXAAoJENNvdpvBGATwlg4QAJZ4mHNSL2eaaxjRtTbL1pAz
 +FVXpJ3lhw1lSfE9hJGqPVE8EfU2fWjIqxEI7dgh95Tukc5pUnPAQ2/hBz8ZA0qq
 o0AFMk3mRnvCEh6HsZfumsV83eqpR3k/zEy4uFH+KtxBskPe2sEKy3B7qOxvgdKW
 Gh8B2WqF2BpIj9WIT1P9G6xsxZW64EMHTbWcgRhuoRD7bakDNnwQ3kElz/TJQU5q
 bM/5wE7pqKwU2J1L0Ho0mxDi0f/BbXeJdA9k1tQy2KM1pZwHtpj4Ls0qmfoi49GE
 KyZqQOXlFbAz/9tidPDceY5KoRRQm1MwZ+1MimQX1P+40cs/w3pNu3yiibcaXIru
 UZ63AQMCj5JHMcFNVi20sVCwjU/ibNtEO75cfDD4bzPgHJvfCj73EbHTLl21nbTu
 izIMffhJEHmRnmRXiiortYVuI4b19oIfnXg7eclrJoUWSuGwKKsJOc5nMjDqidG4
 B7Gq4TD89sGkIYzx+50E+ll2ispcBN0BQnGqp4k2BzgDyEHhuFYk7VuVQvJgCGTi
 eobzQJj7JUXPWxyemcAVkQTtUq4vVbkm/IwS+/GA9b9Z80X8hR8x6EVHUW5lX3qC
 YHoBSCU4XKZXXWqzx0fIVCXyKKFiBzM+OXcgHOKH90vK8k6kPmPODhNCxvV3pITU
 jfl9q+X1dY4SpybZjLt5
 =iYeV
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The big new feature added this time is supporting online resizing
  using the meta_bg feature.  This allows us to resize file systems
  which are greater than 16TB.  In addition, the speed of online
  resizing has been improved in general.

  We also fix a number of races, some of which could lead to deadlocks,
  in ext4's Asynchronous I/O and online defrag support, thanks to good
  work by Dmitry Monakhov.

  There are also a large number of more minor bug fixes and cleanups
  from a number of other ext4 contributors, quite of few of which have
  submitted fixes for the first time."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
  ext4: fix ext4_flush_completed_IO wait semantics
  ext4: fix mtime update in nodelalloc mode
  ext4: fix ext_remove_space for punch_hole case
  ext4: punch_hole should wait for DIO writers
  ext4: serialize truncate with owerwrite DIO workers
  ext4: endless truncate due to nonlocked dio readers
  ext4: serialize unlocked dio reads with truncate
  ext4: serialize dio nonlocked reads with defrag workers
  ext4: completed_io locking cleanup
  ext4: fix unwritten counter leakage
  ext4: give i_aiodio_unwritten a more appropriate name
  ext4: ext4_inode_info diet
  ext4: convert to use leXX_add_cpu()
  ext4: ext4_bread usage audit
  fs: reserve fallocate flag codepoint
  ext4: remove redundant offset check in mext_check_arguments()
  ext4: don't clear orphan list on ro mount with errors
  jbd2: fix assertion failure in commit code due to lacking transaction credits
  ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
  ext4: enable FITRIM ioctl on bigalloc file system
  ...
2012-10-08 06:36:39 +09:00
Linus Torvalds
99dbb1632f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull the trivial tree from Jiri Kosina:
 "Tiny usual fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  doc: fix old config name of kprobetrace
  fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
  btrfs: fix the commment for the action flags in delayed-ref.h
  btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
  vfs: fix kerneldoc for generic_fh_to_parent()
  treewide: fix comment/printk/variable typos
  ipr: fix small coding style issues
  doc: fix broken utf8 encoding
  nfs: comment fix
  platform/x86: fix asus_laptop.wled_type module parameter
  mfd: printk/comment fixes
  doc: getdelays.c: remember to close() socket on error in create_nl_socket()
  doc: aliasing-test: close fd on write error
  mmc: fix comment typos
  dma: fix comments
  spi: fix comment/printk typos in spi
  Coccinelle: fix typo in memdup_user.cocci
  tmiofb: missing NULL pointer checks
  tools: perf: Fix typo in tools/perf
  tools/testing: fix comment / output typos
  ...
2012-10-01 09:06:36 -07:00
Lukas Czerner
aaf7d73e54 ext4: enable FITRIM ioctl on bigalloc file system
With a minor tweaks regarding minimum extent size to discard and
discarded bytes reporting the FITRIM can be enabled on bigalloc file
system and it works without any problem.

This patch fixes minlen handling and discarded bytes reporting to
take into consideration bigalloc enabled file systems and finally
removes the restriction and allow FITRIM to be used on file system with
bigalloc feature enabled.

Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-26 22:21:21 -04:00
Wei Yongjun
85556c9a50 ext4: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-26 20:43:37 -04:00
Yongqiang Yang
838cd0cf9a ext4: check free block counters in ext4_mb_find_by_goal
Free block counters should be checked before doing allocation.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-23 23:10:51 -04:00
Theodore Ts'o
b5e2368bae ext4: re-enable -o discard functionality in no-journal mode
This is a revert of commit b56ff9d397, which removed the call to
ext4_issue_discard() to fix a BUG reported because
ext4_issue_discard() was being called from inside a block group
spinlock.  As it turns out this bug had already been fixed by Lukas
Czerner in commit 53fdcf992d by the simple expedient of moving when
we call ext4_issue_discard() outside the spinlock.

So it should be safe to re-enable this functionality, which I tested
by putting an BUG_ON(in_atomic) just after the restored callsite to
ext4_issue_discard().

Addresses-Google-Bug: #6750518

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
2012-09-18 13:33:44 -04:00
Theodore Ts'o
28623c2f5b ext4: grow the s_group_info array as needed
Previously we allocated the s_group_info array with enough space for
any future possible growth of the file system via online resize.  This
is unfortunate because it wastes memory, and it doesn't work for the
meta_bg scheme, since there is no limit based on the number of
reserved gdt blocks.  So add the code to grow the s_group_info array
as needed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-09-05 01:31:50 -04:00
Anatol Pomozov
4907cb7b19 treewide: fix comment/printk/variable typos
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-01 10:33:05 -07:00
Robin Dong
15c006a22f ext4: remove unused function argument 'order' in mb_find_extent()
All the routines call mb_find_extent are setting argument 'order' to 0
just like:

	mb_find_extent(e4b, 0, ex.fe_start, ex.fe_len, &ex);

therefore the useless argument should be removed.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 10:02:17 -04:00
Theodore Ts'o
01fc48e892 ext4: don't load the block bitmap for block groups which have no space
Add a short circuit check to ext4_mb_group_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available.  (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)

Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=45741
Reported-by: mirek@me.com
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-08-17 09:46:17 -04:00
Jan Kara
97a7406880 ext4: remove useless marking of superblock dirty
Commit a0375156 properly notes that superblock doesn't need to be marked
as dirty when only number of free inodes / blocks / number of directories
changes since that is recomputed on each mount anyway. However that comment
leaves some unnecessary markings as dirty in place. Remove these.

Artem: tested using xfstests for both journalled and non-journalled ext4.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2012-07-22 20:29:31 -04:00
Haibo Liu
62a1391ddd ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()
In this patch, the statement "poff = block % blocks_per_page"
in ext4_mb_get_buddy_page_lock has no effect.

It will be optimized out by the compiler, but it's better to remove it.

Signed-off-by: Haibo Liu <HaiboLiu6@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-07-09 16:29:28 -04:00
Aditya Kali
1c8457cadc ext4: avoid uneeded calls to ext4_mb_load_buddy() while reading mb_groups
Currently ext4_mb_load_buddy is called for every group, irrespective
of whether the group info is already in memory, while reading
/proc/fs/ext4/<partition>/mb_groups proc file.  For the purpose of
mb_groups proc file, it is unnecessary to load the file group info
from disk if it was loaded in past.  These calls to ext4_mb_load_buddy
make reading the mb_groups proc file expensive.

Also, the locks around ext4_get_group_info are not required.

This patch modifies the code to call ext4_mb_load_buddy only if the
group info had never been loaded into memory in past. It also removes
the mb group locking around ext4_get_group_info call.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-06-30 19:10:57 -04:00
Salman Qazi
95599968d1 ext4: remove mb_groups before tearing down the buddy_cache
We can't have references held on pages in the s_buddy_cache while we are
trying to truncate its pages and put the inode.  All the pages must be
gone before we reach clear_inode.  This can only be gauranteed if we
can prevent new users from grabbing references to s_buddy_cache's pages.

The original bug can be reproduced and the bug fix can be verified by:

while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \
	umount /export/hda3/ram0; done &

while true; do cat /proc/fs/ext4/ram0/mb_groups; done

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-05-31 23:52:14 -04:00
Salman Qazi
02b7831019 ext4: add ext4_mb_unload_buddy in the error path
ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching
ext4_mb_unload_buddy when it fails a memory allocation.

Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2012-05-31 23:51:27 -04:00
Zheng Liu
400db9d301 ext4: cleanup in ext4_discard_allocated_blocks()
remove 'len' variable in ext4_discard_allocated_blocks() because it is
useless.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 17:53:53 -04:00
Akira Fujita
9d99012ff2 ext4: remove needs_recovery in ext4_mb_init()
needs_recovery in ext4_mb_init() is not used, remove it.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.ne.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-05-28 14:19:25 -04:00
Darrick J. Wong
feb0ab32a5 ext4: make block group checksums use metadata_csum algorithm
metadata_csum supersedes uninit_bg.  Convert the ROCOMPAT uninit_bg
flag check to a helper function that covers both, and make the
checksum calculation algorithm use either crc16 or the metadata_csum
chosen algorithm depending on which flag is set.  Print a warning if
we try to mount a filesystem with both feature flags set.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:45:10 -04:00
Darrick J. Wong
fa77dcfafe ext4: calculate and verify block bitmap checksum
Compute and verify the checksum of the block bitmap; this checksum is
stored in the block group descriptor.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-04-29 18:35:10 -04:00
Lukas Czerner
a7967f055a ext4: always set then trimmed blocks count into len
Currently if the range to trim is too small, for example on 1K fs
the request to trim the first block, then the 'range->len' is not set
reporting wrong number of discarded block to the caller.

Fix this by always setting the 'range->len' before we return. Note that
when there is a failure (-EINVAL) caller can not depend on 'range->len'
being set properly.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 21:26:22 -04:00
Lukas Czerner
21e7fd22a5 ext4: fix trimmed block count accunting
Currently when there is not enough free blocks in the block group to
discard (grp->bb_free < minlen) the 'trimmed' is bumped up anyway with
the number of discarded blocks from the previous iteration. Fix this
by bumping up 'trimmed' only if the ext4_trim_all_free() was actually
run.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 21:24:22 -04:00
Lukas Czerner
913eed83ed ext4: fix start and len arguments handling in ext4_trim_fs()
The overflow can happen when we are calling get_group_no_and_offset()
which stores the group number in the ext4_grpblk_t type which is
actually int. However when the blocknr is big enough the group number
might be bigger than ext4_grpblk_t resulting in overflow. This will
most likely happen with FITRIM default argument len = ULLONG_MAX.

Fix this by using "end" variable instead of "start+len" as it is easier
to get right and specifically check that the end is not beyond the end
of the file system, so we are sure that the result of
get_group_no_and_offset() will not overflow. Otherwise truncate it to
the size of the file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-21 21:22:22 -04:00
Theodore Ts'o
1084f252e3 ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
The functions ext4_msg() and ext4_error() already tack on a trailing
newline, so remove the unnecessary extra newline.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:13:43 -04:00
Joe Perches
7f6a11e73d ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
ext4_msg adds "EXT4-fs: " to the messsage output.
Remove the redundant bits from uses.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-19 23:09:43 -04:00
Theodore Ts'o
c5e8f3f3bc ext4: remove EXT4_MB_{BITMAP,BUDDY} macros
The EXT4_MB_BITMAP and EXT4_MB_BUDDY macros obfuscate more than they
provide any abstraction.   So remove them.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:54:06 -05:00
Curt Wohlgemuth
1592d2c557 ext4: mark possibly unused variable in ext4_mb_normalize_request()
The 'orig_size' local variable is only used in a call to
mb_debug().  Mark it with '__maybe_unused'.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:03 -05:00
Bobi Jam
18aadd47f8 ext4: expand commit callback and
The per-commit callback was used by mballoc code to manage free space
bitmaps after deleted blocks have been released.  This patch expands
it to support multiple different callbacks, to allow other things to
be done after the commit has been completed.

Signed-off-by: Bobi Jam <bobijam@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:02 -05:00
Theodore Ts'o
813e57276f ext4: fix race when setting bitmap_uptodate flag
In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate()
before submitting the buffer for read.  The is bad, since we check
bitmap_uptodate() without locking the buffer, and so if another
process is racing with us, it's possible that they will think the
bitmap is uptodate even though the read has not completed yet,
resulting in inodes and blocks potentially getting allocated more than
once if we get really unlucky.

Addresses-Google-Bug: 2828254

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:52:46 -05:00
Yongqiang Yang
60e07cf515 ext4: do not reference pa_inode from group_pa
pa_inode in group_pa is set NULL in ext4_mb_new_group_pa, so
pa_inode should be not referenced.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-18 15:49:54 -05:00
Robin Dong
0a10da73e1 ext4: fix a wrong comment in __mb_check_buddy()
The comment says the bit should be 0, but the after code assert the
bit to be 1.  This makes people confused, so fix it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 08:48:54 -04:00
Robin Dong
b051d8dc4e ext4: remove unused variable in mb_find_extent()
The variable 'ord' in function mb_find_extent() is redundant, so
remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:30:30 -04:00
Robin Dong
66a83cde47 ext4: remove unused variable in ext4_mb_generate_from_pa()
The variable 'count' in function ext4_mb_generate_from_pa() looks
useless, so remove it.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:29:21 -04:00
Robin Dong
ebbe027797 ext4: use stream-alloc when mb_group_prealloc set to zero
The kernel will crash on 

ext4_mb_mark_diskspace_used:
	BUG_ON(ac->ac_b_ex.fe_len <= 0);

after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem.

The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request
because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC.

I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION,
so we should set alloc-strategy to STREAM in this case.

Signed-off-by: Robin Dong <sanbai@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-26 05:14:27 -04:00
Dmitry Monakhov
45dc63e7d8 ext4: Allow quota file use root reservation
Quota file is fs's metadata, so it is reasonable  to permit use
root resevation if necessary. This patch fix 265'th xfstest failure

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-20 20:07:23 -04:00
Tao Ma
7aa0baeaba ext4: Free resources in ext4_mb_init()'s error paths
In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation
of s_locality_group to avoid memory leak in error path, but there are
still some other error paths in ext4_mb_init that need to do the same
work. So this patch adds all the error patch for ext4_mb_init. And all
the pointers are reset to NULL in case the caller may double free them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-06 10:22:28 -04:00
Theodore Ts'o
e7d5f3156e ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:14:51 -04:00
Theodore Ts'o
cff1dfd767 ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:12:51 -04:00
Theodore Ts'o
021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Aditya Kali
7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o
27baebb849 ext4: tune mballoc's default group prealloc size for bigalloc file systems
The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size.  This is
probably too big for large cluster sizes.  So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:02:51 -04:00
Theodore Ts'o
24aaa8ef4e ext4: convert the free_blocks field in s_flex_groups to be free_clusters
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:58:51 -04:00
Theodore Ts'o
5704265188 ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:56:51 -04:00
Theodore Ts'o
84130193e0 ext4: teach ext4_free_blocks() about bigalloc and clusters
The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not.  That will be up the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.

We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:50:51 -04:00
Theodore Ts'o
53accfa9f8 ext4: teach mballoc preallocation code about bigalloc clusters
In most of mballoc.c, we do everything in units of clusters, since the
block allocation bitmaps and buddy bitmaps are all denominated in
clusters.  The one place where we do deal with absolute block numbers
is in the code that handles the preallocation regions, since in the
case of inode-based preallocation regions, the start of the
preallocation region can't be relative to the beginning of the group.

So this adds a bit of complexity, where pa_pstart and pa_lstart are
block numbers, while pa_free, pa_len, and fe_len are denominated in
units of clusters.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:48:51 -04:00
Theodore Ts'o
7137d7a48e ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP
Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems).  There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:38:51 -04:00
Yu Jian
79a77c5ac3 ext4: prevent memory leaks from ext4_mb_init_backend() on error path
In ext4_mb_init(), if the s_locality_group allocation fails it will
currently cause the allocations made in ext4_mb_init_backend() to
be leaked.  Moving the ext4_mb_init_backend() allocation after the
s_locality_group allocation avoids that problem.

Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 17:41:46 -04:00
Yu Jian
48e6061bf4 ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
Signed-off-by: Yu Jian <yujian@whamcloud.com>
Signed-off-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 17:41:39 -04:00
Theodore Ts'o
9d8b9ec442 ext4: use ext4_msg() instead of printk in mballoc
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 17:41:35 -04:00
Theodore Ts'o
f18a5f21c2 ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:38 -04:00
Yongqiang Yang
c3e94d1df9 ext4: let setup_new_group_blocks() set multiple bits at a time
Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 22:05:53 -04:00
Yongqiang Yang
4740b830ed ext4: let ext4_group_add_blocks() handle 0 blocks quickly
If ext4_group_add_blocks() is called with 0 block, make it return 0
without doing any extra work.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:51:08 -04:00
Yongqiang Yang
cc7365dfe4 ext4: let ext4_group_add_blocks() return an error code
This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle error correctly.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:46:07 -04:00
Yongqiang Yang
0529155e8a ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:43:56 -04:00
Tao Ma
ced156e464 ext4: don't increment s_mb_buddies_generated in ext4_mb_release
In ext4_mb_release, we use s_mb_buddies_generated++.  Although
the output is OK, but I don't think we need this extra ++.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-23 16:18:05 -04:00
Tao Ma
529da704ad ext4: remove unnecessary ext4_get_group_info in ext4_mb_load_buddy
ext4_mb_load_buddy() calls ext4_get_group_info() for setting both
"grp" and "e4b->bd_info", but it could do "e4b->bd_info = grp".

Reported-by:  Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-23 16:07:26 -04:00
Dan Ehrenberg
d7a1fee135 ext4: make the preallocation size be a multiple of stripe size
Previously, if a stripe width was provided, then it would be used
as the preallocation granularity, with no santiy checking and no
way to override this. Now, mb_prealloc_size defaults to the smallest
multiple of stripe size that is greater than or equal to the old
default mb_prealloc_size, and this can be overridden with the sysfs
interface.

Signed-off-by: Dan Ehrenberg <dehrenberg@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-17 21:11:30 -04:00
Tao Ma
caaf7a29d3 ext4: Fix a double free of sbi->s_group_info in ext4_mb_init_backend
If we meet with an error in ext4_mb_add_groupinfo, we kfree
sbi->s_group_info[group >> EXT4_DESC_PER_BLOCK_BITS(sb)], but fail to
reset it to NULL. So the caller ext4_mb_init_backend will try to kfree
it again and causes a double free. So fix it by resetting it to NULL.

Some typo in comments of mballoc.c are also changed.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 18:42:42 -04:00
Tao Ma
823ba01fc0 ext4: fix a race which could leak memory in ext4_groupinfo_create_slab()
In ext4_groupinfo_create_slab, we create ext4_groupinfo_caches within
ext4_grpinfo_slab_create_mutex, but set it outside the lock, and there
does exist some case that we may create it twice and causes a memory
leak.  So set it before we call mutex_unlock.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 18:26:01 -04:00
Tao Ma
22612283f7 ext4: Change the wrong param comment for ext4_trim_all_free
at ext4_trim_all_free() comment, there is no longer an @e4b parameter,
instead it is @group.

Reported-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 00:04:34 -04:00
Tao Ma
3d56b8d2c7 ext4: Speed up FITRIM by recording flags in ext4_group_info
In ext4, when FITRIM is called every time, we iterate all the
groups and do trim one by one. It is a bit time wasting if the
group has been trimmed and there is no change since the last
trim.

So this patch adds a new flag in ext4_group_info->bb_state to
indicate that the group has been trimmed, and it will be cleared
if some blocks is freed(in release_blocks_on_commit). Another
trim_minlen is added in ext4_sb_info to record the last minlen
we use to trim the volume, so that if the caller provide a small
one, we will go on the trim regardless of the bb_state.

A simple test with my intel x25m ssd:
df -h shows:
/dev/sdb1              40G   21G   17G  56% /mnt/ext4
Block size:               4096

run the FITRIM with the following parameter:
range.start = 0;
range.len = UINT64_MAX;
range.minlen = 1048576;

without the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.505s
user	0m0.000s
sys	0m1.224s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.359s
user	0m0.000s
sys	0m1.178s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.228s
user	0m0.000s
sys	0m1.151s

with the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.625s
user	0m0.000s
sys	0m1.269s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m0.002s
user	0m0.000s
sys	0m0.001s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m0.002s
user	0m0.000s
sys	0m0.001s

A big improvement for the 2nd and 3rd run.

Even after I delete some big image files, it is still much
faster than iterating the whole disk.

[root@boyu-tm test]# time ./ftrim /mnt/ext4/a
real	0m1.217s
user	0m0.000s
sys	0m0.196s

Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 00:03:38 -04:00
Tao Ma
b3d4c2b10b ext4: Add new ext4 trim tracepoints
Add ext4_trim_extent and ext4_trim_all_free.

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 00:01:52 -04:00
Tao Ma
169ddc3ec8 ext4: speed up group trim with the right free block count
When we trim some free blocks in a group of ext4, we need to 
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. Current solution will
only calculate free spaces if they are large for a trim which
isn't appropriate.

Let us see a small example:
a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M.  With current solution, we have to iterate
the whole group since these 300k will never be subtracted from
1.5M.  But actually we should exit after we find the first 2
free spaces since the left 3 chunks only sum up to 900K if we
subtract the first 600K although they can't be trimed.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 00:00:07 -04:00
Tao Ma
22f1045743 ext4: fix trim length underflow with small trim length
In 0f0a25b, we adjust 'len' with s_first_data_block - start, but
it could underflow in case blocksize=1K, fstrim_range.len=512 and
fstrim_range.start = 0. In this case, when we run the code:
len -= first_data_blk - start; len will be underflow to -1ULL.
In the end, although we are safe that last_group check later will limit
the trim to the whole volume, but that isn't what the user really want.

So this patch fix it. It also adds the check for 'start' like ext3 so that
we can break immediately if the start is invalid.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-10 23:52:37 -04:00
Maxim Patlasov
7132de744b ext4: fix i_blocks/quota accounting when extent insertion fails
The current implementation of ext4_free_blocks() always calls
dquot_free_block This looks quite sensible in the most cases: blocks
to be freed are associated with inode and were accounted in quota and
i_blocks some time ago.

However, there is a case when blocks to free were not accounted by the
time calling ext4_free_blocks() yet:

1. delalloc is on, write_begin pre-allocated some space in quota
2. write-back happens, ext4 allocates some blocks in ext4_ext_map_blocks()
3. then ext4_ext_map_blocks() gets an error (e.g.  ENOSPC) from
   ext4_ext_insert_extent() and calls ext4_free_blocks().

In this scenario, ext4_free_blocks() calls dquot_free_block() who, in
turn, decrements i_blocks for blocks which were not accounted yet (due
to delalloc) After clean umount, e2fsck reports something like:

> Inode 21, i_blocks is 5080, should be 5128.  Fix<y>?
because i_blocks was erroneously decremented as explained above.

The patch fixes the problem by passing the new flag
EXT4_FREE_BLOCKS_NO_QUOT_UPDATE to ext4_free_blocks(), to request
that the dquot_free_block() call be skipped.

Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-07-10 19:37:48 -04:00
Yongqiang Yang
9331b62610 ext4: quiet 'unused variables' compile warnings
Unused variables was deleted.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-28 10:19:05 -04:00
Lukas Czerner
a9c667f8f0 ext4: fixed tracepoints cleanup
While creating fixed tracepoints for ext3, basically by porting them
from ext4, I found a lot of useless retyping, wrong type usage, useless
variable passing and other inconsistencies in the ext4 fixed tracepoint
code.

This patch cleans the fixed tracepoint code for ext4 and also simplify
some of them.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-06 09:51:52 -04:00
Yongqiang Yang
5def136025 ext4: correct comments for ext4_free_blocks()
metadata is not parameter of ext4_free_blocks() any more.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-06-05 23:26:40 -04:00
Allison Henderson
55f020db66 ext4: add flag to ext4_has_free_blocks
This patch adds an allocation request flag to the ext4_has_free_blocks
function which enables the use of reserved blocks.  This will allow a
punch hole to proceed even if the disk is full.  Punching a hole may
require additional blocks to first split the extents.

Because ext4_has_free_blocks is a low level function, the flag needs
to be passed down through several functions listed below:

ext4_ext_insert_extent
ext4_ext_create_new_leaf
ext4_ext_grow_indepth
ext4_ext_split
ext4_ext_new_meta_block
ext4_mb_new_blocks
ext4_claim_free_blocks
ext4_has_free_blocks

[ext4 punch hole patch series 1/5 v7]

Signed-off-by: Allison Henderson <achender@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
2011-05-25 07:41:26 -04:00
Lukas Czerner
28739eea9c ext4: protect bb_first_free in ext4_trim_all_free() with group lock
We should protect reading bd_info->bb_first_free with the group lock
because otherwise we might miss some free blocks. This is not a big deal
at all, but the change to do right thing is really simple, so lets do
that.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:28:07 -04:00
Lukas Czerner
7894408666 ext4: only load buddy bitmap in ext4_trim_fs() when it is needed
Currently we are loading buddy ext4_mb_load_buddy() for every block
group we are going through in ext4_trim_fs() in many cases just to find
out that there is not enough space to be bothered with. As Amir Goldstein
suggested we can use bb_free information directly from ext4_group_info.

This commit removes ext4_mb_load_buddy() from ext4_trim_fs() and rather
get the ext4_group_info via ext4_get_group_info() and use the bb_free
information directly from that. This avoids unnecessary call to load
buddy in the case the group does not have enough free space to trim.
Loading buddy is now moved to ext4_trim_all_free().

Tested by me with xfstests 251.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-24 18:16:27 -04:00
Amir Goldstein
44183d4231 ext4: remove alloc_semp
After taking care of all group init races, all that remains is to
remove alloc_semp from ext4_allocation_context and ext4_buddy structs.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 21:52:36 -04:00
Amir Goldstein
9b8b7d353f ext4: teach ext4_mb_init_cache() to skip uptodate buddy caches
After online resize which adds new groups, some of the groups
in a buddy page may be initialized and uptodate, while other
(new ones) may be uninitialized.

The indication for init of new block groups is when ext4_mb_init_cache()
is called with an uptodate buddy page. In this case, initialized groups
on that buddy page must be skipped when initializing the buddy cache.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 21:49:42 -04:00
Amir Goldstein
2de8807b25 ext4: synchronize ext4_mb_init_group() with buddy page lock
The old routines ext4_mb_[get|put]_buddy_cache_lock(), which used
to take grp->alloc_sem for all groups on the buddy page have been
replaced with the routines ext4_mb_[get|put]_buddy_page_lock().

The new routines take both buddy and bitmap page locks to protect
against concurrent init of groups on the same buddy page.

The GROUP_NEED_INIT flag is tested again under page lock to check
if the group was initialized by another caller.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 21:48:13 -04:00
Amir Goldstein
e73a347b77 ext4: implement ext4_add_groupblocks() by freeing blocks
The old imlementation used to take grp->alloc_sem and set the
GROUP_NEED_INIT flag, so that the buddy cache would be reloaded.

The new implementation updates the buddy cache by freeing the added
blocks and making them available for use, so there is no need to
reload the buddy cache and there is no need to take grp->alloc_sem.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 21:40:01 -04:00
Theodore Ts'o
2cd05cc393 ext4: remove unneeded ext4_journal_get_undo_access
The block allocation code used to use jbd2_journal_get_undo_access as
a way to make changes that wouldn't show up until the commit took
place.  The new multi-block allocation code has a its own way of
preventing newly freed blocks from getting reused until the commit
takes place (it avoids updating the buddy bitmaps until the commit is
done), so we don't need to use jbd2_journal_get_undo_access(), which
has extra overhead compared to jbd2_journal_get_write_access().

There was one last vestigal use of ext4_journal_get_undo_access() in
ext4_add_groupblocks(); change it to use ext4_journal_get_write_access()
and then remove the ext4_journal_get_undo_access() support.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:58:45 -04:00
Amir Goldstein
2846e82004 ext4: move ext4_add_groupblocks() to mballoc.c
In preparation for the next patch, the function ext4_add_groupblocks()
is moved to mballoc.c, where it could use some static functions.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-05-09 10:46:41 -04:00
Theodore Ts'o
d9f34504e6 ext4: ignore errors when issuing discards
This is an effective revert of commit a30eec2a8: "ext4: stop issuing
discards if not supported by device".  The problem is that there are
some devices that may return errors in response to a discard request
some times but not others.  (One example would be a hybrid dm device
which concatenates an SSD and an HDD device).

By this logic, I also removed the error checking from ext4's FITRIM
code; so that an error from a discard will not stop the FITRIM from
trying to trim the rest of the file system.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-04-30 13:47:24 -04:00
Yang Ruirui
26626f1172 ext4: release page cache in ext4_mb_load_buddy error path
Add missing page_cache_release in the error path of ext4_mb_load_buddy

Signed-off-by: Yang Ruirui <ruirui.r.yang@tieto.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2011-04-16 19:17:48 -04:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Tao Ma
0ba0851714 ext4: fix a BUG in mb_mark_used during trim.
In a bs=4096 volume, if we call FITRIM with the following parameter as
fstrim_range(start = 102400, len = 134144000, minlen = 10240),
we will trigger this BUG_ON:

	BUG_ON(start + len > (e4b->bd_sb->s_blocksize << 3));

Mar  4 00:55:52 boyu-tm kernel: ------------[ cut here ]------------
Mar  4 00:55:52 boyu-tm kernel: kernel BUG at fs/ext4/mballoc.c:1506!
Mar  4 01:21:09 boyu-tm kernel: Code: d4 00 00 00 00 49 89 fe 8b 56 0c 44 8b 7e 04 89 55 c4 48 8b 4f 28 89 d6 44 01 fe 48 63 d6 48 8b 41 18 48 c1 e0 03 48 39 c2 76 04 <0f> 0b eb fe 48 8b 55 b0 8b 47 34 3b 42 08 74 04 0f 0b eb fe 48
Mar  4 01:21:09 boyu-tm kernel: RIP  [<ffffffffa053eb42>] mb_mark_used+0x47/0x26c [ext4]
Mar  4 01:21:09 boyu-tm kernel:  RSP <ffff880121e45c38>
Mar  4 01:21:09 boyu-tm kernel: ---[ end trace 9f461696f6a9dcf2 ]---

Fix this bug by doing the accounting correctly.

Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-23 15:48:11 -04:00
Eric Sandeen
4596fe0767 ext4: don't kfree uninitialized s_group_info members
We can call kfree on uninitialized members of the s_group_info array
on an the error path.  We can avoid this by kzalloc'ing the array.

This doesn't entirely solve the oops on mount if we fail down this
path; failed_mount4: frees the sbi, for one, which gets referenced
later in the failed mount paths - I haven't worked that out yet.

https://bugzilla.kernel.org/show_bug.cgi?id=30872

Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-03-21 21:25:13 -04:00