I recently found a case where de->name_len is 0 in f2fs_fill_dentries()
easily reproduced, and finally set the fsck flag.
Thread A Thread B
- f2fs_readdir
- f2fs_read_inline_dir
- ctx->pos = d.max
- f2fs_add_dentry
- f2fs_add_inline_entry
- do_convert_inline_dir
- f2fs_add_regular_entry
- f2fs_readdir
- f2fs_fill_dentries
- set_sbi_flag(sbi, SBI_NEED_FSCK)
Process A opens the folder, and has been reading without closing it.
During this period, Process B created a file under the folder (occupying
multiple f2fs_dir_entry, exceeding the d.max of the inline dir). After
creation, process A uses the d.max of inline dir to read it again, and
it will read that de->name_len is 0.
And Chao pointed out that w/o inline conversion, the race condition still
can happen as below:
dir_entry1: A
dir_entry2: B
dir_entry3: C
free slot: _
ctx->pos: ^
Thread A is traversing directory,
ctx-pos moves to below position after readdir() by thread A:
AAAABBBB___
^
Then thread B delete dir_entry2, and create dir_entry3.
Thread A calls readdir() to lookup dirents starting from middle
of new dirent slots as below:
AAAACCCCCC_
^
In these scenarios, the file system is not damaged, and it's hard to
avoid it. But we can bypass tagging FSCK flag if:
a) bit_pos (:= ctx->pos % d->max) is non-zero and
b) before bit_pos moves to first valid dir_entry.
Fixes: ddf06b753a ("f2fs: fix to trigger fsck if dirent.name_len is zero")
Signed-off-by: Yangtao Li <frank.li@vivo.com>
[Chao: clean up description]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a slab cache: "f2fs_casefolded_name" for memory allocation
of casefold name.
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Restruct f2fs page private layout for below reasons:
There are some cases that f2fs wants to set a flag in a page to
indicate a specified status of page:
a) page is in transaction list for atomic write
b) page contains dummy data for aligned write
c) page is migrating for GC
d) page contains inline data for inline inode flush
e) page belongs to merkle tree, and is verified for fsverity
f) page is dirty and has filesystem/inode reference count for writeback
g) page is temporary and has decompress io context reference for compression
There are existed places in page structure we can use to store
f2fs private status/data:
- page.flags: PG_checked, PG_private
- page.private
However it was a mess when we using them, which may cause potential
confliction:
page.private PG_private PG_checked page._refcount (+1 at most)
a) -1 set +1
b) -2 set
c), d), e) set
f) 0 set +1
g) pointer set
The other problem is page.flags has no free slot, if we can avoid set
zero to page.private and set PG_private flag, then we use non-zero value
to indicate PG_private status, so that we may have chance to reclaim
PG_private slot for other usage. [1]
The other concern is f2fs has bad scalability in aspect of indicating
more page status.
So in this patch, let's restructure f2fs' page.private as below to
solve above issues:
Layout A: lowest bit should be 1
| bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
bit 0 PAGE_PRIVATE_NOT_POINTER
bit 1 PAGE_PRIVATE_ATOMIC_WRITE
bit 2 PAGE_PRIVATE_DUMMY_WRITE
bit 3 PAGE_PRIVATE_ONGOING_MIGRATION
bit 4 PAGE_PRIVATE_INLINE_INODE
bit 5 PAGE_PRIVATE_REF_RESOURCE
bit 6- f2fs private data
Layout B: lowest bit should be 0
page.private is a wrapped pointer.
After the change:
page.private PG_private PG_checked page._refcount (+1 at most)
a) 11 set +1
b) 101 set +1
c) 1001 set +1
d) 10001 set +1
e) set
f) 100001 set +1
g) pointer set +1
[1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we added a new mount option, "checkpoint_merge", which introduces
a kernel thread dealing with the f2fs checkpoints. Once we start to manage the
IO priority along with blk-cgroup, the checkpoint operation can be processed in
a lower priority under the process context. Since the checkpoint holds all the
filesystem operations, we give a higher priority to the checkpoint thread all
the time.
Enhancement:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fix:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In addition,
we fixed some minor race conditions and error handling routines.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCQhVIACgkQQBSofoJI
UNIggA/8DZINzFLMCj6+6P5wNAWj3nYtx/FnwZ7C31f8qkiZjgA4LfONUnDvV7sU
GS8MuLQz4eTYfqU2rVgGiSm+aCkEOovnk7C7Huo7pezgqYb+5J6ACXsqdU3dcD5M
kShJMqLKcTKqtOMbnJrdGvmw1/ysuAi7UhSSgVV+9NQxlhxADnagOGbQ7lXNSV3R
spGMWazGY2uA5DFCCa4lMX79lyFATCzEKB3SKAW5r+8QSmxJY8ViK2Er7AnvwRJz
XJ/QJ8ALNb/GGyHzBWFv3P6Yxo/G3FkUvTIc5Rhi9P2lUgjI2NALokj7AOnfNh4a
uSXHVlNrrfH+gpx9xr5z8MUmroCYCCOJ6EhnVweqViRmekY8jSb2HxmUtDTIf19U
LWl3gtD2GDQx6CY0a0K58Oa2Lp0Bp9MWUdPA/4P21EymZwXum7aCkhV+DnigcoCj
yCmKlI8nIpCS97dIO/7MsnG6Tu/7c+Prytd2ezUo+6hlkXPZk8shs+elnjlWu/6V
3ZpSWKzQsPJL8U7eB9H04AxEokrrXm3fRhR86C7JdkEe0gyGFf3dB/G+jqzcNi5m
ZpxiCeZ8RbNmPpnH8NWeYHk9uDKMOXuUPFYDoaOwImNWfqj0jfhiTxHo4MyBLAuk
MT632ICcuJLvwgnSbMAI3U7v6+dZXKH4y6U7IHFjxhI7beMzxmI=
=P5uk
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we added a new mount option, "checkpoint_merge", which
introduces a kernel thread dealing with the f2fs checkpoints. Once we
start to manage the IO priority along with blk-cgroup, the checkpoint
operation can be processed in a lower priority under the process
context. Since the checkpoint holds all the filesystem operations, we
give a higher priority to the checkpoint thread all the time.
Enhancements:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fixes:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In
addition, we fixed some minor race conditions and error handling
routines"
* tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (48 commits)
f2fs: drop inplace IO if fs status is abnormal
f2fs: compress: remove unneed check condition
f2fs: clean up left deprecated IO trace codes
f2fs: avoid using native allocate_segment_by_default()
f2fs: remove unnecessary struct declaration
f2fs: fix to avoid NULL pointer dereference
f2fs: avoid duplicated codes for cleanup
f2fs: document: add description about compressed space handling
f2fs: clean up build warnings
f2fs: fix the periodic wakeups of discard thread
f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
f2fs: fix to avoid GC/mmap race with f2fs_truncate()
f2fs: set checkpoint_merge by default
f2fs: Fix a hungtask problem in atomic write
f2fs: fix to restrict mount condition on readonly block device
f2fs: introduce gc_merge mount option
f2fs: fix to cover __allocate_new_section() with curseg_lock
f2fs: fix wrong alloc_type in f2fs_do_replace_block
f2fs: delete empty compress.h
f2fs: fix a typo in inode.c
...
This patch combined the below three clean-up patches.
- modify open brace '{' following function definitions
- ERROR: spaces required around that ':'
- ERROR: spaces required before the open parenthesis '('
- ERROR: spaces prohibited before that ','
- Made suggested modifications from checkpatch in reference to WARNING:
Missing a blank line after declarations
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've made more work into per-file compression support. For
example, F2FS_IOC_GET|SET_COMPRESS_OPTION provides a way to change the
algorithm or cluster size per file. F2FS_IOC_COMPRESS|DECOMPRESS_FILE provides
a way to compress and decompress the existing normal files manually along with
a new mount option, compress_mode=fs|user, which can control who compresses the
data. Chao also added a checksum feature with a mount option so that we are able
to detect any corrupted cluster. In addition, Daniel contributed casefolding
with encryption patch, which will be used for Android devices.
Enhancement:
- add ioctls and mount option to manage per-file compression feature
- support casefolding with encryption
- support checksum for compressed cluster
- avoid IO starvation by replacing mutex with rwsem
- add sysfs, max_io_bytes, to control max bio size
Bug fix:
- fix use-after-free issue when compression and fsverity are enabled
- fix consistency corruption during fault injection test
- fix data offset for lseek
- get rid of buffer_head which has 32bits limit in fiemap
- fix some bugs in multi-partitions support
- fix nat entry count calculation in shrinker
- fix some stat information
And, we've refactored some logics and fix minor bugs as well.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl/a8ywACgkQQBSofoJI
UNLa2RAAjK+6tOs+NuYx2w9SegghKxwCg4Mb362BMdaAGx6GzMqAkCiVdujuoz/r
+wy8sdqO9QE7723ZDNsebNMLRnkNPHnpneSL2p6OsSLJrD3ORTELVRrzNlkemvnK
rRHZyYnNJvQQnD4uU7ABvROKsIDw/nCfcFvzHmLIgEw8EHO0W4n6fTtBdTwXv1qi
N3qXhGuQldonR9XICuGjzj7wh17n9ua6Mr12XX3Ok38giMcZb9KFBwgvlhl35cxt
htEmUpxWD3NTSw6zJmV4VAiajpiIkW6QRQuVA1nzdLZK644gaJMhM1EUsOnZhfDl
wX0ZtKoNkXxb0glD34O3aYqeHJ3tHWgPmmpVm9TECJP9A/X7kmEHgQYpH/eJ9I7d
tk51Uz28Mz1RShXU4i5RyKZeeoNTLiVlqiC95E2cnq4C1tLOJyI00N9AinrLzvR+
fqUrAwCrBpiYX63mWKYwq7GWxWwp4+PY09kyIZxxJiWhTE/St0bRx2bQL8zA8C6J
Rtxl+QWyQhkFbNu8fAukLFAhC6mqX/FKpXvUqRehBnHRvMWBiVZG0//eOPQLk71u
qsdCgYuEVcg3itDQrZvmsjxi4Pb5E9mNr0s5oC4I2WvBPMheD4esSyG7cKDN0qfS
3FFHlRYLOvnjPMLnKTmZXjFvFyHR8mwsD4Z83MeSrqYnWC14tFY=
=KneU
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've made more work into per-file compression support.
For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to
change the algorithm or cluster size per file. F2FS_IOC_COMPRESS |
DECOMPRESS_FILE provides a way to compress and decompress the existing
normal files manually.
There is also a new mount option, compress_mode=fs|user, which can
control who compresses the data.
Chao also added a checksum feature with a mount option so that
we are able to detect any corrupted cluster.
In addition, Daniel contributed casefolding with encryption patch,
which will be used for Android devices.
Summary:
Enhancements:
- add ioctls and mount option to manage per-file compression feature
- support casefolding with encryption
- support checksum for compressed cluster
- avoid IO starvation by replacing mutex with rwsem
- add sysfs, max_io_bytes, to control max bio size
Bug fixes:
- fix use-after-free issue when compression and fsverity are enabled
- fix consistency corruption during fault injection test
- fix data offset for lseek
- get rid of buffer_head which has 32bits limit in fiemap
- fix some bugs in multi-partitions support
- fix nat entry count calculation in shrinker
- fix some stat information
And, we've refactored some logics and fix minor bugs as well"
* tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
f2fs: compress: fix compression chksum
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
f2fs: fix race of pending_pages in decompression
f2fs: fix to account inline xattr correctly during recovery
f2fs: inline: fix wrong inline inode stat
f2fs: inline: correct comment in f2fs_recover_inline_data
f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
f2fs: convert to F2FS_*_INO macro
f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
f2fs: don't allow any writes on readonly mount
f2fs: avoid race condition for shrinker count
f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
f2fs: add compress_mode mount option
f2fs: Remove unnecessary unlikely()
f2fs: init dirty_secmap incorrectly
f2fs: remove buffer_head which has 32bits limit
f2fs: fix wrong block count instead of bytes
f2fs: use new conversion functions between blks and bytes
f2fs: rename logical_to_blk and blk_to_logical
f2fs: fix kbytes written stat for multi-device case
...
Expand f2fs's casefolding support to include encrypted directories. To
index casefolded+encrypted directories, we use the SipHash of the
casefolded name, keyed by a key derived from the directory's fscrypt
master key. This ensures that the dirhash doesn't leak information
about the plaintext filenames.
Encryption keys are unavailable during roll-forward recovery, so we
can't compute the dirhash when recovering a new dentry in an encrypted +
casefolded directory. To avoid having to force a checkpoint when a new
file is fsync'ed, store the dirhash on-disk appended to i_name.
This patch incorporates work by Eric Biggers <ebiggers@google.com>
and Jaegeuk Kim <jaegeuk@kernel.org>.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.
Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.
Since the fscrypt d_ops are set later on, we must set all d_ops there,
since we cannot adjust those later on. This should not result in any
change in behavior.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The last remaining use of fscrypt_get_encryption_info() from filesystems
is for readdir (->iterate_shared()). Every other call is now in
fs/crypto/ as part of some other higher-level operation.
We need to add a new argument to fscrypt_get_encryption_info() to
indicate whether the encryption policy is allowed to be unrecognized or
not. Doing this is easier if we can work with high-level operations
rather than direct filesystem use of fscrypt_get_encryption_info().
So add a function fscrypt_prepare_readdir() which wraps the call to
fscrypt_get_encryption_info() for the readdir use case.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201203022041.230976-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Since encrypted directories can be opened and searched without their key
being available, and each readdir and ->lookup() tries to set up the
key, trying to set up the key in ->open() too isn't really useful.
Just remove it so that directories don't need an ->open() method
anymore, and so that we eliminate a use of fscrypt_get_encryption_info()
(which I'd like to stop exporting to filesystems).
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20201203022041.230976-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
In this round, we've added new features such as zone capacity for ZNS and
a new GC policy, ATGC, along with in-memory segment management. In addition,
we could improve the decompression speed significantly by changing virtual
mapping method. Even though we've fixed lots of small bugs in compression
support, I feel that it becomes more stable so that I could give it a try in
production.
Enhancement:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fix:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and minor
bug fixes for compression and zoned device support.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl+I0nQACgkQQBSofoJI
UNIJdw//Rj0YapYXSu1nlOQppzSAkCiL6wxrrG2qkisRE7uXSYGZfWBqrMT6Asnf
i0f25i6ywZOZ00dgN/klZRBh4YYSgJqYx9BPkTxZsQZ/S/EmZJPpr8m4VUB69LKL
VijwUdgcW9vNDJ2/DkDQDVBd/ZqRxXnltffWtP4pS96Gj089/dE2q8KXqQrt3LM1
lLQjDfHj+0AyWRzKpErTO0W9DOgO7wmmelS0h6m2RYttkbb328JEZezg5bjWNNlk
eTXxuAFFy8Ap9DngkC/sqvY2NRTv1YgOPfrT8XWwdDIiFTZ+LoYdFI5Ap/UW7QwG
iz7B/0wPj3+9ncl536LRbFPiLisbYrArYGmZKF6t8w1cP6mTVlileMT4Q6s9+qhn
GJS88tTBVlR9vbzDu2brjI6qRQVTBdsohIGoA1g6lz0ogbphhmTzujPbFQ6GTSBi
3sKKp59urkBpVH3TVJU1oshLjIEG2yToMgYwZH9DU7zlzJS6XpetJrzReqItEThc
VNixg2DxdIFQ+nrMt+LtWaOHs5qzxIIPksguGEhqSkLL5lI75n2MZxrhKXmUsaZa
qItJE0ndJFfi6vggIkJID+a0bpTss7+AxF1AmSZDafMLkZy8j14DNQAnmBUYRX0J
5QExZ+LyyaKjWQ/k9SsSzV1Y3dbguDyB+gkeMhr/6XEr9DSwZc8=
=exgv
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added new features such as zone capacity for ZNS
and a new GC policy, ATGC, along with in-memory segment management. In
addition, we could improve the decompression speed significantly by
changing virtual mapping method. Even though we've fixed lots of small
bugs in compression support, I feel that it becomes more stable so
that I could give it a try in production.
Enhancements:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fixes:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and
minor bug fixes for compression and zoned device support"
* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: code cleanup by removing unnecessary check
f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
f2fs: fix writecount false positive in releasing compress blocks
f2fs: introduce check_swap_activate_fast()
f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
f2fs: handle errors of f2fs_get_meta_page_nofail
f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
f2fs: reject CASEFOLD inode flag without casefold feature
f2fs: fix memory alignment to support 32bit
f2fs: fix slab leak of rpages pointer
f2fs: compress: fix to disallow enabling compress on non-empty file
f2fs: compress: introduce cic/dic slab cache
f2fs: compress: introduce page array slab cache
f2fs: fix to do sanity check on segment/section count
f2fs: fix to check segment boundary during SIT page readahead
f2fs: fix uninit-value in f2fs_lookup
f2fs: remove unneeded parameter in find_in_block()
f2fs: fix wrong total_sections check and fsmeta check
f2fs: remove duplicated code in sanity_check_area_boundary
f2fs: remove unused check on version_bitmap
...
As syzbot reported:
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x21c/0x280 lib/dump_stack.c:118
kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
__msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
f2fs_lookup+0xe05/0x1a80 fs/f2fs/namei.c:503
lookup_open fs/namei.c:3082 [inline]
open_last_lookups fs/namei.c:3177 [inline]
path_openat+0x2729/0x6a90 fs/namei.c:3365
do_filp_open+0x2b8/0x710 fs/namei.c:3395
do_sys_openat2+0xa88/0x1140 fs/open.c:1168
do_sys_open fs/open.c:1184 [inline]
__do_compat_sys_openat fs/open.c:1242 [inline]
__se_compat_sys_openat+0x2a4/0x310 fs/open.c:1240
__ia32_compat_sys_openat+0x56/0x70 fs/open.c:1240
do_syscall_32_irqs_on arch/x86/entry/common.c:80 [inline]
__do_fast_syscall_32+0x129/0x180 arch/x86/entry/common.c:139
do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:162
do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:205
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
In f2fs_lookup(), @res_page could be used before being initialized,
because in __f2fs_find_entry(), once F2FS_I(dir)->i_current_depth was
been fuzzed to zero, then @res_page will never be initialized, causing
this kmsan warning, relocating @res_page initialization place to fix
this bug.
Reported-by: syzbot+0eac6f0bbd558fd866d7@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can relocate @res_page assignment in find_in_block() to
its caller, so unneeded parameter could be removed for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Currently we're using the term "ciphertext name" ambiguously because it
can mean either the actual ciphertext filename, or the encoded filename
that is shown when an encrypted directory is listed without its key.
The latter we're now usually calling the "no-key name"; and while it's
derived from the ciphertext name, it's not the same thing.
To avoid this ambiguity, rename fscrypt_name::is_ciphertext_name to
fscrypt_name::is_nokey_name, and update comments that say "ciphertext
name" (or "encrypted name") to say "no-key name" instead when warranted.
Link: https://lore.kernel.org/r/20200924042624.98439-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert f2fs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from under f2fs_lock_op(), which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
This also fixes a f2fs-specific deadlock when the filesystem is mounted
with '-o test_dummy_encryption' and a file is created in an unencrypted
directory other than the root directory:
INFO: task touch:207 blocked for more than 30 seconds.
Not tainted 5.9.0-rc4-00099-g729e3d0919844 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch state:D stack: 0 pid: 207 ppid: 167 flags:0x00000000
Call Trace:
[...]
lock_page include/linux/pagemap.h:548 [inline]
pagecache_get_page+0x25e/0x310 mm/filemap.c:1682
find_or_create_page include/linux/pagemap.h:348 [inline]
grab_cache_page include/linux/pagemap.h:424 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2395 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2373 [inline]
__get_node_page.part.0+0x39/0x2d0 fs/f2fs/node.c:1350
__get_node_page fs/f2fs/node.c:35 [inline]
f2fs_get_node_page+0x2e/0x60 fs/f2fs/node.c:1399
read_inline_xattr+0x88/0x140 fs/f2fs/xattr.c:288
lookup_all_xattrs+0x1f9/0x2c0 fs/f2fs/xattr.c:344
f2fs_getxattr+0x9b/0x160 fs/f2fs/xattr.c:532
f2fs_get_context+0x1e/0x20 fs/f2fs/super.c:2460
fscrypt_get_encryption_info+0x9b/0x450 fs/crypto/keysetup.c:472
fscrypt_inherit_context+0x2f/0xb0 fs/crypto/policy.c:640
f2fs_init_inode_metadata+0xab/0x340 fs/f2fs/dir.c:540
f2fs_add_inline_entry+0x145/0x390 fs/f2fs/inline.c:621
f2fs_add_dentry+0x31/0x80 fs/f2fs/dir.c:757
f2fs_do_add_link+0xcd/0x130 fs/f2fs/dir.c:798
f2fs_add_link fs/f2fs/f2fs.h:3234 [inline]
f2fs_create+0x104/0x290 fs/f2fs/namei.c:344
lookup_open.isra.0+0x2de/0x500 fs/namei.c:3103
open_last_lookups+0xa9/0x340 fs/namei.c:3177
path_openat+0x8f/0x1b0 fs/namei.c:3365
do_filp_open+0x87/0x130 fs/namei.c:3395
do_sys_openat2+0x96/0x150 fs/open.c:1168
[...]
That happened because f2fs_add_inline_entry() locks the directory
inode's page in order to add the dentry, then f2fs_get_context() tries
to lock it recursively in order to read the encryption xattr. This
problem is specific to "test_dummy_encryption" because normally the
directory's fscrypt_info would be set up prior to
f2fs_add_inline_entry() in order to encrypt the new filename.
Regardless, the new design fixes this test_dummy_encryption deadlock as
well as potential deadlocks with fs reclaim, by setting up any needed
fscrypt_info structs prior to taking so many locks.
The test_dummy_encryption deadlock was reported by Daniel Rosenberg.
Reported-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
This switches f2fs over to the generic support provided in
the previous patch.
Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
stakable/stackable
Signed-off-by: Liu Song <fishland@aliyun.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
it may be concurrently modified by a rename. This can cause undefined
behavior (possibly out-of-bounds memory accesses or crashes) in
utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
that may be concurrently modified.
Fix this by first copying the filename to a stack buffer if needed.
This way we get a stable snapshot of the filename.
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Daniel Rosenberg <drosen@google.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rework f2fs's handling of filenames to use a new 'struct f2fs_filename'.
Similar to 'struct ext4_filename', this stores the usr_fname, disk_name,
dirhash, crypto_buf, and casefolded name. Some of these names can be
NULL in some cases. 'struct f2fs_filename' differs from
'struct fscrypt_name' mainly in that the casefolded name is included.
For user-initiated directory operations like lookup() and create(),
initialize the f2fs_filename by translating the corresponding
fscrypt_name, then computing the dirhash and casefolded name if needed.
This makes the dirhash and casefolded name be cached for each syscall,
so we don't have to recompute them repeatedly. (Previously, f2fs
computed the dirhash once per directory level, and the casefolded name
once per directory block.) This improves performance.
This rework also makes it much easier to correctly handle all
combinations of normal, encrypted, casefolded, and encrypted+casefolded
directories. (The fourth isn't supported yet but is being worked on.)
The only other cases where an f2fs_filename gets initialized are for two
filesystem-internal operations: (1) when converting an inline directory
to a regular one, we grab the needed disk_name and hash from an existing
f2fs_dir_entry; and (2) when roll-forward recovering a new dentry, we
grab the needed disk_name from f2fs_inode::i_name and compute the hash.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Sharing f2fs_ci_compare() between comparing cached dentries
(f2fs_d_compare()) and comparing on-disk dentries (f2fs_match_name())
doesn't work as well as intended, as these actions fundamentally differ
in several ways (e.g. whether the task may sleep, whether the directory
is stable, whether the casefolded name was precomputed, whether the
dentry will need to be decrypted once we allow casefold+encrypt, etc.)
Just make f2fs_d_compare() implement what it needs directly, and rework
f2fs_ci_compare() to be specialized for f2fs_match_name().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Merge below two conditions into f2fs_may_encrypt() for cleanup
- IS_ENCRYPTED()
- DUMMY_ENCRYPTION_ENABLED()
Check IS_ENCRYPTED(inode) condition in f2fs_init_inode_metadata()
is enough since we have already set encrypt flag in f2fs_new_inode().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this series, we've implemented transparent compression experimentally. It
supports LZO and LZ4, but will add more later as we investigate in the field
more. At this point, the feature doesn't expose compressed space to user
directly in order to guarantee potential data updates later to the space.
Instead, the main goal is to reduce data writes to flash disk as much as
possible, resulting in extending disk life time as well as relaxing IO
congestion. Alternatively, we're also considering to add ioctl() to reclaim
compressed space and show it to user after putting the immutable bit.
Enhancement:
- add compression support
- avoid unnecessary locks in quota ops
- harden power-cut scenario for zoned block devices
- use private bio_set to avoid IO congestion
- replace GC mutex with rwsem to serialize callers
Bug fix:
- fix dentry consistency and memory corruption in rename()'s error case
- fix wrong swap extent reports
- fix casefolding bugs
- change lock coverage to avoid deadlock
- avoid GFP_KERNEL under f2fs_lock_op
And, we've cleaned up sysfs entries to prepare no debugfs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl4zInwACgkQQBSofoJI
UNL4Tg/+JBbVEFa3IUBGMdbjfgd/g0Jye++iMAYYGRWT6Ll/IGcHRV9NunITjgWU
mBZqdhI28kXeiGCcewB1ZvivjLx22X4n6yevHk2B5A6PNe9IDCHi0HOAhJJHkjPH
ecv2L+vX3Oj4y0+H7JNz9Fo3OIPJvMPtCQWlg1z+VQyhB85zNP7fZlvvIY4tG8yw
ERo0YNotLqwcF1BxCwNbAhV3aJGDxar+MI//yNzpiwDX7IptVpqestfcoIYc9kKL
4kSWRyEIGwcuIeyoM6aofGS9t4Z/Oe/gdqcxNr6l5n0Q/tMTpb4b/fJFGNr6RRx9
X9NQo8flkQb2DEIOP0DVpO2aPebzsVtzg3LZUOLA83+wCHfwINtHai2Dy2zDJ2my
BrVdou8fe2oxoaYihJg/Tz9cd0nA/6mZArtpYvDImAmX/xuGOvVk9zZkXNwc9nVX
EyVzy0vW4lA6gAIJ95aG6DDhJcAtVoy0MhBRWG92Pufxhn9aW24AV63ChWUf9DRx
/3RqpMAuQ3UC2gOxXKKnr54lsdhUIMn/y9sjROkVvQ1BvgRVxO8I4GFvMHMKv9pR
9KXiVRdzyYERyoL4+MF7A2zTnw+RHL4RVILa85p2ALGy2jQ1UuNUQi0BN9x2u1v8
S1ifNNX8SwOP+83ImFJhhn3HybpFQ45aLO3F7ZjKBQAnufJu+xw=
=zeoY
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this series, we've implemented transparent compression
experimentally. It supports LZO and LZ4, but will add more later as we
investigate in the field more.
At this point, the feature doesn't expose compressed space to user
directly in order to guarantee potential data updates later to the
space. Instead, the main goal is to reduce data writes to flash disk
as much as possible, resulting in extending disk life time as well as
relaxing IO congestion.
Alternatively, we're also considering to add ioctl() to reclaim
compressed space and show it to user after putting the immutable bit.
Enhancements:
- add compression support
- avoid unnecessary locks in quota ops
- harden power-cut scenario for zoned block devices
- use private bio_set to avoid IO congestion
- replace GC mutex with rwsem to serialize callers
Bug fixes:
- fix dentry consistency and memory corruption in rename()'s error case
- fix wrong swap extent reports
- fix casefolding bugs
- change lock coverage to avoid deadlock
- avoid GFP_KERNEL under f2fs_lock_op
And, we've cleaned up sysfs entries to prepare no debugfs"
* tag 'f2fs-for-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (31 commits)
f2fs: fix race conditions in ->d_compare() and ->d_hash()
f2fs: fix dcache lookup of !casefolded directories
f2fs: Add f2fs stats to sysfs
f2fs: delete duplicate information on sysfs nodes
f2fs: change to use rwsem for gc_mutex
f2fs: update f2fs document regarding to fsync_mode
f2fs: add a way to turn off ipu bio cache
f2fs: code cleanup for f2fs_statfs_project()
f2fs: fix miscounted block limit in f2fs_statfs_project()
f2fs: show the CP_PAUSE reason in checkpoint traces
f2fs: fix deadlock allocating bio_post_read_ctx from mempool
f2fs: remove unneeded check for error allocating bio_post_read_ctx
f2fs: convert inline_dir early before starting rename
f2fs: fix memleak of kobject
f2fs: fix to add swap extent correctly
f2fs: run fsck when getting bad inode during GC
f2fs: support data compression
f2fs: free sysfs kobject
f2fs: declare nested quota_sem and remove unnecessary sems
f2fs: don't put new_page twice in f2fs_rename
...
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL. For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:
int main()
{
if (fork()) {
for (;;) {
mkdir("subdir", 0700);
rmdir("subdir");
}
} else {
for (;;)
access("subdir/file", 0);
}
}
... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.
I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.
Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Do the name comparison for non-casefolded directories correctly.
This is analogous to ext4's commit 66883da1ee ("ext4: fix dcache
lookup of !casefolded directories").
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If we hit an error during rename, we'll get two dentries in different
directories.
Chao adds to check the room in inline_dir which can avoid needless
inversion. This should be done by inode_lock(&old_dir).
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Eric mentioned, bare printk{,_ratelimited} won't show which
filesystem instance these message is coming from, this patch tries
to show fs instance with sb->s_id field in all places we missed
before.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If inode is newly created, inode page may not synchronize with inode cache,
so fields like .i_inline or .i_extra_isize could be wrong, in below call
path, we may access such wrong fields, result in failing to migrate valid
target block.
Thread A Thread B
- f2fs_create
- f2fs_add_link
- f2fs_add_dentry
- f2fs_init_inode_metadata
- f2fs_add_inline_entry
- f2fs_new_inode_page
- f2fs_put_page
: inode page wasn't updated with inode cache
- gc_data_segment
- is_alive
- f2fs_get_node_page
- datablock_addr
- offset_in_addr
: access uninitialized fields
Fixes: 7a2af766af ("f2fs: enhance on-disk inode structure scalability")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces f2fs_match_name() for cleanup.
BTW, it avoids to fallback to normal comparison once it doesn't
match casefolded name.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Modeled after commit b886ee3e77 ("ext4: Support case-insensitive file
name lookups")
"""
This patch implements the actual support for case-insensitive file name
lookups in f2fs, based on the feature bit and the encoding stored in the
superblock.
A filesystem that has the casefold feature set is able to configure
directories with the +F (F2FS_CASEFOLD_FL) attribute, enabling lookups
to succeed in that directory in a case-insensitive fashion, i.e: match
a directory entry even if the name used by userspace is not a byte per
byte match with the disk name, but is an equivalent case-insensitive
version of the Unicode string. This operation is called a
case-insensitive file name lookup.
The feature is configured as an inode attribute applied to directories
and inherited by its children. This attribute can only be enabled on
empty directories for filesystems that support the encoding feature,
thus preventing collision of file names that only differ by case.
* dcache handling:
For a +F directory, F2Fs only stores the first equivalent name dentry
used in the dcache. This is done to prevent unintentional duplication of
dentries in the dcache, while also allowing the VFS code to quickly find
the right entry in the cache despite which equivalent string was used in
a previous lookup, without having to resort to ->lookup().
d_hash() of casefolded directories is implemented as the hash of the
casefolded string, such that we always have a well-known bucket for all
the equivalencies of the same string. d_compare() uses the
utf8_strncasecmp() infrastructure, which handles the comparison of
equivalent, same case, names as well.
For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.
* on-disk data:
Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetched from the
disk is a byte-per-byte match with what the user requested, making this
implementation 'name-preserving'. i.e. no actual information is lost
when writing to storage.
DX is supported by modifying the hashes used in +F directories to make
them case/encoding-aware. The new disk hashes are calculated as the
hash of the full casefolded string, instead of the string directly.
This allows us to efficiently search for file names in the htree without
requiring the user to provide an exact name.
* Dealing with invalid sequences:
By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that case-insensitive
file name lookup will not work only for that file. An optional bit can
be set in the superblock telling the filesystem code and userspace tools
to enforce the encoding. When that optional bit is set, any attempt to
create a file name using an invalid UTF-8 sequence will fail and return
an error to userspace.
* Normalization algorithm:
The UTF-8 algorithms used to compare strings in f2fs is implemented
in fs/unicode, and is based on a previous version developed by
SGI. It implements the Canonical decomposition (NFD) algorithm
described by the Unicode specification 12.1, or higher, combined with
the elimination of ignorable code points (NFDi) and full
case-folding (CF) as documented in fs/unicode/utf8_norm.c.
NFD seems to be the best normalization method for F2FS because:
- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like
compatibility decompositions.
Although:
- This implementation is not completely linguistic accurate, because
different languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more than
one language.
"""
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs uses EFAULT as error number to indicate filesystem is corrupted
all the time, but generic filesystems use EUCLEAN for such condition,
we need to change to follow others.
This patch adds two new macros as below to wrap more generic error
code macros, and spread them in code.
EFSBADCRC EBADMSG /* Bad CRC detected */
EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Add and use f2fs_<level> macros
- Convert f2fs_msg to f2fs_printk
- Remove level from f2fs_printk and embed the level in the format
- Coalesce formats and align multi-line arguments
- Remove unnecessary duplicate extern f2fs_msg f2fs.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We've continued mainly to fix bugs in this round, as f2fs has been shipped
in more devices. Especially, we've focused on stabilizing checkpoint=disable
feature, and provided some interfaces for QA.
Enhancement:
- expose FS_NOCOW_FL for pin_file
- run discard jobs at unmount time with timeout
- tune discarding thread to avoid idling which consumes power
- some checking codes to address vulnerabilities
- give random value to i_generation
- shutdown with more flags for QA
Bug fix:
- clean up stale objects when mount is failed along with checkpoint=disable
- fix system being stuck due to wrong count by atomic writes
- handle some corrupted disk cases
- fix a deadlock in f2fs_read_inline_dir
We've also added some minor build errors and clean-up patches.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlyKk4YACgkQQBSofoJI
UNIMVw//Rb3nmbQkMW/86DxtHDxuS8GEJmle0DiHeFMHgwy0ET0uZs9/AEfmuejC
95cXnF44QfVaFwkOXCK6aKXJXwN0+ZS0YvV/gPE8lgU6sdQhJBox5DC+rx+OwFq5
rZiF8qvE8iyM9Xt+RfMBGufzUb+LKBz0ozQFZpKJiNTBBf5vpeqMYASEEfxiEmZz
GvvUNSBRw39OB5zTl5l2hnoNqkoFu6XHnf4f9+DnraVi8SuQzj6hdqsx0nYTHfLi
Rax8kA4HUwoVgjhaLLXFbbhWIQ83bcZ0cj6wq7Lr7NbbIi7bKYP6sxtKjbe2Fuql
m9Chm2LIvD1BfJnjdTk2krqY7Z4bX/4gmXukno/8X/cjWkpBV6HFWS73iTgrJjU2
d8kBFXwlIn+JlATSjsTtdfvKkTwxUhaGw1bBA96Am4c5tLQyOqyYWcfQA/tam/v4
dM9EQX5ZeRb6NXDeIxkXNfTSpDRnqlhJsTV5aK8qporyF1RkKVbyCpSt1P4q3KO5
UwsGZLFAVMzFaUVfyIS7dR5QVczQUTCH4g0yFNpBMvF8epOA4+jbYxQeGZfqFK3H
mTC/Ba+VWWdYW2pZRNc9TnBsHg/xadMJq7EQb/ykGBe6JZJfB0wREj4LSr1lGK9a
cU8JFGyqg1Rt/uRP0bb5IIec1YVton3Lq8ND9VZPNcV/mS5Gehg=
=9BoH
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"We've continued mainly to fix bugs in this round, as f2fs has been
shipped in more devices. Especially, we've focused on stabilizing
checkpoint=disable feature, and provided some interfaces for QA.
Enhancements:
- expose FS_NOCOW_FL for pin_file
- run discard jobs at unmount time with timeout
- tune discarding thread to avoid idling which consumes power
- some checking codes to address vulnerabilities
- give random value to i_generation
- shutdown with more flags for QA
Bug fixes:
- clean up stale objects when mount is failed along with
checkpoint=disable
- fix system being stuck due to wrong count by atomic writes
- handle some corrupted disk cases
- fix a deadlock in f2fs_read_inline_dir
We've also added some minor build error fixes and clean-up patches"
* tag 'f2fs-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (53 commits)
f2fs: set pin_file under CAP_SYS_ADMIN
f2fs: fix to avoid deadlock in f2fs_read_inline_dir()
f2fs: fix to adapt small inline xattr space in __find_inline_xattr()
f2fs: fix to do sanity check with inode.i_inline_xattr_size
f2fs: give some messages for inline_xattr_size
f2fs: don't trigger read IO for beyond EOF page
f2fs: fix to add refcount once page is tagged PG_private
f2fs: remove wrong comment in f2fs_invalidate_page()
f2fs: fix to use kvfree instead of kzfree
f2fs: print more parameters in trace_f2fs_map_blocks
f2fs: trace f2fs_ioc_shutdown
f2fs: fix to avoid deadlock of atomic file operations
f2fs: fix to dirty inode for i_mode recovery
f2fs: give random value to i_generation
f2fs: no need to take page lock in readdir
f2fs: fix to update iostat correctly in IPU path
f2fs: fix encrypted page memory leak
f2fs: make fault injection covering __submit_flush_wait()
f2fs: fix to retry fill_super only if recovery failed
f2fs: silence VM_WARN_ON_ONCE in mempool_alloc
...
As Gao Xiang reported in bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=202749
f2fs may skip pageout() due to incorrect page reference count.
The problem here is that MM defined the rule [1] very clearly that
once page was set with PG_private flag, we should increment the
refcount in that page, also main flows like pageout(), migrate_page()
will assume there is one additional page reference count if
page_has_private() returns true.
But currently, f2fs won't add/del refcount when changing PG_private
flag. Anyway, f2fs should follow MM's rule to make MM's related flows
running as expected.
[1] https://lore.kernel.org/lkml/2b19b3c4-2bc4-15fa-15cc-27a13e5c7af1@aol.com/
Reported-by: Gao Xiang <gaoxiang25@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
VFS will take inode_lock for readdir, therefore no need to
take page lock in readdir at all just as the majority of
other generic filesystems.
This patch improves concurrency since .iterate_shared
was introduced to VFS years ago.
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This commit removes the f2fs specific f2fs_encrypted_inode() and makes
use of the generic IS_ENCRYPTED() macro to check for the encryption
status of an inode.
Acked-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Dentry bitmap is not enough to detect incorrect dentries. So this patch
also checks the namelen value of a dentry.
Signed-off-by: Gong Chen <gongchen4@huawei.com>
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While traversing dirents in f2fs_fill_dentries(), if bitmap is valid,
filename length should not be zero, otherwise, directory structure
consistency could be corrupted, in this case, let's print related
info and set SBI_NEED_FSCK to trigger fsck for repairing.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For all ordered cases in f2fs_wait_on_page_writeback(), we need to
check PageWriteback status, so let's clean up to relocate the check
into f2fs_wait_on_page_writeback().
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If namelen is corrupted to have very long value, fill_dentries can copy
wrong memory area.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull XArray conversion from Matthew Wilcox:
"The XArray provides an improved interface to the radix tree data
structure, providing locking as part of the API, specifying GFP flags
at allocation time, eliminating preloading, less re-walking the tree,
more efficient iterations and not exposing RCU-protected pointers to
its users.
This patch set
1. Introduces the XArray implementation
2. Converts the pagecache to use it
3. Converts memremap to use it
The page cache is the most complex and important user of the radix
tree, so converting it was most important. Converting the memremap
code removes the only other user of the multiorder code, which allows
us to remove the radix tree code that supported it.
I have 40+ followup patches to convert many other users of the radix
tree over to the XArray, but I'd like to get this part in first. The
other conversions haven't been in linux-next and aren't suitable for
applying yet, but you can see them in the xarray-conv branch if you're
interested"
* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
radix tree: Remove multiorder support
radix tree test: Convert multiorder tests to XArray
radix tree tests: Convert item_delete_rcu to XArray
radix tree tests: Convert item_kill_tree to XArray
radix tree tests: Move item_insert_order
radix tree test suite: Remove multiorder benchmarking
radix tree test suite: Remove __item_insert
memremap: Convert to XArray
xarray: Add range store functionality
xarray: Move multiorder_check to in-kernel tests
xarray: Move multiorder_shrink to kernel tests
xarray: Move multiorder account test in-kernel
radix tree test suite: Convert iteration test to XArray
radix tree test suite: Convert tag_tagged_items to XArray
radix tree: Remove radix_tree_clear_tags
radix tree: Remove radix_tree_maybe_preload_order
radix tree: Remove split/join code
radix tree: Remove radix_tree_update_node_t
page cache: Finish XArray conversion
dax: Convert page fault handlers to XArray
...
We need to drop PG_checked flag on page as well when we clear PG_uptodate
flag, in order to avoid treating the page as GCing one later.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The REQ_TIME should be updated only in case of success cases
as followed at all other places in the file system.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the verbose license text from f2fs files and replace them with
SPDX tags. This does not change the license of any of the code.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When CONFIG_F2FS_FAULT_INJECTION is disabled, we get a warning about an
unused label:
fs/f2fs/segment.c: In function '__submit_discard_cmd':
fs/f2fs/segment.c:1059:1: error: label 'submit' defined but not used [-Werror=unused-label]
This could be fixed by adding another #ifdef around it, but the more
reliable way of doing this seems to be to remove the other #ifdefs
where that is easily possible.
By defining time_to_inject() as a trivial stub, most of the checks for
CONFIG_F2FS_FAULT_INJECTION can go away. This also leads to nicer
formatting of the code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As Ted reported:
"Hi, I was looking at f2fs's sources recently, and I noticed that there
is a very large number of non-static symbols which don't have a f2fs
prefix. There's well over a hundred (see attached below).
As one example, in fs/f2fs/dir.c there is:
unsigned char get_de_type(struct f2fs_dir_entry *de)
This function is clearly only useful for f2fs, but it has a generic
name. This means that if any other file system tries to have the same
symbol name, there will be a symbol conflict and the kernel would not
successfully build. It also means that when someone is looking f2fs
sources, it's not at all obvious whether a function such as
read_data_page(), invalidate_blocks(), is a generic kernel function
found in the fs, mm, or block layers, or a f2fs specific function.
You might want to fix this at some point. Hopefully Kent's bcachefs
isn't similarly using genericly named functions, since that might
cause conflicts with f2fs's functions --- but just as this would be a
problem that we would rightly insist that Kent fix, this is something
that we should have rightly insisted that f2fs should have fixed
before it was integrated into the mainline kernel.
acquire_orphan_inode
add_ino_entry
add_orphan_inode
allocate_data_block
allocate_new_segments
alloc_nid
alloc_nid_done
alloc_nid_failed
available_free_memory
...."
This patch adds "f2fs_" prefix for all non-static symbols in order to:
a) avoid conflict with other kernel generic symbols;
b) to indicate the function is f2fs specific one instead of generic
one;
Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Introduce clear_radix_tree_dirty_tag to include common codes for cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.
[willy@infradead.org: fix nds32, fs/dax.c]
Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch introduces a new mount option `test_dummy_encryption'
to allow fscrypt to create a fake fscrypt context. This is used
by xfstests.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When setting COLD_BIT_SHIFT flag in node block, we only need to call
set_cold_node() in new_node_page() and recover_inode_page() during
node page initialization. So remove unneeded set_cold_node() in other
places.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch merges miscellaneous mount options into struct f2fs_mount_info,
After this patch, once we add new mount option, we don't need to worry
about recovery of it in remount_fs(), since we will recover the
f2fs_sb_info.mount_opt including all options.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit "0a007b97aad6"(f2fs: recover directory operations by fsync)
fixed xfstest generic/342 case, but it also increased the written
data and caused the performance degradation. In most cases, there's
no need to do so heavy fsync actually.
So we introduce new mount option "fsync_mode={posix,strict}" to
control the policy of fsync. "fsync_mode=posix" is set by default,
and means that f2fs uses a light fsync, which follows POSIX semantics.
And "fsync_mode=strict" means that it's a heavy fsync, which behaves
in line with xfs, ext4 and btrfs, where generic/342 will pass, but
the performance will regress.
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous dentry page uses highmem, which will cause panic in platforms
using highmem (such as arm), since the address space of dentry pages
from highmem directly goes into the decryption path via the function
fscrypt_fname_disk_to_usr. But sg_init_one assumes the address is not
from highmem, and then cause panic since it doesn't call kmap_high but
kunmap_high is triggered at the end. To fix this problem in a simple
way, this patch avoids to put dentry page in pagecache into highmem.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This fixes generic/342 which doesn't recover renamed file which was fsynced
before. It will be done via another fsync on newly created file.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch introduces a sysfs interface readdir_ra to enable/disable
readaheading inode block in f2fs_readdir. When readdir_ra is enabled,
it improves the performance of "readdir + stat".
For 300,000 files:
time find /data/test > /dev/null
disable readdir_ra: 1m25.69s real 0m01.94s user 0m50.80s system
enable readdir_ra: 0m18.55s real 0m00.44s user 0m15.39s system
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch fixes to update ctx->pos correctly when hitting hole in
directory.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previously, for large directory, we just do readahead only once in
readdir(), readdir()'s performance may drop when traversing latter
blocks. In order to avoid this, relocate readahead codes to covering
all traverse flow.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch follows ext4 to allow readdir() in large empty directory to
be interrupted. Referenced commit of ext4: 1f60fbe727 ("ext4: allow
readdir()'s of large empty directories to be interrupted").
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
On a senario like writing out the first dirty page of the inode
as the inline data, we only cleared dirty flags of the pages, but
didn't clear the dirty tags of those pages in the radix tree.
If we don't clear the dirty tags of the pages in the radix tree, the
inodes which contain the pages will be marked with I_DIRTY_PAGES again
and again, and writepages() for the inodes will be invoked in every
writeback period. As a result, nothing will be done in every
writepages() for the inodes and it will just consume CPU time
meaninglessly.
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since directories will be written back with checkpoint and fsync a
directory will always write CP, there is no need to set LOST_PINO
after creating a directory.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've focused on enhancing performance with regards to block
allocation, GC, and discard/in-place-update IO controls. There are a bunch
of clean-ups as well as minor bug fixes.
= Enhancement
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
= Bug fix
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJZEKXrAAoJEEAUqH6CSFDSJJ8P/1Zy0NS9TM/PFtT7Sevb6vgC
LcKLtX1bVhUuX9wAt5Q6BZ9927tCQPt5vLEYUxtniqEQaC0fsJAMbRYot+gR/dvN
4bGgv1TeVST5pKbmctzhAL30PvZ1w4QS6dLvPMm2sPQSrPKGUGt0J8wPiHHZuvH4
pygKzDxbrIJTeMhLm9tgFg7dWTJXV3VDb57WpA1AM1LAFVsIPF4vZnryLv3GsRmY
eGRxgZEtt/90hCRbEcPirPZrtpv/O5f12K4Vp/NPw+4XGMEk+nTYndq6rlUWVNjg
iPEDuxONyk/yb274SqB6sbNDuxHOqn7stGJepdUpSbprIsLZ0RmMaYWjSNsLU3Vh
p4fAzRqvfSqAHCt0FEL/vT8M9ST5xQRVr9P/l0kDK5Ww95RROd05bEaGm/sKc7NB
PHiWUoMIFFmuVsoCi6sM0AKps53ZGON8GEUyVKyM7NWTw1oWLPWifGMthEkysmwm
08SdU5+XqbCeyMPAA2GURqMA5A8ssuA8+F0Citf4JPckQHPPj5pAydmx2wVlfBlc
/bneR7T/8OsUbxgG8JSbdHUiPcjb20F0GTxSOTXiV/AaZAMCtyETnw64K2V6E0n7
uraKcYYhypyphCj/IYc4vnQ3dCu3U2/NvTYEVX8DBvboN38/JVqmNWgQx9g+tLzj
+r5s7PqTDuXv5Cfzc5NC
=SBUb
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've focused on enhancing performance with regards to
block allocation, GC, and discard/in-place-update IO controls. There
are a bunch of clean-ups as well as minor bug fixes.
Enhancements:
- disable heap-based allocation by default
- issue small-sized discard commands by default
- change the policy of data hotness for logging
- distinguish IOs in terms of size and wbc type
- start SSR earlier to avoid foreground GC
- enhance data structures managing discard commands
- enhance in-place update flow
- add some more fault injection routines
- secure one more xattr entry
Bug fixes:
- calculate victim cost for GC correctly
- remain correct victim segment number for GC
- race condition in nid allocator and initializer
- stale pointer produced by atomic_writes
- fix missing REQ_SYNC for flush commands
- handle missing errors in more corner cases"
* tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
f2fs: fix a mount fail for wrong next_scan_nid
f2fs: enhance scalability of trace macro
f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
f2fs: Make flush bios explicitely sync
f2fs: show available_nids in f2fs/status
f2fs: flush dirty nats periodically
f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
f2fs: allow cpc->reason to indicate more than one reason
f2fs: release cp and dnode lock before IPU
f2fs: shrink size of struct discard_cmd
f2fs: don't hold cmd_lock during waiting discard command
f2fs: nullify fio->encrypted_page for each writes
f2fs: sanity check segment count
f2fs: introduce valid_ipu_blkaddr to clean up
f2fs: lookup extent cache first under IPU scenario
f2fs: reconstruct code to write a data page
f2fs: introduce __wait_discard_cmd
f2fs: introduce __issue_discard_cmd
f2fs: enable small discard by default
f2fs: delay awaking discard thread
...
Switch f2fs directory searches to use the fscrypt_match_name() helper
function. There should be no functional change.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When accessing an encrypted directory without the key, userspace must
operate on filenames derived from the ciphertext names, which contain
arbitrary bytes. Since we must support filenames as long as NAME_MAX,
we can't always just base64-encode the ciphertext, since that may make
it too long. Currently, this is solved by presenting long names in an
abbreviated form containing any needed filesystem-specific hashes (e.g.
to identify a directory block), then the last 16 bytes of ciphertext.
This needs to be sufficient to identify the actual name on lookup.
However, there is a bug. It seems to have been assumed that due to the
use of a CBC (ciphertext block chaining)-based encryption mode, the last
16 bytes (i.e. the AES block size) of ciphertext would depend on the
full plaintext, preventing collisions. However, we actually use CBC
with ciphertext stealing (CTS), which handles the last two blocks
specially, causing them to appear "flipped". Thus, it's actually the
second-to-last block which depends on the full plaintext.
This caused long filenames that differ only near the end of their
plaintexts to, when observed without the key, point to the wrong inode
and be undeletable. For example, with ext4:
# echo pass | e4crypt add_key -p 16 edir/
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir/ -type f | xargs stat -c %i | sort | uniq | wc -l
2004
# rm -rf edir/
rm: cannot remove 'edir/_A7nNFi3rhkEQlJ6P,hdzluhODKOeWx5V': Structure needs cleaning
...
To fix this, when presenting long encrypted filenames, encode the
second-to-last block of ciphertext rather than the last 16 bytes.
Although it would be nice to solve this without depending on a specific
encryption mode, that would mean doing a cryptographic hash like SHA-256
which would be much less efficient. This way is sufficient for now, and
it's still compatible with encryption modes like HEH which are strong
pseudorandom permutations. Also, changing the presented names is still
allowed at any time because they are only provided to allow applications
to do things like delete encrypted directories. They're not designed to
be used to persistently identify files --- which would be hard to do
anyway, given that they're encrypted after all.
For ease of backports, this patch only makes the minimal fix to both
ext4 and f2fs. It leaves ubifs as-is, since ubifs doesn't compare the
ciphertext block yet. Follow-on patches will clean things up properly
and make the filesystems use a shared helper function.
Fixes: 5de0b4d0cd ("ext4 crypto: simplify and speed up filename encryption")
Reported-by: Gwendal Grignou <gwendal@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If user has no key under an encrypted dir, fscrypt gives digested dentries.
Previously, when looking up a dentry, f2fs only checks its hash value with
first 4 bytes of the digested dentry, which didn't handle hash collisions fully.
This patch enhances to check entire dentry bytes likewise ext4.
Eric reported how to reproduce this issue by:
# seq -f "edir/abcdefghijklmnopqrstuvwxyz012345%.0f" 100000 | xargs touch
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
100000
# sync
# echo 3 > /proc/sys/vm/drop_caches
# keyctl new_session
# find edir -type f | xargs stat -c %i | sort | uniq | wc -l
99999
Cc: <stable@vger.kernel.org>
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
(fixed f2fs_dentry_hash() to work even when the hash is 0)
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having
same name) does not cover the scenario where inline dentry is enabled.
In that case, F2FS_I(dir)->task will be NULL, and __f2fs_add_link will
lookup dentries one more time.
This patch fixes it by moving the assigment of current task to a upper
level to cover both normal and inline dentry.
Cc: <stable@vger.kernel.org>
Fixes: 88c5c13a50 (f2fs: fix multiple f2fs_add_link() calls having same name)
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since callers statically know which type to use, make_dentry_ptr()
can simply be splitted into two inline functions. This way, the code
has less inlined, fewer arguments, and no cast.
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As discuss with Jaegeuk and Chao,
"Once checkpoint is done, f2fs doesn't need to update there-in filename at all."
The disk-level filename is used only one case,
1. create a file A under a dir
2. sync A
3. godown
4. umount
5. mount (roll_forward)
Only the rename/cross_rename changes the filename, if it happens,
a. between step 1 and 2, the sync A will caused checkpoint, so that,
the roll_forward at step 5 never happens.
b. after step 2, the roll_forward happens, file A will roll forward
to the result as after step 1.
So that, any updating the disk filename is useless, just cleanup it.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This round introduces several interesting features such as on-disk NAT bitmaps,
IO alignment, and a discard thread. And it includes a couple of major bug fixes
as below.
== Enhancement ==
- introduce on-disk bitmaps to avoid scanning NAT blocks when getting free nids
- support IO alignment to prepare open-channel SSD integration in future
- introduce a discard thread to avoid long latency during checkpoint and fstrim
- use SSR for warm node and enable inline_xattr by default
- introduce in-memory bitmaps to check FS consistency for debugging
- improve write_begin by avoiding needless read IO
== Bug fix ==
- fix broken zone_reset behavior for SMR drive
- fix wrong victim selection policy during GC
- fix missing behavior when preparing discard commands
- fix bugs in atomic write support and fiemap
- workaround to handle multiple f2fs_add_link calls having same name
And it includes a bunch of clean-up patches as well.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJYtmdOAAoJEEAUqH6CSFDSs0UP/AzngT37xVIhVBD13J9oHIuv
rFA/eHVGRJmU1xc4SG1bghKm45xq8rwUX7irarfvLLc5aL+6VPGSdaRBykUr4A5N
MN/bgK//EPp7If8EF+8PpY+9x7g67i0mtz5iD8dDrK+bUKV/IDKV1LWw5pR3g/g6
RwMH0dUVOiD/HJ5iFp1ykTdVPe4vFY013uVmyPxUq+nCBlqlQm1nOvrGjF/HeYyX
kqcD2LEc79GPfS5ebQIKfCfLE0rsWVnnS6YaqlDNCD5/oRim71CUtA4MPTYv29vp
R/SebWlayEm+u68+uQUu6AyIk/1IdP0+AtRuQd/VxuteoyXmkTMHER662DqN4F8J
npPdNrbNdlzwuAP77avy+hplqbD19yUa7o7Fl1No5rfheT3CiNTSj2uoriyEAffH
1AM6tES7S7n5ttrXOr9iOxrK0u/vuaf7fbKVtK+RI09hwzdvyGB5HUdQB0iP/XR+
obw8dru79ISMVZ9YuDhSfjI5ohAcfthfuqgjUt2RAfDv19IRsg5eayAp3T6nUfEX
AGQbV/52dkO9svZztMbcBW95zmqkE0cMeX66KIMCPXNuDiE474t8k115K6kHpFwP
e4Kx+mTSNhR1LEAaVdmCjbLb0gVrumVHTdjaZopnxTFmE70u/M6h1vY90m1LkReF
ZDK5mhfMmGzU4wkvbgP8
=tw8c
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This round introduces several interesting features such as on-disk NAT
bitmaps, IO alignment, and a discard thread. And it includes a couple
of major bug fixes as below.
Enhancements:
- introduce on-disk bitmaps to avoid scanning NAT blocks when getting
free nids
- support IO alignment to prepare open-channel SSD integration in
future
- introduce a discard thread to avoid long latency during checkpoint
and fstrim
- use SSR for warm node and enable inline_xattr by default
- introduce in-memory bitmaps to check FS consistency for debugging
- improve write_begin by avoiding needless read IO
Bug fixes:
- fix broken zone_reset behavior for SMR drive
- fix wrong victim selection policy during GC
- fix missing behavior when preparing discard commands
- fix bugs in atomic write support and fiemap
- workaround to handle multiple f2fs_add_link calls having same name
... and it includes a bunch of clean-up patches as well"
* tag 'for-f2fs-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (97 commits)
f2fs: avoid to flush nat journal entries
f2fs: avoid to issue redundant discard commands
f2fs: fix a plint compile warning
f2fs: add f2fs_drop_inode tracepoint
f2fs: Fix zoned block device support
f2fs: remove redundant set_page_dirty()
f2fs: fix to enlarge size of write_io_dummy mempool
f2fs: fix memory leak of write_io_dummy mempool during umount
f2fs: fix to update F2FS_{CP_}WB_DATA count correctly
f2fs: use MAX_FREE_NIDS for the free nids target
f2fs: introduce free nid bitmap
f2fs: new helper cur_cp_crc() getting crc in f2fs_checkpoint
f2fs: update the comment of default nr_pages to skipping
f2fs: drop the duplicate pval in f2fs_getxattr
f2fs: Don't update the xattr data that same as the exist
f2fs: kill __is_extent_same
f2fs: avoid bggc->fggc when enough free segments are avaliable after cp
f2fs: select target segment with closer temperature in SSR mode
f2fs: show simple call stack in fault injection message
f2fs: no need lock_op in f2fs_write_inline_data
...
Previously kernel message can show that in which function we do the
injection, but unfortunately, most of the caller are the same, for
tracking more information of injection path, it needs to show upper
caller's name. This patch supports that ability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
It turns out a stakable filesystem like sdcardfs in AOSP can trigger multiple
vfs_create() to lower filesystem. In that case, f2fs will add multiple dentries
having same name which breaks filesystem consistency.
Until upper layer fixes, let's work around by f2fs, which shows actually not
much performance regression.
Cc: <stable@vger.kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As part of an effort to clean up fscrypt-related error codes, make
attempting to create a file in an encrypted directory that hasn't been
"unlocked" fail with ENOKEY. Previously, several error codes were used
for this case, including ENOENT, EACCES, and EPERM, and they were not
consistent between and within filesystems. ENOKEY is a better choice
because it expresses that the failure is due to lacking the encryption
key. It also matches the error code returned when trying to open an
encrypted regular file without the key.
I am not aware of any users who might be relying on the previous
inconsistent error codes, which were never documented anywhere.
This failure case will be exercised by an xfstest.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Report error of f2fs_fill_dentries to ->iterate_shared, otherwise when
error ocurrs, user may just list part of dirents in target directory
without any hints.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is to avoid no free segment bug during checkpoint caused by a number of
dirty inodes.
The case was reported by Chao like this.
1. mount with lazytime option
2. fill 4k file until disk is full
3. sync filesystem
4. read all files in the image
5. umount
In this case, we actually don't need to flush dirty inode to inode page during
checkpoint.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs contained a number of endianness conversion bugs.
Also, one function should have been 'static'.
Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/'
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJX9pA6AAoJEPL5WVaVDYGj7fwH/0YcdQWBg0O5d7iXFnTcimh9
fiYkqKniBWQhgBAOFPMoNPRIW4tyeQmTtu8Rywx2Hr+v4lzJvuOaT18NDANdq/pp
u5eDrnJ4R+uqPJlgxVOzopLVJ6I2glgSSRdvAKYxwTYcv8F88ObzVfsJ4M415gPq
cbEKF+JT3l5hTGENR5sqmYvHYaNfOFkOqt4gulPtgk1eshy+BH/05M+qBSeA5a6k
srdon0pFRoUV68m+T4G8FqOZxdybeT5Yx6X0GJf0eQJoX7IaiQTPcDrXzlrbDBbN
rrzbpwsDeDKtgSOckbarCBroZKdToHFekfnOJ7IPWYq8IwYTSnZKFCWIRKO6z38=
=IvhS
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of bug fixes and cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
ext4: remove unused variable
ext4: use journal inode to determine journal overhead
ext4: create function to read journal inode
ext4: unmap metadata when zeroing blocks
ext4: remove plugging from ext4_file_write_iter()
ext4: allow unlocked direct IO when pages are cached
ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
fscrypto: use standard macros to compute length of fname ciphertext
ext4: do not unnecessarily null-terminate encrypted symlink data
ext4: release bh in make_indexed_dir
ext4: Allow parallel DIO reads
ext4: allow DAX writeback for hole punch
jbd2: fix lockdep annotation in add_transaction_credits()
blockgroup_lock.h: simplify definition of NR_BG_LOCKS
blockgroup_lock.h: remove debris from bgl_lock_ptr() conversion
fscrypto: make filename crypto functions return 0 on success
fscrypto: rename completion callbacks to reflect usage
fscrypto: remove unnecessary includes
fscrypto: improved validation when loading inode encryption metadata
ext4: fix memory leak when symlink decryption fails
...
Previously, we only support global fault injection configuration, so that
when we configure type/rate of fault injection through sysfs, mount
option, it will influence all f2fs partition which is being used.
It is not make sence, since it will be not convenient if developer want
to test separated partitions with different fault injection rate/type
simultaneously, also it's not possible to enable fault injection in one
partition and disable fault injection in other one.
>From now on, we move global configuration of fault injection in module
into per-superblock, hence injection testing can be more flexible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.
CURRENT_TIME is also not y2038 safe.
This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.
Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There is no more user of f2fs_filetype_table outside of dir.c, make it
static.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Several filename crypto functions: fname_decrypt(),
fscrypt_fname_disk_to_usr(), and fscrypt_fname_usr_to_disk(), returned
the output length on success or -errno on failure. However, the output
length was redundant with the value written to 'oname->len'. It is also
potentially error-prone to make callers have to check for '< 0' instead
of '!= 0'.
Therefore, make these functions return 0 instead of a length, and make
the callers who cared about the return value being a length use
'oname->len' instead. For consistency also make other callers check for
a nonzero result rather than a negative result.
This change also fixes the inconsistency of fname_encrypt() actually
already returning 0 on success, not a length like the other filename
crypto functions and as documented in its function comment.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add roll-forward recovery process for encrypted dentry, so the first fsync
issued to an encrypted file does not need writing checkpoint.
This improves the performance of the following test at thousands of small
files: open -> write -> fsync -> close
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify kernel message to show encrypted names]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When decrypting dirents in ->readdir, fscrypt_fname_disk_to_usr won't
change content of original encrypted dirent, we don't need to allocate
additional buffer for storing mirror of it, so get rid of it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When creating new inode, security_inode_init_security will be called for
initializing security info related to the inode, and filename is passed to
security module, it helps security module such as SElinux to know which
rule or label could be applied for the inode with specified name.
Previously, if new inode is created as an encrypted one, f2fs will transfer
encrypted filename to security module which may fail the check of security
policy belong to the inode. So in order to this issue, alter to transfer
original unencrypted filename instead.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the following scenario,
1) we don't have the key and doing a lookup for encrypted file,
2) and the encrypted filename is big name
we should use fname->hash as name hash value instead of what is
calculated by fname->disk_name. Because in such case,
fname->disk_name is empty.
Signed-off-by: Shuoran Liu <liushuoran@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull qstr constification updates from Al Viro:
"Fairly self-contained bunch - surprising lot of places passes struct
qstr * as an argument when const struct qstr * would suffice; it
complicates analysis for no good reason.
I'd prefer to feed that separately from the assorted fixes (those are
in #for-linus and with somewhat trickier topology)"
* 'work.const-qstr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
qstr: constify instances in adfs
qstr: constify instances in lustre
qstr: constify instances in f2fs
qstr: constify instances in ext2
qstr: constify instances in vfat
qstr: constify instances in procfs
qstr: constify instances in fuse
qstr constify instances in fs/dcache.c
qstr: constify instances in nfs
qstr: constify instances in ocfs2
qstr: constify instances in autofs4
qstr: constify instances in hfs
qstr: constify instances in hfsplus
qstr: constify instances in logfs
qstr: constify dentry_init_security
This patch fixes to report the right error number of f2fs_find_entry to
its caller.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>