NEW_ADDR blocks are purely in-memory preallocated blocks, and thus
equivalent to what the core FS code calls delayed allocations, and not
unwritten extents which do have on-disk blocks allocated from which
reads always return zeroes until they are converted to written status.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When testing with a mixed zoned / convention device combination, there
are regular but not 100% reproducible failures in xfstests generic/113
where the __is_valid_data_blkaddr assert hits due to finding a hole.
This seems to be because f2fs_map_blocks can set this flag on a hole
when it was found in the extent cache.
Rework f2fs_iomap_begin to just check the special block numbers directly.
This has the added benefits of the WARN_ON showing which invalid block
address we found, and being properly error out on delalloc blocks that
are confusingly called unwritten but not actually suitable for direct
I/O.
Fixes: 1517c1a7a4 ("f2fs: implement iomap operations")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
b763f3bedc ("f2fs: restructure f2fs page.private layout") missed
to call clear_page_private_reference() in .{release,invalid}_folio,
fix it, though it's not a big deal since folio_detach_private() was
called to clear all privae info and reference count in the page.
BTW, remove page_private_reference() definition as it never be used.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've added two features: 1) F2FS_IOC_START_ATOMIC_REPLACE and
2) per-block age-based extent cache. 1) is a variant of the previous atomic
write feature which guarantees a per-file atomicity. It would be more efficient
than AtomicFile implementation in Android framework. 2) implements another type
of extent cache in memory which keeps the per-block age in a file, so that block
allocator could split the hot and cold data blocks more accurately.
Enhancement:
- introduce F2FS_IOC_START_ATOMIC_REPLACE
- refactor extent_cache to add a new per-block-age-based extent cache support
- introduce discard_urgent_util, gc_mode, max_ordered_discard sysfs knobs
- add proc entry to show discard_plist info
- optimize iteration over sparse directories
- add barrier mount option
Bug fix
- avoid victim selection from previous victim section
- fix to enable compress for newly created file if extension matches
- set zstd compress level correctly
- initialize locks early in f2fs_fill_super() to fix bugs reported by syzbot
- correct i_size change for atomic writes
- allow to read node block after shutdown
- allow to set compression for inlined file
- fix gc mode when gc_urgent_high_remaining is 1
- should put a page when checking the summary info
Minor fixes and various clean-ups in GC, discard, debugfs, sysfs, and doc.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmOaTNUACgkQQBSofoJI
UNIQnw//V7Q8DUHw5YNj04jutwXH2DNMLAmn/NJh5S6dIzy/LiywlSzVg53/0/FP
4K577urUkIhgilRO+yncUMSnSQk7BluQvGSx4ja2AV+dpDomjxM3GwIacGzSvr7D
VfVf8Vig10UEFrrtEEKtv1VFlYHAmo8lLpubzrZHV8aZFLHHYO2fakQhPu8BYsaz
eGCJwxjvTZcQUPkaeG9tWto3ChI3F6PzreiQ5TztHhLWSEgw/o0qijpsc+2SthaV
my7uGjeBY8EGPeSYbeCxRtdx8g8Qu11K3ISuDj8zBybmjG3IWOGt1CVcrY6tZbal
aL70CMtHkMqMn03VqbpCTqBtdWNMrrw5sYSL3qXIUdXlX/2yJBh9fLAeNxKNs5Nu
6veSb2WgYMHqIsClkAAcP0xJ8g6kodGoG60wVr4ek0Vdt4osaQqwq+bnffpwwxtQ
F+7aRuinv+rdrHJ4CuFXAmHPKh2lBe2lTTWZEKg2RptTxZ5DhD2Qn6x1khPD2GFA
mG2Aeiq6PVxxEeIO+w/VBCuAgpGTFV2N/ZIF8VfjFNdWiN5OGLWQNHC2KGj2G2uV
+fA+B91txQWtjY9h72YJb2+aGIixcnLY24ni4mDgDItqtpCB4PW56W8cbnbv9Pl+
aXAWdADqJdDyllHoVB/JQ24gr2fATJGRIDeYDnw+vPP4f5ZT5vg=
=f00t
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added two features: F2FS_IOC_START_ATOMIC_REPLACE
and a per-block age-based extent cache.
F2FS_IOC_START_ATOMIC_REPLACE is a variant of the previous atomic
write feature which guarantees a per-file atomicity. It would be more
efficient than AtomicFile implementation in Android framework.
The per-block age-based extent cache implements another type of extent
cache in memory which keeps the per-block age in a file, so that block
allocator could split the hot and cold data blocks more accurately.
Enhancements:
- introduce F2FS_IOC_START_ATOMIC_REPLACE
- refactor extent_cache to add a new per-block-age-based extent cache support
- introduce discard_urgent_util, gc_mode, max_ordered_discard sysfs knobs
- add proc entry to show discard_plist info
- optimize iteration over sparse directories
- add barrier mount option
Bug fixes:
- avoid victim selection from previous victim section
- fix to enable compress for newly created file if extension matches
- set zstd compress level correctly
- initialize locks early in f2fs_fill_super() to fix bugs reported by syzbot
- correct i_size change for atomic writes
- allow to read node block after shutdown
- allow to set compression for inlined file
- fix gc mode when gc_urgent_high_remaining is 1
- should put a page when checking the summary info
Minor fixes and various clean-ups in GC, discard, debugfs, sysfs, and
doc"
* tag 'f2fs-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (63 commits)
f2fs: reset wait_ms to default if any of the victims have been selected
f2fs: fix some format WARNING in debug.c and sysfs.c
f2fs: don't call f2fs_issue_discard_timeout() when discard_cmd_cnt is 0 in f2fs_put_super()
f2fs: fix iostat parameter for discard
f2fs: Fix spelling mistake in label: free_bio_enrty_cache -> free_bio_entry_cache
f2fs: add block_age-based extent cache
f2fs: allocate the extent_cache by default
f2fs: refactor extent_cache to support for read and more
f2fs: remove unnecessary __init_extent_tree
f2fs: move internal functions into extent_cache.c
f2fs: specify extent cache for read explicitly
f2fs: introduce f2fs_is_readonly() for readability
f2fs: remove F2FS_SET_FEATURE() and F2FS_CLEAR_FEATURE() macro
f2fs: do some cleanup for f2fs module init
MAINTAINERS: Add f2fs bug tracker link
f2fs: remove the unused flush argument to change_curseg
f2fs: open code allocate_segment_by_default
f2fs: remove struct segment_allocation default_salloc_ops
f2fs: introduce discard_urgent_util sysfs node
f2fs: define MIN_DISCARD_GRANULARITY macro
...
As a step towards freeing the PG_error flag for other uses, change ext4
and f2fs to stop using PG_error to track verity errors. Instead, if a
verity error occurs, just mark the whole bio as failed. The coarser
granularity isn't really a problem since it isn't any worse than what
the block layer provides, and errors from a multi-page readahead aren't
reported to applications unless a single-page read fails too.
f2fs supports compression, which makes the f2fs changes a bit more
complicated than desired, but the basic premise still works.
Note: there are still a few uses of PageError in f2fs, but they are on
the write path, so they are unrelated and this patch doesn't touch them.
Reviewed-by: Chao Yu <chao@kernel.org>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221129070401.156114-1-ebiggers@kernel.org
introduce a new ioctl to replace the whole content of a file atomically,
which means it induces truncate and content update at the same time.
We can start it with F2FS_IOC_START_ATOMIC_REPLACE and complete it with
F2FS_IOC_COMMIT_ATOMIC_WRITE. Or abort it with
F2FS_IOC_ABORT_ATOMIC_WRITE.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Wei Chen reports a kernel bug as blew:
INFO: task syz-executor.0:29056 blocked for more than 143 seconds.
Not tainted 5.15.0-rc5 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor.0 state:D stack:14632 pid:29056 ppid: 6574 flags:0x00000004
Call Trace:
__schedule+0x4a1/0x1720
schedule+0x36/0xe0
rwsem_down_write_slowpath+0x322/0x7a0
fscrypt_ioctl_set_policy+0x11f/0x2a0
__f2fs_ioctl+0x1a9f/0x5780
f2fs_ioctl+0x89/0x3a0
__x64_sys_ioctl+0xe8/0x140
do_syscall_64+0x34/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Eric did some investigation on this issue, quoted from reply of Eric:
"Well, the quality of this bug report has a lot to be desired (not on
upstream kernel, reproducer is full of totally irrelevant stuff, not
sent to the mailing list of the filesystem whose disk image is being
fuzzed, etc.). But what is going on is that f2fs_empty_dir() doesn't
consider the case of a directory with an extremely large i_size on a
malicious disk image.
Specifically, the reproducer mounts an f2fs image with a directory
that has an i_size of 14814520042850357248, then calls
FS_IOC_SET_ENCRYPTION_POLICY on it.
That results in a call to f2fs_empty_dir() to check whether the
directory is empty. f2fs_empty_dir() then iterates through all
3616826182336513 blocks the directory allegedly contains to check
whether any contain anything. i_rwsem is held during this, so
anything else that tries to take it will hang."
In order to solve this issue, let's use f2fs_get_next_page_offset()
to speed up iteration by skipping holes for all below functions:
- f2fs_empty_dir
- f2fs_readdir
- find_in_level
The way why we can speed up iteration was described in
'commit 3cf4574705 ("f2fs: introduce get_next_page_offset to speed
up SEEK_DATA")'.
Meanwhile, in f2fs_empty_dir(), let's use f2fs_find_data_page()
instead f2fs_get_lock_data_page(), due to i_rwsem was held in
caller of f2fs_empty_dir(), there shouldn't be any races, so it's
fine to not lock dentry page during lookuping dirents in the page.
Link: https://lore.kernel.org/lkml/536944df-a0ae-1dd8-148f-510b476e1347@kernel.org/T/
Reported-by: Wei Chen <harperchen1110@gmail.com>
Cc: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This round looks fairly small comparing to the previous updates which includes
mostly minor bug fixes. Nevertheless, as we've still interested in improving
the stability, Chao added some debugging methods to diagnoze subtle runtime
inconsistency problem.
Enhancement
- store all the corruption or failure reasons in superblock
- detect meta inode, summary info, and block address inconsistency
- increase the limit for reserve_root for low-end devices
- add the number of compressed IO in iostat
Bug fix
- DIO write fix for zoned devices
- do out-of-place writes for cold files
- fix some stat updates (FS_CP_DATA_IO, dirty page count)
- fix race condition on setting FI_NO_EXTENT flag
- fix data races when freezing super
- fix wrong continue condition check in GC
- do not allow ATGC for LFS mode
In addition, there're some code enhancement and clean-ups as usual.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmNEVIkACgkQQBSofoJI
UNL/Qg//eu7k196yIKflDZmp5aJbb5ybpFmh7XkPiqAV17ns+R2uLGq68BvTs+Tg
rqCjB7j2kkBh1kN32R7aGcx6tcbHjWc94pi59YTGQ6+pwkop3KJxFHSwAaUw6y34
8NZwmsnrm9rv0A0QPhQPK19yWmG/2smUE9b/u7M3+20I1WANaxdS/vOKbZz/amOu
f/BvsIIGS7Zzm9OpBCvGmq9Qpd83jlH6PuYGTC/OVbCrUiAJEmwN8wGsKP/9qB/5
KxVpdlh3vxulS6ixNbMu2qw9GBAQpAOz50+eDL5ZtGvGIQNHZRpGlfpJoW1lz0EO
4fJtpf5OMGqUbNaPCTG4qQGYAtKWA9YnFeWSS7RViQ6MryRXZMK8ka5eIe5Qblcf
AXD/eU2gKzOu0fuvdBRCt/wTSb4gY8sMNhe4psDsZxfhaYIpX8Ee/XVa4d+Z4frg
irN9gid1k3laMTx9dwJL8m7gIFvy3pak6l3B0bA69fAXd3faI40enuyfubFxnDet
OuRNxj8j3J5C140ag5KOuBCRub2/aPaj9YSQqUstf64d8FzN/Ypn5iVPTs2DP/3D
bcAFBwCS2+MCsk9+ra0WldZ5awdd6CRHDkvaYeDEuLCaLHUCo6CXe3aIyWawJBvJ
RnghKNv82RIV+rQlI1/sg8lseoDnEZTp5iwDGw/qZ+ZUyn05apM=
=aZ9y
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This round looks fairly small comparing to the previous updates and
includes mostly minor bug fixes. Nevertheless, as we've still
interested in improving the stability, Chao added some debugging
methods to diagnoze subtle runtime inconsistency problem.
Enhancements:
- store all the corruption or failure reasons in superblock
- detect meta inode, summary info, and block address inconsistency
- increase the limit for reserve_root for low-end devices
- add the number of compressed IO in iostat
Bug fixes:
- DIO write fix for zoned devices
- do out-of-place writes for cold files
- fix some stat updates (FS_CP_DATA_IO, dirty page count)
- fix race condition on setting FI_NO_EXTENT flag
- fix data races when freezing super
- fix wrong continue condition check in GC
- do not allow ATGC for LFS mode
In addition, there're some code enhancement and clean-ups as usual"
* tag 'f2fs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
f2fs: change to use atomic_t type form sbi.atomic_files
f2fs: account swapfile inodes
f2fs: allow direct read for zoned device
f2fs: support recording errors into superblock
f2fs: support recording stop_checkpoint reason into super_block
f2fs: remove the unnecessary check in f2fs_xattr_fiemap
f2fs: introduce cp_status sysfs entry
f2fs: fix to detect corrupted meta ino
f2fs: fix to account FS_CP_DATA_IO correctly
f2fs: code clean and fix a type error
f2fs: add "c_len" into trace_f2fs_update_extent_tree_range for compressed file
f2fs: fix to do sanity check on summary info
f2fs: port to vfs{g,u}id_t and associated helpers
f2fs: fix to do sanity check on destination blkaddr during recovery
f2fs: let FI_OPU_WRITE override FADVISE_COLD_BIT
f2fs: fix race condition on setting FI_NO_EXTENT flag
f2fs: remove redundant check in f2fs_sanity_check_cluster
f2fs: add static init_idisk_time function to reduce the code
f2fs: fix typo
f2fs: fix wrong dirty page count when race between mmap and fallocate.
...
This patch supports to record detail reason of FSCORRUPTED error into
f2fs_super_block.s_errors[].
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to record stop_checkpoint error into
f2fs_super_block.s_stop_reason[].
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Whehter or not error occurs, checking "err == 1" is unnecessary
in f2fs_xattr_fiemap(), and just remove it here.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_inode_info.cp_task was introduced for FS_CP_DATA_IO accounting
since commit b0af6d491a ("f2fs: add app/fs io stat").
However, cp_task usage coverage has been increased due to below
commits:
commit 040d2bb318 ("f2fs: fix to avoid deadloop if data_flush is on")
commit 186857c5a1 ("f2fs: fix potential recursive call when enabling data_flush")
So that, if data_flush mountoption is on, when data flush was
triggered from background, the IO from data flush will be accounted
as checkpoint IO type incorrectly.
In order to fix this issue, this patch splits cp_task into two:
a) cp_task: used for IO accounting
b) wb_task: used to avoid deadlock
Fixes: 040d2bb318 ("f2fs: fix to avoid deadloop if data_flush is on")
Fixes: 186857c5a1 ("f2fs: fix potential recursive call when enabling data_flush")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
ERROR: code indent should use tabs where possible
ERROR: spaces required around that ':'
ERROR: incorrect tab
Found serveral code type errors when review the code and fix it.
There is no function change.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Cold files may be fragmented due to SSR, defragment is needed as
sequential reads are dominant scenarios of these files. FI_OPU_WRITE
should override FADVISE_COLD_BIT to avoid defragment fails.
Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This is a BUG_ON issue as follows when running xfstest-generic-503:
WARNING: CPU: 21 PID: 1385 at fs/f2fs/inode.c:762 f2fs_evict_inode+0x847/0xaa0
Modules linked in:
CPU: 21 PID: 1385 Comm: umount Not tainted 5.19.0-rc5+ #73
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
Call Trace:
evict+0x129/0x2d0
dispose_list+0x4f/0xb0
evict_inodes+0x204/0x230
generic_shutdown_super+0x5b/0x1e0
kill_block_super+0x29/0x80
kill_f2fs_super+0xe6/0x140
deactivate_locked_super+0x44/0xc0
deactivate_super+0x79/0x90
cleanup_mnt+0x114/0x1a0
__cleanup_mnt+0x16/0x20
task_work_run+0x98/0x100
exit_to_user_mode_prepare+0x3d0/0x3e0
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x42/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
Function flow analysis when BUG occurs:
f2fs_fallocate mmap
do_page_fault
pte_spinlock // ---lock_pte
do_wp_page
wp_page_shared
pte_unmap_unlock // unlock_pte
do_page_mkwrite
f2fs_vm_page_mkwrite
down_read(invalidate_lock)
lock_page
if (PageMappedToDisk(page))
goto out;
// set_page_dirty --NOT RUN
out: up_read(invalidate_lock);
finish_mkwrite_fault // unlock_pte
f2fs_collapse_range
down_write(i_mmap_sem)
truncate_pagecache
unmap_mapping_pages
i_mmap_lock_write // down_write(i_mmap_rwsem)
......
zap_pte_range
pte_offset_map_lock // ---lock_pte
set_page_dirty
f2fs_dirty_data_folio
if (!folio_test_dirty(folio)) {
fault_dirty_shared_page
set_page_dirty
f2fs_dirty_data_folio
if (!folio_test_dirty(folio)) {
filemap_dirty_folio
f2fs_update_dirty_folio // ++
}
unlock_page
filemap_dirty_folio
f2fs_update_dirty_folio // page count++
}
pte_unmap_unlock // --unlock_pte
i_mmap_unlock_write // up_write(i_mmap_rwsem)
truncate_inode_pages
up_write(i_mmap_sem)
When race happens between mmap-do_page_fault-wp_page_shared and
fallocate-truncate_pagecache-zap_pte_range, the zap_pte_range calls
function set_page_dirty without page lock. Besides, though
truncate_pagecache has immap and pte lock, wp_page_shared calls
fault_dirty_shared_page without any. In this case, two threads race
in f2fs_dirty_data_folio function. Page is set to dirty only ONCE,
but the count is added TWICE by calling filemap_dirty_folio.
Thus the count of dirty page cannot accord with the real dirty pages.
Following is the solution to in case of race happens without any lock.
Since folio_test_set_dirty in filemap_dirty_folio is atomic, judge return
value will not be at risk of race.
Signed-off-by: Shuqi Zhang <zhangshuqi3@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As a step towards freeing the PG_error flag for other uses, change ext4
and f2fs to stop using PG_error to track decryption errors. Instead, if
a decryption error occurs, just mark the whole bio as failed. The
coarser granularity isn't really a problem since it isn't any worse than
what the block layer provides, and errors from a multi-page readahead
aren't reported to applications unless a single-page read fails too.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org> # for f2fs part
Link: https://lore.kernel.org/r/20220815235052.86545-2-ebiggers@kernel.org
Previously, we supported to account FS_CDATA_READ_IO type IO only,
in this patch, it adds to account more type IO for compressed file:
- APP_BUFFERED_CDATA_IO
- APP_MAPPED_CDATA_IO
- FS_CDATA_IO
- APP_BUFFERED_CDATA_READ_IO
- APP_MAPPED_CDATA_READ_IO
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this cycle, we mainly fixed some corner cases that manipulate a per-file
compression flag inappropriately. And, we found f2fs counted valid blocks in a
section incorrectly when zone capacity is set, and thus, fixed it with
additional sysfs entry to check it easily. Lastly, this series includes
several patches with respect to the new atomic write support such as a
couple of bug fixes and re-adding atomic_write_abort support that we removed
by mistake in the previous release.
Enhancement:
- add sysfs entries to understand atomic write operations and zone
capacity
- introduce memory mode to get a hint for low-memory devices
- adjust the waiting time of foreground GC
- decompress clusters under softirq to avoid non-deterministic latency
- do not skip updating inode when retrying to flush node page
- enforce single zone capacity
Bug fix:
- set the compression/no-compression flags correctly
- revive F2FS_IOC_ABORT_VOLATILE_WRITE
- check inline_data during compressed inode conversion
- understand zone capacity when calculating valid block count
As usual, the series includes several minor clean-ups and sanity checks.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmLxOccACgkQQBSofoJI
UNId/g/+Nx3FK874cyobE1PnPpUtfxLGqO9fjhrbje3bniTpgE9NtJUFg5hRQkxE
XHuufMrW++aBhn2ESMjbfdQ3v6vy5XUy7bi4FR71KxW4qp15mAqjTPfAZBFKZfMv
lCv54NKlura91GhI9Dl6JgGe1+MwNXIxVROyGvjXYogF0DWl+iJh4vYuCFUguiNU
mP6FmnZvbtK89jYxODoqwQaC+b6DV7ceaQ+c0dtS5TRvsUNv5mjWDeTvPMgk3At/
mAuWYXfIrf5xfDY93JPbrJhBLvu7Ey3EfXBnaFGRYbYxYYub9JZ4+/5di/rB9jRc
9AZ6LcLX3aKaT71EWa9vdCIffz8/PcSRjsmpEuVs7KNySwcnolnb1tAzlJPKy2AV
IJliY1Ef0+jrpg2lHYZoMb5qvo80c3xlyxlgZt0LSZKf1Wo41sjJVt6ZS7WLhHXu
OlzeI7lZBS9RKPUtU5cGNWkmZqamvmq09mMvqF4IUIaY40MizKZoV0yh9BjuUoxM
xniBIlC/q0HvwmbQ2OtNKDgv7+FdxrRlaDyhhkppa3UA8ZK3Edch26N9pBoh/r33
zJIR2BwCGmHz7yaX4HGzSt1phex2ABIGuZ4vBaGI7XDuYUD1tCZpC8wMCs2X3pKo
ldQz3uu0GA0BSsNKpRks2dwRF0JJVGTk8UwcSXPwTdTTdqyhmvI=
=dJ41
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we mainly fixed some corner cases that manipulate a
per-file compression flag inappropriately. And, we found f2fs counted
valid blocks in a section incorrectly when zone capacity is set, and
thus, fixed it with additional sysfs entry to check it easily.
Lastly, this series includes several patches with respect to the new
atomic write support such as a couple of bug fixes and re-adding
atomic_write_abort support that we removed by mistake in the previous
release.
Enhancements:
- add sysfs entries to understand atomic write operations and zone
capacity
- introduce memory mode to get a hint for low-memory devices
- adjust the waiting time of foreground GC
- decompress clusters under softirq to avoid non-deterministic
latency
- do not skip updating inode when retrying to flush node page
- enforce single zone capacity
Bug fixes:
- set the compression/no-compression flags correctly
- revive F2FS_IOC_ABORT_VOLATILE_WRITE
- check inline_data during compressed inode conversion
- understand zone capacity when calculating valid block count
As usual, the series includes several minor clean-ups and sanity
checks"
* tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: use onstack pages instead of pvec
f2fs: intorduce f2fs_all_cluster_page_ready
f2fs: clean up f2fs_abort_atomic_write()
f2fs: handle decompress only post processing in softirq
f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
f2fs: do not set compression bit if kernel doesn't support
f2fs: remove device type check for direct IO
f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
f2fs: fix to do sanity check on segment type in build_sit_entries()
f2fs: obsolete unused MAX_DISCARD_BLOCKS
f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
f2fs: introduce sysfs atomic write statistics
f2fs: don't bother wait_ms by foreground gc
f2fs: invalidate meta pages only for post_read required inode
f2fs: allow compression of files without blocks
f2fs: fix to check inline_data during compressed inode conversion
f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs: fix to invalidate META_MAPPING before DIO write
...
Since pvec have 15 pages, it not a multiple of 4, when write compressed
pages, write in 64K as a unit, it will call pagevec_lookup_range_tag
agagin, sometimes this will take a lot of time.
Use onstack pages instead of pvec to mitigate this problem.
Signed-off-by: Fengnan Chang <fengnanchang@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When write total cluster, all pages is uptodate, there is not need to call
f2fs_prepare_compress_overwrite, intorduce f2fs_all_cluster_page_ready
to avoid this.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Now decompression is being handled in workqueue and it makes read I/O
latency non-deterministic, because of the non-deterministic scheduling
nature of workqueues. So, I made it handled in softirq context only if
possible, not in low memory devices, since this modification will
maintain decompresion related memory a little longer.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
/58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
=DgUi
-----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
introduce the below 4 new sysfs node for atomic write statistics.
- current_atomic_write: the total current atomic write block count,
which is not committed yet.
- peak_atomic_write: the peak value of total current atomic write block
count after boot.
- committed_atomic_block: the accumulated total committed atomic write
block count after boot.
- revoked_atomic_block: the accumulated total revoked atomic write block
count after boot.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write"), invalidate_mapping_pages() will be called to
avoid race condition in between IPU/DIO and readahead for GC.
However, readahead flow is only used for post_read required inode,
so this patch adds check condition to avoids unnecessary page cache
invalidating for non-post_read inode.
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Quoted from commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write")
"
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
"
In previous commit, we missed to cover extent cache hit case, and passed
wrong value for parameter @end of invalidate_mapping_pages(), fix both
issues.
Fixes: 6aa58d8ad2 ("f2fs: readahead encrypted block during GC")
Fixes: e3b49ea368 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
Cc: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-53-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In this round, we've refactored the existing atomic write support implemented
by in-memory operations to have storing data in disk temporarily, which can give
us a benefit to accept more atomic writes. At the same time, we removed the
existing volatile write support. We've also revisited the file pinning and GC
flows and found some corner cases which contributeed abnormal system behaviours.
As usual, there're several minor code refactoring for readability, sanity check,
and clean ups.
Enhancement
- allow compression for mmap files in compress_mode=user
- kill volatile write support
- change the current atomic write way
- give priority to select unpinned section for foreground GC
- introduce data read/write showing path info
- remove unnecessary f2fs_lock_op in f2fs_new_inode
Bug fix
- fix the file pinning flow during checkpoint=disable and GCs
- fix foreground and background GCs to select the right victims and get free
sections on time
- fix GC flags on defragmenting pages
- avoid an infinite loop to flush node pages
- fix fallocate to use file_modified to update permissions consistently
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmKWfyEACgkQQBSofoJI
UNJaAQ/9Hs3aGIyriGV8CMbarklRuQ24o3khQKdia5gHseFVsydMfba8tyvl7vYV
fZnHKp9rnEV1emxWn7hHLaGOvPV8leajZqMLhqG384BIb0yoTnRipnK5t0JkoiJX
53XC5yfxQd01dwS+J4uOSu2jW0Gs6iBLD6H9ahOs86OE6jF1TeQ/fqjsrhm9I8Zr
GsNON6zxafPn248sYyVBB3Y5GjPBPf+USif3ZEidAWimW/TIGbXLUT1hA0B79YoX
DRAmN3tYS75yXauQvFPerMbOmP2gwCPcvdCI/PZ4U/ApsEPP7k1SbOZYAjjGUB30
Qn8cSMxzPZ1cHvzIC96vwJk8XPdcDhICfzROb7jJdeznD8cWTDv0E+Vd33HUf/mG
pi5Lkpc4STvYD+KUaKpdnHVg6ARWw4HOnUtW43MF3OsfuyGEEPlROs6lBVYnk/Hz
smlrgnnLMTOpH9y2JyuyExeHEJ3EAgWbJ8aRpq7Ua7FvKF45Yj1lIytWlvWXSnRf
rp+A5QJhVtYvT+y2Rk2h5oTRj/9l3+pR0X7CTOfSivJuf6aH5XVgI0EmxT2iBTCp
4SDBjLC+nXXP3EK1HamLiz1mU23Qg1Qwvx3Wc4xgdwQf3s+jyYxki9tIjzdwJCCZ
adjd3fc/GrD9UPDmJDXlD5QSoOJ94K/NOwYpu1L1/Q+dVwkl+IE=
=ta8Y
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've refactored the existing atomic write support
implemented by in-memory operations to have storing data in disk
temporarily, which can give us a benefit to accept more atomic writes.
At the same time, we removed the existing volatile write support.
We've also revisited the file pinning and GC flows and found some
corner cases which contributeed abnormal system behaviours.
As usual, there're several minor code refactoring for readability,
sanity check, and clean ups.
Enhancements:
- allow compression for mmap files in compress_mode=user
- kill volatile write support
- change the current atomic write way
- give priority to select unpinned section for foreground GC
- introduce data read/write showing path info
- remove unnecessary f2fs_lock_op in f2fs_new_inode
Bug fixes:
- fix the file pinning flow during checkpoint=disable and GCs
- fix foreground and background GCs to select the right victims and
get free sections on time
- fix GC flags on defragmenting pages
- avoid an infinite loop to flush node pages
- fix fallocate to use file_modified to update permissions
consistently"
* tag 'f2fs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
f2fs: fix to tag gcing flag on page during file defragment
f2fs: replace F2FS_I(inode) and sbi by the local variable
f2fs: add f2fs_init_write_merge_io function
f2fs: avoid unneeded error handling for revoke_entry_slab allocation
f2fs: allow compression for mmap files in compress_mode=user
f2fs: fix typo in comment
f2fs: make f2fs_read_inline_data() more readable
f2fs: fix to do sanity check for inline inode
f2fs: fix fallocate to use file_modified to update permissions consistently
f2fs: don't use casefolded comparison for "." and ".."
f2fs: do not stop GC when requiring a free section
f2fs: keep wait_ms if EAGAIN happens
f2fs: introduce f2fs_gc_control to consolidate f2fs_gc parameters
f2fs: reject test_dummy_encryption when !CONFIG_FS_ENCRYPTION
f2fs: kill volatile write support
f2fs: change the current atomic write way
f2fs: don't need inode lock for system hidden quota
f2fs: stop allocating pinned sections if EAGAIN happens
f2fs: skip GC if possible when checkpoint disabling
f2fs: give priority to select unpinned section for foreground GC
...
Almost all other initialization of variables in f2fs_fill_super are
extraced to a single function. Also do it for write_io[], which can
make code more clean.
This patch just refactors the code, theres no functional change.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Current atomic write has three major issues like below.
- keeps the updates in non-reclaimable memory space and they are even
hard to be migrated, which is not good for contiguous memory
allocation.
- disk spaces used for atomic files cannot be garbage collected, so
this makes it difficult for the filesystem to be defragmented.
- If atomic write operations hit the threshold of either memory usage
or garbage collection failure count, All the atomic write operations
will fail immediately.
To resolve the issues, I will keep a COW inode internally for all the
updates to be flushed from memory, when we need to flush them out in a
situation like high memory pressure. These COW inodes will be tagged
as orphan inodes to be reclaimed in case of sudden power-cut or system
failure during atomic writes.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
While converting f2fs_release_page() to f2fs_release_folio(), cache the
sb_info so we don't need to retrieve it twice, and remove the redundant
call to set_page_private(). The use of folios should be pushed further
into f2fs from here.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Let's attach io_flags to bio only, so that we can merge IOs given original
io_flags only.
Fixes: 64bf0eef01 ("f2fs: pass the bio operation to bio_alloc_bioset")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
I got the return value wrong. Very little checks the return value
from set_page_dirty(), so nobody noticed during testing.
Fixes: 4f5e34f713 ("f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
All filesystems have now been converted to use ->readahead, so
remove the ->readpages operation and fix all the comments that
used to refer to it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
VaTzP71C1Q==
=efaq
-----END PGP SIGNATURE-----
Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block
Pull NVMe write streams removal from Jens Axboe:
"This removes the write streams support in NVMe. No vendor ever really
shipped working support for this, and they are not interested in
supporting it.
With the NVMe support gone, we have nothing in the tree that supports
this. Remove passing around of the hints.
The only discussion point in this patchset imho is the fact that the
file specific write hint setting/getting fcntl helpers will now return
-1/EINVAL like they did before we supported write hints. No known
applications use these functions, I only know of one prototype that I
help do for RocksDB, and that's not used. That said, with a change
like this, it's always a bit controversial. Alternatively, we could
just make them return 0 and pretend it worked. It's placement based
hints after all"
* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
fs: remove fs.f_write_hint
fs: remove kiocb.ki_hint
block: remove the per-bio/request write hint
nvme: remove support or stream based temperature hint
Primarily this series converts some of the address_space operations
to take a folio instead of a page.
->is_partially_uptodate() takes a folio instead of a page and changes the
type of the 'from' and 'count' arguments to make it obvious they're bytes.
->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
->launder_page() becomes ->launder_folio()
->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
=cOiq
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
As congestion is no longer tracked, congestion_wait() is effectively
equivalent to io_schedule_timeout().
So introduce f2fs_io_schedule_timeout() which sets TASK_UNINTERRUPTIBLE
and call that instead.
Link: https://lkml.kernel.org/r/164549983744.9187.6425865370954230902.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In this cycle, f2fs has some performance improvements for Android workloads such
as using read-unfair rwsems and adding some sysfs entries to control GCs and
discard commands in more details. In addtiion, it has some tunings to improve
the recovery speed after sudden power-cut.
Enhancement:
- add reader-unfair rwsems with F2FS_UNFAIR_RWSEM
: will replace with generic API support
- adjust to make the readahead/recovery flow more efficiently
- sysfs entries to control issue speeds of GCs and Discard commands
- enable idmapped mounts
Bug fix:
- correct wrong error handling routines
- fix missing conditions in quota
- fix a potential deadlock between writeback and block plug routines
- fix a deadlock btween freezefs and evict_inode
We've added some boundary checks to avoid kernel panics on corrupted images,
and several minor code clean-ups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmI44c4ACgkQQBSofoJI
UNIqdBAAgBjV/76Gphbpg2lR5+13pWBV0jp66yYaaPiqmM6IsSPYKTlGMJpEBy41
x6M+MRc+NjtwSEHAWOptOIPbP9zwXYJn/KSDMCAP3+454YhBFDLqDAkAxBt1frYT
0EkwCIYw/LqmVnuIttQ01gnT8v5zH4d/x4+gsdM+b7flmpCP/AoZDvI19Zd66F0y
RdOdQQWyhvmmetZbaeaPoxbjS8LJ9b0ZMcxidTv9a+5GylCAXNicBdM9x1iVVmJ1
dT1n2w7USKVdL4ydpwPUiec6RwACRk49CL3FgyyGNRlcpMmU9ArcY2l/Qr+At7ky
tgPODXme/EvH12DsfoixjkNSLc4a7RHPfiJ3qy8XC6dshWYMKIegjateG8lVhf0P
kdifMRCdOa+/l+RoyD1IjKTXPmVl9ihh6RBYDr6YrFclxg3uI4CvJCXht4dSXOCE
5vLIVZEf5yk+6Ee2ozcNTG2hZ8gd+aNy1WqBN3/5lFxhBYVNlTnUYd0URzenwIdW
i2QP99mFrntCL25lhF7f7AeTHxSg/UVXnRA1oQZ+6qIPPLhNdApfd1lov/6+Hhe4
0zDbCbmIfVko/vZJeYOppaj+6jSZ3FafMfH5dDYyis4S4RbX2sjR9wGSd8PEdOTw
/4dZXXfB2XslPb3KQsJSyGz75af3PxZ8PHLxj0HBSQXOA140htY=
=t75l
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, f2fs has some performance improvements for Android
workloads such as using read-unfair rwsems and adding some sysfs
entries to control GCs and discard commands in more details. In
addtiion, it has some tunings to improve the recovery speed after
sudden power-cut.
Enhancement:
- add reader-unfair rwsems with F2FS_UNFAIR_RWSEM: will replace with
generic API support
- adjust to make the readahead/recovery flow more efficiently
- sysfs entries to control issue speeds of GCs and Discard commands
- enable idmapped mounts
Bug fix:
- correct wrong error handling routines
- fix missing conditions in quota
- fix a potential deadlock between writeback and block plug routines
- fix a deadlock btween freezefs and evict_inode
We've added some boundary checks to avoid kernel panics on corrupted
images, and several minor code clean-ups"
* tag 'f2fs-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (27 commits)
f2fs: fix to do sanity check on .cp_pack_total_block_count
f2fs: make gc_urgent and gc_segment_mode sysfs node readable
f2fs: use aggressive GC policy during f2fs_disable_checkpoint()
f2fs: fix compressed file start atomic write may cause data corruption
f2fs: initialize sbi->gc_mode explicitly
f2fs: introduce gc_urgent_mid mode
f2fs: compress: fix to print raw data size in error path of lz4 decompression
f2fs: remove redundant parameter judgment
f2fs: use spin_lock to avoid hang
f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs
f2fs: remove unnecessary read for F2FS_FITS_IN_INODE
f2fs: introduce F2FS_UNFAIR_RWSEM to support unfair rwsem
f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes
f2fs: fix to do sanity check on curseg->alloc_type
f2fs: fix to avoid potential deadlock
f2fs: quota: fix loop condition at f2fs_quota_sync()
f2fs: Restore rwsem lockdep support
f2fs: fix missing free nid in f2fs_handle_failed_inode
f2fs: support idmapped mounts
f2fs: add a way to limit roll forward recovery time
...
Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption.
There will be a merge conflict with the block pull request in
fs/iomap/direct-io.c, due to some bio interface cleanups. The merge
resolution is straightforward and can be found in linux-next.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYjgCfhQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6vjAQDp8D8OKIyj67KjYwvyHpNy0fZhxeur
RexC0nDfd9BE/AD/fV6zpCglmsuGxqxL0jmqeczKXn2y7nRFmPciCBTi/wY=
=kwNd
-----END PGP SIGNATURE-----
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fscrypt updates from Eric Biggers:
"Add support for direct I/O on encrypted files when blk-crypto (inline
encryption) is being used for file contents encryption"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fscrypt: update documentation for direct I/O support
f2fs: support direct I/O with fscrypt using blk-crypto
ext4: support direct I/O with fscrypt using blk-crypto
iomap: support direct I/O with fscrypt using blk-crypto
fscrypt: add functions for direct I/O support
When compressed file has blocks, f2fs_ioc_start_atomic_write will succeed,
but compressed flag will be remained in inode. If write partial compreseed
cluster and commit atomic write will cause data corruption.
This is the reproduction process:
Step 1:
create a compressed file ,write 64K data , call fsync(), then the blocks
are write as compressed cluster.
Step2:
iotcl(F2FS_IOC_START_ATOMIC_WRITE) --- this should be fail, but not.
write page 0 and page 3.
iotcl(F2FS_IOC_COMMIT_ATOMIC_WRITE) -- page 0 and 3 write as normal file,
Step3:
drop cache.
read page 0-4 -- Since page 0 has a valid block address, read as
non-compressed cluster, page 1 and 2 will be filled with compressed data
or zero.
The root cause is, after commit 7eab7a6968 ("f2fs: compress: remove
unneeded read when rewrite whole cluster"), in step 2, f2fs_write_begin()
only set target page dirty, and in f2fs_commit_inmem_pages(), we will write
partial raw pages into compressed cluster, result in corrupting compressed
cluster layout.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Fixes: 7eab7a6968 ("f2fs: compress: remove unneeded read when rewrite whole cluster")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Removes several calls to __set_page_dirty_nobuffers(). Also turn the
PageSwapCache() case into a BUG() as there's no way for a swapcache page
to make it to a filesystem that doesn't use SWP_FS_OPS.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
This is a minimal change which just accepts the new arguments and passes
the single struct page to the functions which do the work. There is
very little progress here toards making f2fs support large folios.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
Refactor block I/O code so that the bio operation and known flags are set
at bio allocation time. Only the later updated flags are updated on the
fly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set the bdev at bio allocation time by changing the f2fs_target_device
calling conventions, so that no bio needs to be passed in.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Quoted from Jing Xia's report, there is a potential deadlock may happen
between kworker and checkpoint as below:
[T:writeback] [T:checkpoint]
- wb_writeback
- blk_start_plug
bio contains NodeA was plugged in writeback threads
- do_writepages -- sync write inodeB, inc wb_sync_req[DATA]
- f2fs_write_data_pages
- f2fs_write_single_data_page -- write last dirty page
- f2fs_do_write_data_page
- set_page_writeback -- clear page dirty flag and
PAGECACHE_TAG_DIRTY tag in radix tree
- f2fs_outplace_write_data
- f2fs_update_data_blkaddr
- f2fs_wait_on_page_writeback -- wait NodeA to writeback here
- inode_dec_dirty_pages
- writeback_sb_inodes
- writeback_single_inode
- do_writepages
- f2fs_write_data_pages -- skip writepages due to wb_sync_req[DATA]
- wbc->pages_skipped += get_dirty_pages() -- PAGECACHE_TAG_DIRTY is not set but get_dirty_pages() returns one
- requeue_inode -- requeue inode to wb->b_dirty queue due to non-zero.pages_skipped
- blk_finish_plug
Let's try to avoid deadlock condition by forcing unplugging previous bio via
blk_finish_plug(current->plug) once we'v skipped writeback in writepages()
due to valid sbi->wb_sync_req[DATA/NODE].
Fixes: 687de7f101 ("f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data. However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.
Therefore, make f2fs support DIO on files that are using inline
encryption. Since f2fs uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:
- Let DIO proceed when supported, by checking fscrypt_dio_supported()
instead of assuming that encrypted files never support DIO.
- In f2fs_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
length of the mapping in the rare case where a DUN discontiguity
occurs in the middle of an extent. The iomap DIO implementation
requires this, since it assumes that it can submit a bio covering (up
to) the whole mapping, without checking fscrypt constraints itself.
Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Once F2FS_IPU_FORCE policy is enabled in some cases:
a) f2fs forces to use F2FS_IPU_FORCE in a small-sized volume
b) user sets F2FS_IPU_FORCE policy via sysfs
Then we may fail to defragment file due to IPU policy check, it doesn't
make sense, let's introduce a new IPU policy to allow OPU during file
defragmentation.
In small-sized volume, let's enable F2FS_IPU_HONOR_OPU_WRITE policy
by default.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.
Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
f2fs rw_semaphores work better if writers can starve readers,
especially for the checkpoint thread, because writers are strictly
more important than reader threads. This prevents significant priority
inversion between low-priority readers that blocked while trying to
acquire the read lock and a second acquisition of the write lock that
might be blocking high priority work.
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Patch series "remove Xen tmem leftovers".
Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap. This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.
This patch (of 13):
The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dc ("xen: remove tmem driver").
[akpm@linux-foundation.org: remove now-unreachable code]
Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In this round, we've tried to address some performance issues in f2fs_checkpoint
and direct IO flows. Also, there was a work to enhance the page cache management
used for compression. Other than them, we've done typical work including sysfs,
code clean-ups, tracepoint, sanity check, in addition to bug fixes on corner
cases.
Enhancement:
- use iomap for direct IO
- try to avoid lock contention to improve f2fs_ckpt speed
- avoid unnecessary memory allocation in compression flow
- POSIX_FADV_DONTNEED drops the page cache containing compression pages
- add some sysfs entries (gc_urgent_high_remaining, pending_discard)
Bug fix:
- try not to expose unwritten blocks to user by DIO
: this was added to avoid merge conflict; another patch is coming to address
other missing case.
- relax minor error condition for file pinning feature used in Android OTA
- fix potential deadlock case in compression flow
- should not truncate any block on pinned file
In addition, we've done some code clean-ups and tracepoint/sanity check
improvement.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmHnY0sACgkQQBSofoJI
UNIOkg//UmjCSSG63/YZM/lQQQe4kK/tT6QTT8W/VQtzWL9vXcL7bcaxzwX3LQbR
Gb47Zmsw9bzVJt6GQ2VRbODE1py/KPNMl5SDXJXHo6fOZ/dOnHve32gLwcLEzhPd
casB0TbwQJ6bpEsJiZ5ho741mURxUrSCHAAX6QIQVXh8ofm9qAqlWu74OLI6UHiV
MM84XmXcHtGUZG5SCTWfSCJhJM6Az/3A83ws9KVeu86dlE7IrigphU2nI2vdCKiO
trR3CiLC/364fiM+9ssLS3X2wKFPD/unEU7ljBv5UaG36jsVfW+tisjTKldzpiKK
44cNgDv1FEDxC0g3FKUhEGezAhxT8AJZB0in0zn8+5scarKGJtFCy9XhCGMVaeP+
usxvHVy8Ga1I7sMV6oHEBcGiPJWkmurzq1XXobtj6oL/JxN4gqUJeHTcod89hQHA
lx9kZs7MLKm2au+T3gZf5xyx35YCie8sY/N1qoPy8tU9Q7FJ54NdqqAc9JEZ6mSk
k9ybMaa/srHG/EI/XYPw0DrobHg6P5+bYtmsRvw2vP/nsNsD3ZI/EwBBEll2ITxC
V5Dn7MljYWI/5kB41Hl5xz6X65WeIN7koRyTXw5mp9tkNrLugqII5hzhwhSlcqJ1
3k9TAN3RbVpWHBcyryDyLbm/+dcbwIJ4v/eJEMIDk8F2SrBGOZs=
=LCJH
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've tried to address some performance issues in
f2fs_checkpoint and direct IO flows. Also, there was a work to enhance
the page cache management used for compression. Other than them, we've
done typical work including sysfs, code clean-ups, tracepoint, sanity
check, in addition to bug fixes on corner cases.
Enhancements:
- use iomap for direct IO
- try to avoid lock contention to improve f2fs_ckpt speed
- avoid unnecessary memory allocation in compression flow
- POSIX_FADV_DONTNEED drops the page cache containing compression
pages
- add some sysfs entries (gc_urgent_high_remaining, pending_discard)
Bug fixes:
- try not to expose unwritten blocks to user by DIO (this was added
to avoid merge conflict; another patch is coming to address other
missing case)
- relax minor error condition for file pinning feature used in
Android OTA
- fix potential deadlock case in compression flow
- should not truncate any block on pinned file
In addition, we've done some code clean-ups and tracepoint/sanity
check improvement"
* tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
f2fs: do not allow partial truncation on pinned file
f2fs: remove redunant invalidate compress pages
f2fs: Simplify bool conversion
f2fs: don't drop compressed page cache in .{invalidate,release}page
f2fs: fix to reserve space for IO align feature
f2fs: fix to check available space of CP area correctly in update_ckpt_flags()
f2fs: support fault injection to f2fs_trylock_op()
f2fs: clean up __find_inline_xattr() with __find_xattr()
f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()
f2fs: do not bother checkpoint by f2fs_get_node_info
f2fs: avoid down_write on nat_tree_lock during checkpoint
f2fs: compress: fix potential deadlock of compress file
f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
f2fs: add gc_urgent_high_remaining sysfs node
f2fs: fix to do sanity check in is_alive()
f2fs: fix to avoid panic in is_alive() if metadata is inconsistent
f2fs: fix to do sanity check on inode type during garbage collection
f2fs: avoid duplicate call of mark_inode_dirty
f2fs: show number of pending discard commands
f2fs: support POSIX_FADV_DONTNEED drop compressed page cache
...
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying. Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:
- a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
- a need to check for the process being signalled between failures
- the possibility that other recovery actions could be performed
- the allocation is quite deep in support code, and passing down an
extra flag to say if __GFP_NOFAIL is wanted would be clumsy.
Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.
It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.
This patch introduces memalloc_retry_wait() with takes on that
responsibility. Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used. It will wait
however is appropriate.
For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests. If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing. So there is no need for much
further waiting. memalloc_retry_wait() will wait until the current
jiffie ends. If this condition is not met, then alloc_page() won't have
waited much if at all. In that case memalloc_retry_wait() waits about
200ms. This is the delay that most current loops uses.
linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.
Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For compressed inode, in .{invalidate,release}page, we will call
f2fs_invalidate_compress_pages() to drop all compressed page cache of
current inode.
But we don't need to drop compressed page cache synchronously in
.invalidatepage, because, all trancation paths of compressed physical
block has been covered with f2fs_invalidate_compress_page().
And also we don't need to drop compressed page cache synchronously
in .releasepage, because, if there is out-of-memory, we can count
on page cache reclaim on sbi->compress_inode.
BTW, this patch may fix the issue reported below:
https://lore.kernel.org/linux-f2fs-devel/20211202092812.197647-1-changfengnan@vivo.com/T/#u
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch tries to mitigate lock contention between f2fs_write_checkpoint and
f2fs_get_node_info along with nat_tree_lock.
The idea is, if checkpoint is currently running, other threads that try to grab
nat_tree_lock would be better to wait for checkpoint.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Android OTA failed due to SBI_NEED_FSCK flag when pinning the file. Let's avoid
it since we can do in-place-updates.
Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make f2fs_file_read_iter() and f2fs_file_write_iter() use the iomap
direct I/O implementation instead of the fs/direct-io.c one.
The iomap implementation is more efficient, and it also avoids the need
to add new features and optimizations to the old implementation.
This new implementation also eliminates the need for f2fs to hook bio
submission and completion and to allocate memory per-bio. This is
because it's possible to correctly update f2fs's in-flight DIO counters
using __iomap_dio_rw() in combination with an implementation of
iomap_dio_ops::end_io() (as suggested by Christoph Hellwig).
When possible, this new implementation preserves existing f2fs behavior
such as the conditions for falling back to buffered I/O.
This patch has been tested with xfstests by running 'gce-xfstests -c
f2fs -g auto -X generic/017' with and without this patch; no regressions
were seen. (Some tests fail both before and after. generic/017 hangs
both before and after, so it had to be excluded.)
Signed-off-by: Eric Biggers <ebiggers@google.com>
[Jaegeuk Kim: use spin_lock_bh for f2fs_update_iostat in softirq]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
use iomap for direct I/O.
Note that this may be used for other things besides direct I/O in the
future; however, for now I've only tested it for direct I/O.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
DIO preallocates physical blocks before writing data, but if an error occurrs
or power-cut happens, we can see block contents from the disk. This patch tries
to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
unwritten blocks from error or power-cut.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
f2fs_write_begin() assumes that all blocks were preallocated by
default unless FI_NO_PREALLOC is explicitly set. This invites data
corruption, as there are cases in which not all blocks are preallocated.
Commit 47501f87c6 ("f2fs: preallocate DIO blocks when forcing
buffered_io") fixed one case, but there are others remaining.
Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
only gets set if all blocks for the current write were preallocated.
Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
handle some of the logic that was previously in write_iter() directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't alloc new page pointers array to replace old, just use old, introduce
valid_nr_cpages to indicate valid number of page pointers in array, try to
reduce one page array alloc and free when write compress page.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
Fixes: 6aa58d8ad2 ("f2fs: readahead encrypted block during GC")
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
when overwrite only first block of cluster, since cluster is not full, it
will call f2fs_write_raw_pages when f2fs_write_multi_pages, and cause the
whole cluster become uncompressed eventhough data can be compressed.
this may will make random write bench score reduce a lot.
root# dd if=/dev/zero of=./fio-test bs=1M count=1
root# sync
root# echo 3 > /proc/sys/vm/drop_caches
root# f2fs_io get_cblocks ./fio-test
root# dd if=/dev/zero of=./fio-test bs=4K count=1 oflag=direct conv=notrunc
w/o patch:
root# f2fs_io get_cblocks ./fio-test
189
w/ patch:
root# f2fs_io get_cblocks ./fio-test
192
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Commit 3c62be17d4 ("f2fs: support multiple devices") missed
to support direct IO for multiple device feature, this patch
adds to support the missing part of multidevice feature.
In addition, for multiple device image, we should be aware of
any issued direct write IO rather than just buffered write IO,
so that fsync and syncfs can issue a preflush command to the
device where direct write IO goes, to persist user data for
posix compliant.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this cycle, we've addressed some performance issues such as lock contention,
misbehaving compress_cache, allowing extent_cache for compressed files, and new
sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
quickly, we also added an iostat which shows the IO latencies periodically. On
the stability side, we've found two memory leakage cases in the error path in
compression flow. And, we've also fixed various corner cases in fiemap, quota,
checkpoint=disable, zstd, and so on.
Enhancement:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fix:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code paths such
as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
=W67o
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.
In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.
On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.
Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"
* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...
The prepare_compress_overwrite() gets/locks a page to prepare a read, and calls
f2fs_read_multi_pages() which checks EOF first. If there's any page beyond EOF,
we unlock the page and set cc->rpages[i] = NULL, which we can't put the page
anymore. This makes page leak, so let's fix by putting that page.
Fixes: a949dc5f2c ("f2fs: compress: fix race condition of overwrite vs truncate")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In below path, it will return ENOENT if filesystem is shutdown:
- f2fs_map_blocks
- f2fs_get_dnode_of_data
- f2fs_get_node_page
- __get_node_page
- read_node_page
- is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
return -ENOENT
- force return value from ENOENT to 0
It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.
However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.
There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.
xfstest generic/388 reports below warning:
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
iomap_apply+0x1c4/0x7b0
? dax_iomap_rw+0x1c0/0x1c0
dax_iomap_rw+0xad/0x1c0
? dax_iomap_rw+0x1c0/0x1c0
f2fs_file_write_iter+0x5ab/0x970 [f2fs]
do_iter_readv_writev+0x273/0x2e0
do_iter_write+0xab/0x1f0
vfs_iter_write+0x21/0x40
iter_file_splice_write+0x287/0x540
do_splice+0x37c/0xa60
__x64_sys_splice+0x15f/0x3a0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
dax_iomap_fault+0x44/0x70
f2fs_dax_huge_fault+0x155/0x400 [f2fs]
f2fs_dax_fault+0x18/0x30 [f2fs]
__do_fault+0x4e/0x120
do_fault+0x3cf/0x7a0
__handle_mm_fault+0xa8c/0xf20
? find_held_lock+0x39/0xd0
handle_mm_fault+0x1b6/0x480
do_user_addr_fault+0x320/0xcd0
? rcu_read_lock_sched_held+0x67/0xc0
exc_page_fault+0x77/0x3f0
? asm_exc_page_fault+0x8/0x30
asm_exc_page_fault+0x1e/0x30
Fixes: 83a3bfdb5a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.
So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.
[samples]
f2fs_ckpt-254:1-507 [003] .... 2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
f2fs_ckpt-254:1-507 [002] .... 2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch adds f2fs_sanity_check_cluster() to support doing
sanity check on cluster of compressed file, it will be triggered
from below two paths:
- __f2fs_cluster_blocks()
- f2fs_map_blocks(F2FS_GET_BLOCK_FIEMAP)
And it can detect below three kind of cluster insanity status.
C: COMPRESS_ADDR
N: NULL_ADDR or NEW_ADDR
V: valid blkaddr
*: any value
1. [*|C|*|*]
2. [C|*|C|*]
3. [C|N|N|V]
Signed-off-by: Chao Yu <chao@kernel.org>
[Nathan Chancellor: fix missing inline warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to inject fault into f2fs_kmem_cache_alloc().
Usage:
a) echo 32768 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=32768 <dev> <mountpoint>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since cluster is basic unit of compression, one cluster is compressed or
not, so we can calculate valid blocks only for first page in cluster,
the other pages just skip.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Compressed inode may suffer read performance issue due to it can not
use extent cache, so I propose to add this unaligned extent support
to improve it.
Currently, it only works in readonly format f2fs image.
Unaligned extent: in one compressed cluster, physical block number
will be less than logical block number, so we add an extra physical
block length in extent info in order to indicate such extent status.
The idea is if one whole cluster blocks are contiguous physically,
once its mapping info was readed at first time, we will cache an
unaligned (or aligned) extent info entry in extent cache, it expects
that the mapping info will be hitted when rereading cluster.
Merge policy:
- Aligned extents can be merged.
- Aligned extent and unaligned extent can not be merged.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Turned back the remmaped sector address to the address in the partition,
when ending io, for compress cache to work properly.
Fixes: 6ce19aff0b ("f2fs: compress: add compress_inode to cache
compressed blocks")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Hyeong Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
when we overwrite the whole page in cluster, we don't need read original
data before write, because after write_end(), writepages() can help to
load left data in that cluster.
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We can just check f2fs_lfs_mode() directly. The block_unaligned_IO()
check is redundant because in LFS mode, f2fs doesn't do direct I/O
writes that aren't block-aligned (due to f2fs_force_buffered_io()
returning true in this case, triggering the fallback to buffered I/O).
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Make f2fs_write_failed() take a 'struct inode' directly rather than a
'struct address_space', as this simplifies it slightly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
SBI_NEED_FSCK is an indicator that fsck.f2fs needs to be triggered, so it
is not fully critical to stop any IO writes. So, let's allow to write data
instead of reporting EIO forever given SBI_NEED_FSCK, but do keep OPU.
Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Cc: <stable@kernel.org> # v5.13+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use invalidate_lock instead of f2fs' private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: Chao Yu <yuchao0@huawei.com>
CC: linux-f2fs-devel@lists.sourceforge.net
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch supports to migrate swapfile in aligned write mode during
swapon in order to keep swapfile being aligned to section as much as
possible, then pinned swapfile will locates fully filled section which
may not affected by GC.
However, for the case that swapfile's size is not aligned to section
size, it will still leave last extent in file's tail as unaligned due
to its size is smaller than section size, like case #2.
case #1
xfs_io -f /mnt/f2fs/file -c "pwrite 0 4M" -c "fsync"
Before swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..3047]: 1123352..1126399 3048 0x1000
1: [3048..7143]: 237568..241663 4096 0x1000
2: [7144..8191]: 245760..246807 1048 0x1001
After swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..8191]: 249856..258047 8192 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)
case #2
xfs_io -f /mnt/f2fs/file -c "pwrite 0 3M" -c "fsync"
Before swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..3047]: 246808..249855 3048 0x1000
1: [3048..6143]: 237568..240663 3096 0x1001
After swapon:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 258048..262143 4096 0x1000
1: [4096..6143]: 238616..240663 2048 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile: last extent is not aligned to section
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
After commit af4b6b8edf ("f2fs: introduce check_swap_activate_fast()"),
we will never run into original logic of check_swap_activate() before
f2fs supports non 4k-sized page, so let's delete those dead codes.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Support to use address space of inner inode to cache compressed block,
in order to improve cache hit ratio of random read.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Update the logging uses that have unnecessary newlines as the f2fs_printk
function and so its f2fs_<level> macro callers already adds one.
This allows searching single line logging entries with an easier grep and
also avoids unnecessary blank lines in the logging.
Miscellanea:
o Coalesce formats
o Align to open parenthesis
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The kernel writes to swap files on f2fs directly without the assistance
of the filesystem. This direct write by kernel can be non-sequential
even when the f2fs is in LFS mode. Such non-sequential write conflicts
with the LFS semantics. Especially when f2fs is set up on zoned block
devices, the non-sequential write causes unaligned write command errors.
To avoid the non-sequential writes to swap files, prevent swap file
activation when the filesystem is in LFS mode.
Fixes: 4969c06a0d ("f2fs: support swap file w/ DIO")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.10+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Restruct f2fs page private layout for below reasons:
There are some cases that f2fs wants to set a flag in a page to
indicate a specified status of page:
a) page is in transaction list for atomic write
b) page contains dummy data for aligned write
c) page is migrating for GC
d) page contains inline data for inline inode flush
e) page belongs to merkle tree, and is verified for fsverity
f) page is dirty and has filesystem/inode reference count for writeback
g) page is temporary and has decompress io context reference for compression
There are existed places in page structure we can use to store
f2fs private status/data:
- page.flags: PG_checked, PG_private
- page.private
However it was a mess when we using them, which may cause potential
confliction:
page.private PG_private PG_checked page._refcount (+1 at most)
a) -1 set +1
b) -2 set
c), d), e) set
f) 0 set +1
g) pointer set
The other problem is page.flags has no free slot, if we can avoid set
zero to page.private and set PG_private flag, then we use non-zero value
to indicate PG_private status, so that we may have chance to reclaim
PG_private slot for other usage. [1]
The other concern is f2fs has bad scalability in aspect of indicating
more page status.
So in this patch, let's restructure f2fs' page.private as below to
solve above issues:
Layout A: lowest bit should be 1
| bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
bit 0 PAGE_PRIVATE_NOT_POINTER
bit 1 PAGE_PRIVATE_ATOMIC_WRITE
bit 2 PAGE_PRIVATE_DUMMY_WRITE
bit 3 PAGE_PRIVATE_ONGOING_MIGRATION
bit 4 PAGE_PRIVATE_INLINE_INODE
bit 5 PAGE_PRIVATE_REF_RESOURCE
bit 6- f2fs private data
Layout B: lowest bit should be 0
page.private is a wrapped pointer.
After the change:
page.private PG_private PG_checked page._refcount (+1 at most)
a) 11 set +1
b) 101 set +1
c) 1001 set +1
d) 10001 set +1
e) set
f) 100001 set +1
g) pointer set +1
[1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The final solution can be migrating blocks to form a section-aligned file
internally. Meanwhile, let's ask users to do that when preparing the swap
file initially like:
1) create()
2) ioctl(F2FS_IOC_SET_PIN_FILE)
3) fallocate()
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 36e4d95891 ("f2fs: check if swapfile is section-alligned")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we added a new mount option, "checkpoint_merge", which introduces
a kernel thread dealing with the f2fs checkpoints. Once we start to manage the
IO priority along with blk-cgroup, the checkpoint operation can be processed in
a lower priority under the process context. Since the checkpoint holds all the
filesystem operations, we give a higher priority to the checkpoint thread all
the time.
Enhancement:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fix:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In addition,
we fixed some minor race conditions and error handling routines.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCQhVIACgkQQBSofoJI
UNIggA/8DZINzFLMCj6+6P5wNAWj3nYtx/FnwZ7C31f8qkiZjgA4LfONUnDvV7sU
GS8MuLQz4eTYfqU2rVgGiSm+aCkEOovnk7C7Huo7pezgqYb+5J6ACXsqdU3dcD5M
kShJMqLKcTKqtOMbnJrdGvmw1/ysuAi7UhSSgVV+9NQxlhxADnagOGbQ7lXNSV3R
spGMWazGY2uA5DFCCa4lMX79lyFATCzEKB3SKAW5r+8QSmxJY8ViK2Er7AnvwRJz
XJ/QJ8ALNb/GGyHzBWFv3P6Yxo/G3FkUvTIc5Rhi9P2lUgjI2NALokj7AOnfNh4a
uSXHVlNrrfH+gpx9xr5z8MUmroCYCCOJ6EhnVweqViRmekY8jSb2HxmUtDTIf19U
LWl3gtD2GDQx6CY0a0K58Oa2Lp0Bp9MWUdPA/4P21EymZwXum7aCkhV+DnigcoCj
yCmKlI8nIpCS97dIO/7MsnG6Tu/7c+Prytd2ezUo+6hlkXPZk8shs+elnjlWu/6V
3ZpSWKzQsPJL8U7eB9H04AxEokrrXm3fRhR86C7JdkEe0gyGFf3dB/G+jqzcNi5m
ZpxiCeZ8RbNmPpnH8NWeYHk9uDKMOXuUPFYDoaOwImNWfqj0jfhiTxHo4MyBLAuk
MT632ICcuJLvwgnSbMAI3U7v6+dZXKH4y6U7IHFjxhI7beMzxmI=
=P5uk
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we added a new mount option, "checkpoint_merge", which
introduces a kernel thread dealing with the f2fs checkpoints. Once we
start to manage the IO priority along with blk-cgroup, the checkpoint
operation can be processed in a lower priority under the process
context. Since the checkpoint holds all the filesystem operations, we
give a higher priority to the checkpoint thread all the time.
Enhancements:
- introduce gc_merge mount option to introduce a checkpoint thread
- improve to run discard thread efficiently
- allow modular compression algorithms
- expose # of overprivision segments to sysfs
- expose runtime compression stat to sysfs
Bug fixes:
- fix OOB memory access by the node id lookup
- avoid touching checkpointed data in the checkpoint-disabled mode
- fix the resizing flow to avoid kernel panic and race conditions
- fix block allocation issues on pinned files
- address some swapfile issues
- fix hugtask problem and kernel panic during atomic write operations
- don't start checkpoint thread in RO
And, we've cleaned up some kernel coding style and build warnings. In
addition, we fixed some minor race conditions and error handling
routines"
* tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (48 commits)
f2fs: drop inplace IO if fs status is abnormal
f2fs: compress: remove unneed check condition
f2fs: clean up left deprecated IO trace codes
f2fs: avoid using native allocate_segment_by_default()
f2fs: remove unnecessary struct declaration
f2fs: fix to avoid NULL pointer dereference
f2fs: avoid duplicated codes for cleanup
f2fs: document: add description about compressed space handling
f2fs: clean up build warnings
f2fs: fix the periodic wakeups of discard thread
f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
f2fs: fix to avoid GC/mmap race with f2fs_truncate()
f2fs: set checkpoint_merge by default
f2fs: Fix a hungtask problem in atomic write
f2fs: fix to restrict mount condition on readonly block device
f2fs: introduce gc_merge mount option
f2fs: fix to cover __allocate_new_section() with curseg_lock
f2fs: fix wrong alloc_type in f2fs_do_replace_block
f2fs: delete empty compress.h
f2fs: fix a typo in inode.c
...
This patch combined the below three clean-up patches.
- modify open brace '{' following function definitions
- ERROR: spaces required around that ':'
- ERROR: spaces required before the open parenthesis '('
- ERROR: spaces prohibited before that ','
- Made suggested modifications from checkpatch in reference to WARNING:
Missing a blank line after declarations
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>