mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-16 13:50:32 +00:00
d845cd901b
2133 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
de16588a77 |
v6.6-vfs.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc
okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV
AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM=
=pSEW
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual filesystems.
Features:
- Block mode changes on symlinks and rectify our broken semantics
- Report file modifications via fsnotify() for splice
- Allow specifying an explicit timeout for the "rootwait" kernel
command line option. This allows to timeout and reboot instead of
always waiting indefinitely for the root device to show up
- Use synchronous fput for the close system call
Cleanups:
- Get rid of open-coded lockdep workarounds for async io submitters
and replace it all with a single consolidated helper
- Simplify epoll allocation helper
- Convert simple_write_begin and simple_write_end to use a folio
- Convert page_cache_pipe_buf_confirm() to use a folio
- Simplify __range_close to avoid pointless locking
- Disable per-cpu buffer head cache for isolated cpus
- Port ecryptfs to kmap_local_page() api
- Remove redundant initialization of pointer buf in pipe code
- Unexport the d_genocide() function which is only used within core
vfs
- Replace printk(KERN_ERR) and WARN_ON() with WARN()
Fixes:
- Fix various kernel-doc issues
- Fix refcount underflow for eventfds when used as EFD_SEMAPHORE
- Fix a mainly theoretical issue in devpts
- Check the return value of __getblk() in reiserfs
- Fix a racy assert in i_readcount_dec
- Fix integer conversion issues in various functions
- Fix LSM security context handling during automounts that prevented
NFS superblock sharing"
* tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
cachefiles: use kiocb_{start,end}_write() helpers
ovl: use kiocb_{start,end}_write() helpers
aio: use kiocb_{start,end}_write() helpers
io_uring: use kiocb_{start,end}_write() helpers
fs: create kiocb_{start,end}_write() helpers
fs: add kerneldoc to file_{start,end}_write() helpers
io_uring: rename kiocb_end_write() local helper
splice: Convert page_cache_pipe_buf_confirm() to use a folio
libfs: Convert simple_write_begin and simple_write_end to use a folio
fs/dcache: Replace printk and WARN_ON by WARN
fs/pipe: remove redundant initialization of pointer buf
fs: Fix kernel-doc warnings
devpts: Fix kernel-doc warnings
doc: idmappings: fix an error and rephrase a paragraph
init: Add support for rootwait timeout parameter
vfs: fix up the assert in i_readcount_dec
fs: Fix one kernel-doc comment
docs: filesystems: idmappings: clarify from where idmappings are taken
fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs
vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing
...
|
||
|
|
ecd7db2047 |
v6.6-vfs.tmpfs
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTkgAKCRCRxhvAZXjc
ouZsAPwNBHB2aPKtzWURuKx5RX02vXTzHX+A/LpuDz5WBFe8zQD+NlaBa4j0MBtS
rVYM+CjOXnjnsLc8W0euMnfYNvViKgQ=
=L2+2
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull libfs and tmpfs updates from Christian Brauner:
"This cycle saw a lot of work for tmpfs that required changes to the
vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this
cycle. Things will go back to mm next cycle.
Features
========
- By far the biggest work is the quota support for tmpfs. New tmpfs
quota infrastructure is added to support it and a new QFMT_SHMEM
uapi option is exposed.
This offers user and group quotas to tmpfs (project quotas will be
added later). Similar to other filesystems tmpfs quota are not
supported within user namespaces yet.
- Add support for user xattrs. While tmpfs already supports security
xattrs (security.*) and POSIX ACLs for a long time it lacked
support for user xattrs (user.*). With this pull request tmpfs will
be able to support a limited number of user xattrs.
This is accompanied by a fix (see below) to limit persistent simple
xattr allocations.
- Add support for stable directory offsets. Currently tmpfs relies on
the libfs provided cursor-based mechanism for readdir. This causes
issues when a tmpfs filesystem is exported via NFS.
NFS clients do not open directories. Instead, each server-side
readdir operation opens the directory, reads it, and then closes
it. Since the cursor state for that directory is associated with
the opened file it is discarded after each readdir operation. Such
directory offsets are not just cached by NFS clients but also
various userspace libraries based on these clients.
As it stands there is no way to invalidate the caches when
directory offsets have changed and the whole application depends on
unchanging directory offsets.
At LSFMM we discussed how to solve this problem and decided to
support stable directory offsets. libfs now allows filesystems like
tmpfs to use an xarrary to map a directory offset to a dentry. This
mechanism is currently only used by tmpfs but can be supported by
others as well.
Fixes
=====
- Change persistent simple xattrs allocations in libfs from
GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory
cgroup limits. Since this is a change to libfs it affects both
tmpfs and kernfs.
- Correctly verify {g,u}id mount options.
A new filesystem context is created via fsopen() which records the
namespace that becomes the owning namespace of the superblock when
fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are
mountable in namespaces. However, fsconfig() calls can occur in a
namespace different from the namespace where fsopen() has been
called.
Currently, when fsconfig() is called to set {g,u}id mount options
the requested {g,u}id is mapped into a k{g,u}id according to the
namespace where fsconfig() was called from. The resulting k{g,u}id
is not guaranteed to be resolvable in the namespace of the
filesystem (the one that fsopen() was called in).
This means it's possible for an unprivileged user to create files
owned by any group in a tmpfs mount since it's possible to set the
setid bits on the tmpfs directory.
The contract for {g,u}id mount options and {g,u}id values in
general set from userspace has always been that they are translated
according to the caller's idmapping. In so far, tmpfs has been
doing the correct thing. But since tmpfs is mountable in
unprivileged contexts it is also necessary to verify that the
resulting {k,g}uid is representable in the namespace of the
superblock to avoid such bugs.
The new mount api's cross-namespace delegation abilities are
already widely used. Having talked to a bunch of userspace this is
the most faithful solution with minimal regression risks"
* tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs
mm: invalidation check mapping before folio_contains
tmpfs: trivial support for direct IO
tmpfs,xattr: enable limited user extended attributes
tmpfs: track free_ispace instead of free_inodes
xattr: simple_xattr_set() return old_xattr to be freed
tmpfs: verify {g,u}id mount options correctly
shmem: move spinlock into shmem_recalc_inode() to fix quota support
libfs: Remove parent dentry locking in offset_iterate_dir()
libfs: Add a lock class for the offset map's xa_lock
shmem: stable directory offsets
shmem: Refactor shmem_symlink()
libfs: Add directory operations for stable offsets
shmem: fix quota lock nesting in huge hole handling
shmem: Add default quota limit mount options
shmem: quota support
shmem: prepare shmem quota infrastructure
quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks
shmem: make shmem_get_inode() return ERR_PTR instead of NULL
shmem: make shmem_inode_acct_block() return error
|
||
|
|
615e95831e |
v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
=eJz5
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
|
||
|
|
3fb5a6562a |
New code for 6.6:
* Allow the kernel to initiate a freeze of a filesystem. The kernel
and userspace can both hold a freeze on a filesystem at the same
time; the freeze is not lifted until /both/ holders lift it. This
will enable us to fix a longstanding bug in XFS online fsck.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZLVnJwAKCRBKO3ySh0YR
pqVIAP9u9CZEJ2Zcc7YpBj1MLUQGr2xBmz8RJEVJbQHKVgYcQwEA9BNb4eH4i2Af
K7Qp0OGNgyzZw37lN23Uf/SDuBK2QgM=
=seMl
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXo2AAKCRCRxhvAZXjc
ojDfAQDguc2saF8WLeXtn2O0pGOW8vTrhpwiFHNI6hwdzf07/AD+LGBpFEqYKyX5
NHPzdR7YYpJoTsQzR4JFJVZqN9Q1xgU=
=wDq0
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.6-merge-2' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull filesystem freezing updates from Darrick Wong:
New code for 6.6:
* Allow the kernel to initiate a freeze of a filesystem. The kernel
and userspace can both hold a freeze on a filesystem at the same
time; the freeze is not lifted until /both/ holders lift it. This
will enable us to fix a longstanding bug in XFS online fsck.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Message-Id: <20230822182604.GB11286@frogsfrogsfrogs>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
2c18a63b76 |
super: wait until we passed kill super
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again. That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet. We can now do this as we can make iterators through @fs_super and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead so until sb->kill_sb() has been called and SB_DEAD been set. Concurrent iterators can already discard any SB_DYING superblock. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-4-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
5e87491415 |
super: wait for nascent superblocks
Recent patches experiment with making it possible to allocate a new
superblock before opening the relevant block device. Naturally this has
intricate side-effects that we get to learn about while developing this.
Superblock allocators such as sget{_fc}() return with s_umount of the
new superblock held and lock ordering currently requires that block
level locks such as bdev_lock and open_mutex rank above s_umount.
Before
|
||
|
|
ed0360bbab |
fs: create kiocb_{start,end}_write() helpers
aio, io_uring, cachefiles and overlayfs, all open code an ugly variant
of file_{start,end}_write() to silence lockdep warnings.
Create helpers for this lockdep dance so we can use the helpers in all
the callers.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <20230817141337.1025891-4-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
f6c05b9e5d |
fs: add kerneldoc to file_{start,end}_write() helpers
and use sb_end_write() instead of open coded version. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
38bcdd3893 |
fs: remove get_super
get_super is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-17-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
aee79d4e52 |
fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.
When running UnixBench/Shell Scripts, we observed high false sharing for
accessing i_mmap against i_mmap_rwsem.
UnixBench/Shell Scripts are typical load/execute command test scenarios,
which concurrently launch->execute->exit a lot of shell commands. A lot
of processes invoke vma_interval_tree_remove which touch "i_mmap", the
call stack:
----vma_interval_tree_remove
|----unlink_file_vma
| free_pgtables
| |----exit_mmap
| | mmput
| | |----begin_new_exec
| | | load_elf_binary
| | | bprm_execve
Meanwhile, there are a lot of processes touch 'i_mmap_rwsem' to acquire
the semaphore in order to access 'i_mmap'. In existing 'address_space'
layout, 'i_mmap' and 'i_mmap_rwsem' are in the same cacheline.
The patch places the i_mmap and i_mmap_rwsem in separate cache lines to
avoid this false sharing problem.
With this patch, based on kernel v6.4.0, on Intel Sapphire Rapids
112C/224T platform, the score improves by ~5.3%. And perf c2c tool shows
the false sharing is resolved as expected, the symbol
vma_interval_tree_remove disappeared in cache line 0 after this change.
Baseline:
=================================================
Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
0 3729 5791 0 0 0xff19b3818445c740
-------------------------------------------------------------
3.27% 3.02% 0.00% 0.00% 0x18 0 1 0xffffffffa194403b 604 483 389 692 203 [k] vma_interval_tree_insert [kernel.kallsyms] vma_interval_tree_insert+75 0 1
4.13% 3.63% 0.00% 0.00% 0x20 0 1 0xffffffffa19440a2 553 413 415 962 215 [k] vma_interval_tree_remove [kernel.kallsyms] vma_interval_tree_remove+18 0 1
2.04% 1.35% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1d6 1210 855 460 1229 222 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1
0.62% 1.85% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1bf 762 329 577 527 198 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1
0.48% 0.31% 0.00% 0.00% 0x28 0 1 0xffffffffa219a58c 1677 1476 733 1544 224 [k] down_write [kernel.kallsyms] down_write+28 0 1
0.05% 0.07% 0.00% 0.00% 0x28 0 1 0xffffffffa219a21d 1040 819 689 33 27 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+749 0 1
0.00% 0.05% 0.00% 0.00% 0x28 0 1 0xffffffffa17707db 0 1005 786 1373 223 [k] up_write [kernel.kallsyms] up_write+27 0 1
0.00% 0.02% 0.00% 0.00% 0x28 0 1 0xffffffffa219a064 0 233 778 32 30 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1
33.82% 34.10% 0.00% 0.00% 0x30 0 1 0xffffffffa1770945 779 495 534 6011 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1
17.06% 15.28% 0.00% 0.00% 0x30 0 1 0xffffffffa1770915 593 438 468 2715 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1
3.54% 3.52% 0.00% 0.00% 0x30 0 1 0xffffffffa2199f84 881 601 583 1421 223 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+84 0 1
With this change:
-------------------------------------------------------------
0 556 838 0 0 0xff2780d7965d2780
-------------------------------------------------------------
0.18% 0.60% 0.00% 0.00% 0x8 0 1 0xffffffffafff27b8 503 453 569 14 13 [k] do_dentry_open [kernel.kallsyms] do_dentry_open+456 0 1
0.54% 0.12% 0.00% 0.00% 0x8 0 1 0xffffffffaffc51ac 510 199 428 15 12 [k] hugepage_vma_check [kernel.kallsyms] hugepage_vma_check+252 0 1
1.80% 2.15% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1d6 1778 799 343 215 136 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1
0.54% 1.31% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1bf 547 296 528 91 71 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1
0.72% 0.72% 0.00% 0.00% 0x18 0 1 0xffffffffb079a58c 1479 1534 676 288 163 [k] down_write [kernel.kallsyms] down_write+28 0 1
0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffafd707db 0 2381 744 282 158 [k] up_write [kernel.kallsyms] up_write+27 0 1
0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffb079a064 0 239 518 6 6 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1
46.58% 47.02% 0.00% 0.00% 0x20 0 1 0xffffffffafd70945 704 403 499 1137 219 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1
23.92% 25.78% 0.00% 0.00% 0x20 0 1 0xffffffffafd70915 558 413 500 542 185 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1
v1->v2: change padding to exchange fields.
Link: https://lkml.kernel.org/r/20230716145653.20122-1-lipeng.zhu@intel.com
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Yu Ma <yu.ma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
45e0d4b95b |
vfs: fix up the assert in i_readcount_dec
Drops a race where 2 threads could spot a positive value and both proceed to dec to -1, without reporting anything. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Message-Id: <20230811194814.1612336-1-mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
ffb6cf19e0
|
fs: add infrastructure for multigrain timestamps
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. POSIX generally mandates that when the the mtime changes, the ctime must also change. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Later patches will convert individual filesystems to use the new infrastructure. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-9-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
913e99287b
|
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
6faddda69f |
libfs: Add directory operations for stable offsets
Create a vector of directory operations in fs/libfs.c that handles directory seeks and readdir via stable offsets instead of the current cursor-based mechanism. For the moment these are unused. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Message-Id: <168814732984.530310.11190772066786107220.stgit@manet.1015granger.net> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
541d4c798a |
fs: drop the timespec64 arg from generic_update_time
In future patches we're going to change how the ctime is updated to keep track of when it has been queried. The way that the update_time operation works (and a lot of its callers) make this difficult, since they grab a timestamp early and then pass it down to eventually be copied into the inode. All of the existing update_time callers pass in the result of current_time() in some fashion. Drop the "time" parameter from generic_update_time, and rework it to fetch its own timestamp. This change means that an update_time could fetch a different timestamp than was seen in inode_needs_update_time. update_time is only ever called with one of two flag combinations: Either S_ATIME is set, or S_MTIME|S_CTIME|S_VERSION are set. With this change we now treat the flags argument as an indicator that some value needed to be updated when last checked, rather than an indication to update specific timestamps. Rework the logic for updating the timestamps and put it in a new inode_update_timestamps helper that other update_time routines can use. S_ATIME is as treated as we always have, but if any of the other three are set, then we attempt to update all three. Also, some callers of generic_update_time need to know what timestamps were actually updated. Change it to return an S_* flag mask to indicate that and rework the callers to expect it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
0d72b92883 |
fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE_COOKIE) that can have side effects when it is reported, and we're looking at adding more with the addition of multigrain timestamps. Add a request_mask argument to generic_fillattr and have most callers just pass in the value that is passed to getattr. Have other callers (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of STATX_CHANGE_COOKIE into generic_fillattr. Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
3e32715496
|
vfs: get rid of old '->iterate' directory operation
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
9cf3516c29 |
fs: add IOCB flags related to passing back dio completions
Async dio completions generally happen from hard/soft IRQ context, which means that users like iomap may need to defer some of the completion handling to a workqueue. This is less efficient than having the original issuer handle it, like we do for sync IO, and it adds latency to the completions. Add IOCB_DIO_CALLER_COMP, which the issuer can set if it is able to safely punt these completions to a safe context. If the dio handler is aware of this flag, assign a callback handler in kiocb->dio_complete and associated data io kiocb->private. The issuer will then call this handler with that data from task context. No functional changes in this patch. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
13bc244578 |
fs: rename i_ctime field to __i_ctime
Now that everything in-tree is converted to use the accessor functions, rename the i_ctime field in the inode to discourage direct access. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705185812.579118-4-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
880b957785 |
fs: distinguish between user initiated freeze and kernel initiated freeze
Userspace can freeze a filesystem using the FIFREEZE ioctl or by
suspending the block device; this state persists until userspace thaws
the filesystem with the FITHAW ioctl or resuming the block device.
Since commit
|
||
|
|
ed5f17f66e |
fs: Pass argument to fcntl_setlease as int
The interface for fcntl expects the argument passed for the command F_SETLEASE to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: linux-nfs@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-3-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
bccb5c397f |
fcntl: Cast commands with int args explicitly
According to the fcntl API specification commands that expect an integer, hence not a pointer, always take an int and not long. In order to avoid access to undefined bits, we should explicitly cast the argument to int. Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-2-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
0c4767923e |
fs: new helper: simple_rename_timestamp
A rename potentially involves updating 4 different inode timestamps. Add a function that handles the details sanely, and convert the libfs.c callers to use it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705185812.579118-3-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
9b6304c1d5 |
fs: add ctime accessors infrastructure
struct timespec64 has unused bits in the tv_nsec field that can be used for other purposes. In future patches, we're going to change how the inode->i_ctime is accessed in certain inodes in order to make use of them. In order to do that safely though, we'll need to eradicate raw accesses of the inode->i_ctime field from the kernel. Add new accessor functions for the ctime that we use to replace them. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Message-Id: <20230705185812.579118-2-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
c6b0271053 |
\n
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmScT18ACgkQnJ2qBz9k
QNnlqAf/bIU+I3Qd3EUpzWrOXEyRjaUggRnb4ibIH2I6DjSAP4wtm5wiG/+wjDFe
v+gdRd8PlAlHbZJvW3WUxeSzWendqd78i2lgwFN+s2QCVtQSUsNy7mtUvOL2b1zy
Kf35vTNbkKE0TevoqHZmoT/mehSBj6Zt4k5POMalfxwnJHoVF25OqHEQQc8vnOjv
as/uMaHVwK/Q0pMafTz8vt9Fogkdqe6A+qLLxTvG6iQKd2Z0NdYK2GxR0oTVhDOK
Ly+h1evRldgOcrishrje00LZT8SznUQkWBjIpPN/HbXR1qc5Jk+BYJUqT2jg7zVd
EW61U79nsaugpTUicpTUIluUZ7/QKA==
=toKL
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc filesystem updates from Jan Kara:
- Rewrite kmap_local() handling in ext2
- Convert ext2 direct IO path to iomap (with some infrastructure tweaks
associated with that)
- Convert two boilerplate licenses in udf to SPDX identifiers
- Other small udf, ext2, and quota fixes and cleanups
* tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix uninitialized array access for some pathnames
ext2: Drop fragment support
quota: fix warning in dqgrab()
quota: Properly disable quotas when add_dquot_ref() fails
fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier
fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier
fs: Drop wait_unfrozen wait queue
ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore
ext2_{set_link,delete_entry}(): don't bother with page_addr
ext2_put_page(): accept any pointer within the page
ext2_get_page(): saner type
ext2: use offset_in_page() instead of open-coding it as subtraction
ext2_rename(): set_link and delete_entry may fail
ext2: Add direct-io trace points
ext2: Move direct-io to use iomap
ext2: Use generic_buffers_fsync() implementation
ext4: Use generic_buffers_fsync_noflush() implementation
fs/buffer.c: Add generic_buffers_fsync*() implementation
ext2/dax: Fix ext2_setsize when len is page aligned
|
||
|
|
3a8a670eee |
Networking changes for 6.5.
Core
----
- Rework the sendpage & splice implementations. Instead of feeding
data into sockets page by page extend sendmsg handlers to support
taking a reference on the data, controlled by a new flag called
MSG_SPLICE_PAGES. Rework the handling of unexpected-end-of-file
to invoke an additional callback instead of trying to predict what
the right combination of MORE/NOTLAST flags is.
Remove the MSG_SENDPAGE_NOTLAST flag completely.
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid.
- Enable socket busy polling with CONFIG_RT.
- Improve reliability and efficiency of reporting for ref_tracker.
- Auto-generate a user space C library for various Netlink families.
Protocols
---------
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2].
- Use per-VMA locking for "page-flipping" TCP receive zerocopy.
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags.
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative.
- Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO).
- Avoid waking up applications using TLS sockets until we have
a full record.
- Allow using kernel memory for protocol ioctl callbacks, paving
the way to issuing ioctls over io_uring.
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address.
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch.
- PPPoE: make number of hash bits configurable.
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig).
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge).
- Support matching on Connectivity Fault Management (CFM) packets.
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug.
- HSR: don't enable promiscuous mode if device offloads the proto.
- Support active scanning in IEEE 802.15.4.
- Continue work on Multi-Link Operation for WiFi 7.
BPF
---
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used,
or in fact passing the verifier at all for complex programs,
especially those using open-coded iterators.
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what
the output buffer *should* be, without writing anything.
- Accept dynptr memory as memory arguments passed to helpers.
- Add routing table ID to bpf_fib_lookup BPF helper.
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands.
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only).
- Show target_{obj,btf}_id in tracing link fdinfo.
- Addition of several new kfuncs (most of the names are self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter
---------
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value.
- Increase ip_vs_conn_tab_bits range for 64BIT builds.
- Allow updating size of a set.
- Improve NAT tuple selection when connection is closing.
Driver API
----------
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out).
- Support configuring rate selection pins of SFP modules.
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines.
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer.
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio).
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message.
- Split devlink instance and devlink port operations.
New hardware / drivers
----------------------
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers
-------
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is
a variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split
the different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and
Enhanced MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmSbJM4ACgkQMUZtbf5S
IrtoDhAAhEim1+LBIKf4lhPcVdZ2p/TkpnwTz5jsTwSeRBAxTwuNJ2fQhFXg13E3
MnRq6QaEp8G4/tA/gynLvQop+FEZEnv+horP0zf/XLcC8euU7UrKdrpt/4xxdP07
IL/fFWsoUGNO+L9LNaHwBo8g7nHvOkPscHEBHc2Xrvzab56TJk6vPySfLqcpKlNZ
CHWDwTpgRqNZzSKiSpoMVd9OVMKUXcPYHpDmfEJ5l+e8vTXmZzOLHrSELHU5nP5f
mHV7gxkDCTshoGcaed7UTiOvgu1p6E5EchDJxiLaSUbgsd8SZ3u4oXwRxgj33RK/
fB2+UaLrRt/DdlHvT/Ph8e8Ygu77yIXMjT49jsfur/zVA0HEA2dFb7V6QlsYRmQp
J25pnrdXmE15llgqsC0/UOW5J1laTjII+T2T70UOAqQl4LWYAQDG4WwsAqTzU0KY
dueydDouTp9XC2WYrRUEQxJUzxaOaazskDUHc5c8oHp/zVBT+djdgtvVR9+gi6+7
yy4elI77FlEEqL0ItdU/lSWINayAlPLsIHkMyhSGKX0XDpKjeycPqkNx4UterXB/
JKIR5RBWllRft+igIngIkKX0tJGMU0whngiw7d1WLw25wgu4sB53hiWWoSba14hv
tXMxwZs5iGaPcT38oRVMZz8I1kJM4Dz3SyI7twVvi4RUut64EG4=
=9i4I
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
|
||
|
|
6e17c6de3d |
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing. - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability. - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning. - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface. - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree. - Johannes Weiner has done some cleanup work on the compaction code. - David Hildenbrand has contributed additional selftests for get_user_pages(). - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code. - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code. - Huang Ying has done some maintenance on the swap code's usage of device refcounting. - Christoph Hellwig has some cleanups for the filemap/directio code. - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses. - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings. - John Hubbard has a series of fixes to the MM selftesting code. - ZhangPeng continues the folio conversion campaign. - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock. - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8. - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management. - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code. - Vishal Moola also has done some folio conversion work. - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh J0qXCzqaaN8+BuEyLGDVPaXur9KirwY= =B7yQ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ... |
||
|
|
a0433f8cae |
for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
|
||
|
|
3eccc0c886 |
for-6.5/splice-2023-06-23
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
M3VxWD3GZQ==
=KhW4
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
|
||
|
|
1f2300a738 |
v6.5/vfs.file
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4WgAKCRCRxhvAZXjc
oofvAQDs9RJwQUyWHJmQA+tWz5cUE5DviVWCwwul5dQRRCqgaQEA2OIO0gPFaVoq
1OYOeLyUjl/cpS8e3u4uJtw34jttdQA=
=AwcR
-----END PGP SIGNATURE-----
Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs file handling updates from Christian Brauner:
"This contains Amir's work to fix a long-standing problem where an
unprivileged overlayfs mount can be used to avoid fanotify permission
events that were requested for an inode or superblock on the
underlying filesystem.
Some background about files opened in overlayfs. If a file is opened
in overlayfs @file->f_path will refer to a "fake" path. What this
means is that while @file->f_inode will refer to inode of the
underlying layer, @file->f_path refers to an overlayfs
{dentry,vfsmount} pair. The reasons for doing this are out of scope
here but it is the reason why the vfs has been providing the
open_with_fake_path() helper for overlayfs for very long time now. So
nothing new here.
This is for sure not very elegant and everyone including the overlayfs
maintainers agree. Improving this significantly would involve more
fragile and potentially rather invasive changes.
In various codepaths access to the path of the underlying filesystem
is needed for such hybrid file. The best example is fsnotify where
this becomes security relevant. Passing the overlayfs
@file->f_path->dentry will cause fsnotify to skip generating fsnotify
events registered on the underlying inode or superblock.
To fix this we extend the vfs provided open_with_fake_path() concept
for overlayfs to create a backing file container that holds the real
path and to expose a helper that can be used by relevant callers to
get access to the path of the underlying filesystem through the new
file_real_path() helper. This pattern is similar to what we do in
d_real() and d_real_inode().
The first beneficiary is fsnotify and fixes the security sensitive
problem mentioned above.
There's a couple of nice cleanups included as well.
Over time, the old open_with_fake_path() helper added specifically for
overlayfs a long time ago started to get used in other places such as
cachefiles. Even though cachefiles have nothing to do with hybrid
files.
The only reason cachefiles used that concept was that files opened
with open_with_fake_path() aren't charged against the caller's open
file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
both overlayfs and cachefiles need to ensure to not overcharge the
caller for their internal open calls.
So this work disentangles FMODE_NOACCOUNT use cases and backing file
use-cases by adding the FMODE_BACKING flag which indicates that the
file can be used to retrieve the backing file of another filesystem.
(Fyi, Jens will be sending you a really nice cleanup from Christoph
that gets rid of 3 FMODE_* flags otherwise this would be the last
fmode_t bit we'd be using.)
So now overlayfs becomes the sole user of the renamed
open_with_fake_path() helper which is now named backing_file_open().
For internal kernel users such as cachefiles that are only interested
in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
kernel_file_open() helper which opens a file without being charged
against the caller's open file limit. All new helpers are properly
documented and clearly annotated to mention their special uses.
We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
distinguish it from vfs_tmpfile() and align it the other kernel_*()
internal helpers"
* tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
ovl: enable fsnotify events on underlying real files
fs: use backing_file container for internal files with "fake" f_path
fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
fs: use a helper for opening kernel internal files
fs: rename {vfs,kernel}_tmpfile_open()
|
||
|
|
64bf6ae93e |
v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
=zdfx
-----END PGP SIGNATURE-----
Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Miscellaneous features, cleanups, and fixes for vfs and individual fs
Features:
- Use mode 0600 for file created by cachefilesd so it can be run by
unprivileged users. This aligns them with directories which are
already created with mode 0700 by cachefilesd
- Reorder a few members in struct file to prevent some false sharing
scenarios
- Indicate that an eventfd is used a semaphore in the eventfd's
fdinfo procfs file
- Add a missing uapi header for eventfd exposing relevant uapi
defines
- Let the VFS protect transitions of a superblock from read-only to
read-write in addition to the protection it already provides for
transitions from read-write to read-only. Protecting read-only to
read-write transitions allows filesystems such as ext4 to perform
internal writes, keeping writers away until the transition is
completed
Cleanups:
- Arnd removed the architecture specific arch_report_meminfo()
prototypes and added a generic one into procfs.h. Note, we got a
report about a warning in amdpgpu codepaths that suggested this was
bisectable to this change but we concluded it was a false positive
- Remove unused parameters from split_fs_names()
- Rename put_and_unmap_page() to unmap_and_put_page() to let the name
reflect the order of the cleanup operation that has to unmap before
the actual put
- Unexport buffer_check_dirty_writeback() as it is not used outside
of block device aops
- Stop allocating aio rings from highmem
- Protecting read-{only,write} transitions in the VFS used open-coded
barriers in various places. Replace them with proper little helpers
and document both the helpers and all barrier interactions involved
when transitioning between read-{only,write} states
- Use flexible array members in old readdir codepaths
Fixes:
- Use the correct type __poll_t for epoll and eventfd
- Replace all deprecated strlcpy() invocations, whose return value
isn't checked with an equivalent strscpy() call
- Fix some kernel-doc warnings in fs/open.c
- Reduce the stack usage in jffs2's xattr codepaths finally getting
rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
royally annoying compilation warning
- Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
fmode_t is required to avoid fmode_t to integer degradation
warnings
- Create coredumps with O_WRONLY instead of O_RDWR. There's a long
explanation in that commit how O_RDWR is actually a bug which we
found out with the help of Linus and git archeology
- Fix "no previous prototype" warnings in the pipe codepaths
- Add overflow calculations for remap_verify_area() as a signed
addition overflow could be triggered in xfstests
- Fix a null pointer dereference in sysv
- Use an unsigned variable for length calculations in jfs avoiding
compilation warnings with gcc 13
- Fix a dangling pipe pointer in the watch queue codepath
- The legacy mount option parser provided as a fallback by the VFS
for filesystems not yet converted to the new mount api did prefix
the generated mount option string with a leading ',' causing issues
for some filesystems
- Fix a repeated word in a comment in fs.h
- autofs: Update the ctime when mtime is updated as mandated by
POSIX"
* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
readdir: Replace one-element arrays with flexible-array members
fs: Provide helpers for manipulating sb->s_readonly_remount
fs: Protect reconfiguration of sb read-write from racing writes
eventfd: add a uapi header for eventfd userspace APIs
autofs: set ctime as well when mtime changes on a dir
eventfd: show the EFD_SEMAPHORE flag in fdinfo
fs/aio: Stop allocating aio rings from HIGHMEM
fs: Fix comment typo
fs: unexport buffer_check_dirty_writeback
fs: avoid empty option when generating legacy mount string
watch_queue: prevent dangling pipe pointer
fs.h: Optimize file struct to prevent false sharing
highmem: Rename put_and_unmap_page() to unmap_and_put_page()
cachefiles: Allow the cache to be non-root
init: remove unused names parameter in split_fs_names()
jfs: Use unsigned variable for length calculations
fs/sysv: Null check to prevent null-ptr-deref bug
fs: use UB-safe check for signed addition overflow in remap_verify_area
procfs: consolidate arch_report_meminfo declaration
fs: pipe: reveal missing function protoypes
...
|
||
|
|
63773d2b59 | Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. | ||
|
|
d7439fb1f4 |
fs: Provide helpers for manipulating sb->s_readonly_remount
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Message-Id: <20230620112832.5158-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
62d53c4a1d
|
fs: use backing_file container for internal files with "fake" f_path
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
cbb0b9d4bb
|
fs: use a helper for opening kernel internal files
cachefiles uses kernel_open_tmpfile() to open kernel internal tmpfile without accounting for nr_files. cachefiles uses open_with_fake_path() for the same reason without the need for a fake path. Fork open_with_fake_path() to kernel_file_open() which only does the noaccount part and use it in cachefiles. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
d56e0ddb8f
|
fs: rename {vfs,kernel}_tmpfile_open()
Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile without accounting for nr_files. Rename this helper to kernel_tmpfile_open() to better reflect this helper is used for kernel internal users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
a3bbdc52c3 |
Remove file->f_op->sendpage
Remove file->f_op->sendpage as splicing to a socket now calls sendmsg rather than sendpage. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
0733ad8002 |
fs: remove the now unused FMODE_* flags
FMODE_NDELAY, FMODE_EXCL and FMODE_WRITE_IOCTL were only used for block internal purposed and are now entirely unused, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-31-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
81b1fb7d17 |
fs: remove sb->s_mode
There is no real need to store the open mode in the super_block now. It is only used by f2fs, which can easily recalculate it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
b6334e2cd4 |
fs: Fix comment typo
Delete duplicated word in comment. Signed-off-by: Mao Zhu <zhumao001@208suo.com> Message-Id: <20230611123314.5282-1-dengshaomin@cdjrlc.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
44fff0fa08 |
fs: factor out a direct_write_fallback helper
Add a helper dealing with handling the syncing of a buffered write fallback for direct I/O. Link: https://lkml.kernel.org/r/20230601145904.1385409-10-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c402a9a943 |
filemap: add a kiocb_invalidate_post_direct_write helper
Add a helper to invalidate page cache after a dio write. Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2bfc668509 |
splice, net: Add a splice_eof op to file-ops and socket-ops
Add an optional method, ->splice_eof(), to allow splice to indicate the premature termination of a splice to struct file_operations and struct proto_ops. This is called if sendfile() or splice() encounters all of the following conditions inside splice_direct_to_actor(): (1) the user did not set SPLICE_F_MORE (splice only), and (2) an EOF condition occurred (->splice_read() returned 0), and (3) we haven't read enough to fulfill the request (ie. len > 0 still), and (4) we have already spliced at least one byte. A further patch will modify the behaviour of SPLICE_F_MORE to always be passed to the actor if either the user set it or we haven't yet read sufficient data to fulfill the request. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wh=V579PDYvkpnTobCLGczbgxpMgGmmhqiTyE34Cpi5Gg@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Matthew Wilcox <willy@infradead.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: David Hildenbrand <david@redhat.com> cc: Christian Brauner <brauner@kernel.org> cc: Chuck Lever <chuck.lever@oracle.com> cc: Boris Pismenny <borisp@nvidia.com> cc: John Fastabend <john.fastabend@gmail.com> cc: linux-mm@kvack.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
2dc334f1a6 |
splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
Replace generic_splice_sendpage() + splice_from_pipe + pipe_to_sendpage() with a net-specific handler, splice_to_socket(), that calls sendmsg() with MSG_SPLICE_PAGES set instead of calling ->sendpage(). MSG_MORE is used to indicate if the sendmsg() is expected to be followed with more data. This allows multiple pipe-buffer pages to be passed in a single call in a BVEC iterator, allowing the processing to be pushed down to a loop in the protocol driver. This helps pave the way for passing multipage folios down too. Protocols that haven't been converted to handle MSG_SPLICE_PAGES yet should just ignore it and do a normal sendmsg() for now - although that may be a bit slower as it may copy everything. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
87efb39075 |
fs: add a method to shut down the file system
Add a new ->shutdown super operation that can be used to tell the file system to shut down, and call it from newly created holder ops when the block device under a file system shuts down. This only covers the main block device for "simple" file systems using get_tree_bdev / mount_bdev. File systems their own get_tree method or opening additional devices will need to set up their own blk_holder_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
a7bc2e8ddf |
fs.h: Optimize file struct to prevent false sharing
In the syscall test of UnixBench, performance regression occurred due
to false sharing.
The lock and atomic members, including file::f_lock, file::f_count and
file::f_pos_lock are highly contended and frequently updated in the
high-concurrency test scenarios. perf c2c indentified one affected
read access, file::f_op.
To prevent false sharing, the layout of file struct is changed as
following
(A) f_lock, f_count and f_pos_lock are put together to share the same
cache line.
(B) The read mostly members, including f_path, f_inode, f_op are put
into a separate cache line.
(C) f_mode is put together with f_count, since they are used frequently
at the same time.
Due to '__randomize_layout' attribute of file struct, the updated layout
only can be effective when CONFIG_RANDSTRUCT_NONE is 'y'.
The optimization has been validated in the syscall test of UnixBench.
performance gain is 30~50%. Furthermore, to confirm the optimization
effectiveness on the other codes path, the results of fsdisk, fsbuffer
and fstime are also shown.
Here are the detailed test results of unixbench.
Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk
Without Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks 875052.1 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 235484.0 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 2815153.5 KBps (30.0 s, 2 samples)
System Call Overhead 5772268.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 875052.1 2209.7
File Copy 256 bufsize 500 maxblocks 1655.0 235484.0 1422.9
File Copy 4096 bufsize 8000 maxblocks 5800.0 2815153.5 4853.7
System Call Overhead 15000.0 5772268.3 3848.2
========
System Benchmarks Index Score (Partial Only) 2768.3
With Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks 1009977.2 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 264765.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 3052236.0 KBps (30.0 s, 2 samples)
System Call Overhead 8237404.4 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
File Copy 1024 bufsize 2000 maxblocks 3960.0 1009977.2 2550.4
File Copy 256 bufsize 500 maxblocks 1655.0 264765.9 1599.8
File Copy 4096 bufsize 8000 maxblocks 5800.0 3052236.0 5262.5
System Call Overhead 15000.0 8237404.4 5491.6
========
System Benchmarks Index Score (Partial Only) 3295.3
Signed-off-by: chenzhiyin <zhiyin.chen@intel.com>
Message-Id: <20230601092400.27162-1-zhiyin.chen@intel.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
576215cffd |
fs: Drop wait_unfrozen wait queue
wait_unfrozen waitqueue is used only in quota code to wait for filesystem to become unfrozen. In that place we can just use sb_start_write() - sb_end_write() pair to achieve the same. So just remove the waitqueue. Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230525141710.7595-1-jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz> |
||
|
|
ac2263b588 |
Revert "module: error out early on concurrent load of the same module file"
This reverts commit
|
||
|
|
9828ed3f69 |
module: error out early on concurrent load of the same module file
It turns out that udev under certain circumstances will concurrently try
to load the same modules over-and-over excessively. This isn't a kernel
bug, but it ends up affecting the kernel, to the point that under
certain circumstances we can fail to boot, because the kernel uses a lot
of memory to read all the module data all at once.
Note that it isn't a memory leak, it's just basically a thundering herd
problem happening at bootup with a lot of CPUs, with the worst cases
then being pretty bad.
Admittedly the worst situations are somewhat contrived: lots and lots of
CPUs, not a lot of memory, and KASAN enabled to make it all slower and
as such (unintentionally) exacerbate the problem.
Luis explains: [1]
"My best assessment of the situation is that each CPU in udev ends up
triggering a load of duplicate set of modules, not just one, but *a
lot*. Not sure what heuristics udev uses to load a set of modules per
CPU."
Petr Pavlu chimes in: [2]
"My understanding is that udev workers are forked. An initial kmod
context is created by the main udevd process but no sharing happens
after the fork. It means that the mentioned memory pool logic doesn't
really kick in.
Multiple parallel load requests come from multiple udev workers, for
instance, each handling an udev event for one CPU device and making
the exactly same requests as all others are doing at the same time.
The optimization idea would be to recognize these duplicate requests
at the udevd/kmod level and converge them"
Note that module loading has tried to mitigate this issue before, see
for example commit
|
||
|
|
c6585011bc |
splice: Remove generic_file_splice_read()
Remove generic_file_splice_read() as it has been replaced with calls to filemap_splice_read() and copy_splice_read(). With this, ITER_PIPE is no longer used. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Steve French <smfrench@gmail.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-30-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
69df79a451 |
splice: Rename direct_splice_read() to copy_splice_read()
Rename direct_splice_read() to copy_splice_read() to better reflect as to what it does. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> cc: Steve French <sfrench@samba.org> cc: Jens Axboe <axboe@kernel.dk> cc: Al Viro <viro@zeniv.linux.org.uk> cc: linux-cifs@vger.kernel.org cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230522135018.2742245-4-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
f15afbd34d
|
fs: fix undefined behavior in bit shift for SB_NOUSER
Shifting signed 32-bit value by 31 bits is undefined, so changing
significant bit to unsigned. It was spotted by UBSAN.
So let's just fix this by using the BIT() helper for all SB_* flags.
Fixes:
|
||
|
|
bedf149527 |
New code for 6.4:
* Remove an unused symbol. * Add tracepoints for the directio code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZEKzuQAKCRBKO3ySh0YR pothAQD9sBm7//+vYXxQXPlmX09Jvxixnlwyth+PvUI2Al3mrgEA0h1ZSRhxBbxx QiIFXCQYckb9GTIcRd67lCp1j/Ie/g0= =vGbm -----END PGP SIGNATURE----- Merge tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap updates from Darrick Wong: "The only changes for this cycle are the addition of tracepoints to the iomap directio code so that Ritesh (who is working on porting ext2 to iomap) can observe the io flows more easily. Summary: - Remove an unused symbol - Add tracepoints for the directio code" * tag 'iomap-6.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Add DIO tracepoints iomap: Remove IOMAP_DIO_NOSYNC unused dio flag fs.h: Add TRACE_IOCB_STRINGS for use in trace points |
||
|
|
5b9a7bb72f |
for-6.4/io_uring-2023-04-21
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRCvawQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpiKTEACvp0jm3Lyhxb8RMsx5T6Ko0pFH3DIymiL4 xpoZmAUflOjD0c+99FwHRQqKKXuo3OelhW+YOm0N6OOAt6JMSGmKpZh0UNJx+Fgj wMwiQ0X3Y5SaAsr5ZpXM+G1BV7ajihsMpu8/a718ERB3U3cLDz2qJfnzJh+E5Ip5 pYB4vS3+/FAER2MYQ7IPeovch2wWYtxDPztOxNX6SORu3OvpWiz1GR/+8u0tqj50 ROq97Jwjh5Tl4zP356EUSj/Vkfdr2yb+NlLbun8My5x8tYftZjnrNQ/+qeJNLwB8 tWTrg166ox/VX3aYZruAgUPv0IyGPZg7qZV5R72ChBK3VhIbOOLOCm/V6dhvl/XH vu2FG7J8WylWHmc+OU8u7TeSJdrwxTLs4e2IFUBK9ymAYFp0Q9S924fgvSYsFvVB iNn58SPRIbuA4SPtRfCd7pENtZW/QKfBC5CYK+pjsZVX40c9dbe40foVu4t2/EAo gi9+gSWEUVRRW2osxjaHXh78cW63g0j9bNfS6n1Vy32Oo5Mwm7n+bVWqCU5bCBXI MpPOk6AgME3UPwFzGzSmx+PVw8VacPxYP1NF8RFTCwj7OowFnrolJtruDmKJgXWY BN41EDo41k/C5mEu16Jr9rAkHeVhHaNZ+JhyDrzv8llJ/rv+4zEJw9SrhnpufmOX +YERd/ndAw== =Erfk -----END PGP SIGNATURE----- Merge tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Cleanup of the io-wq per-node mapping, notably getting rid of it so we just have a single io_wq entry per ring (Breno) - Followup to the above, move accounting to io_wq as well and completely drop struct io_wqe (Gabriel) - Enable KASAN for the internal io_uring caches (Breno) - Add support for multishot timeouts. Some applications use timeouts to wake someone waiting on completion entries, and this makes it a bit easier to just have a recurring timer rather than needing to rearm it every time (David) - Support archs that have shared cache coloring between userspace and the kernel, and hence have strict address requirements for mmap'ing the ring into userspace. This should only be parisc/hppa. (Helge, me) - XFS has supported O_DIRECT writes without needing to lock the inode exclusively for a long time, and ext4 now supports it as well. This is true for the common cases of not extending the file size. Flag the fs as having that feature, and utilize that to avoid serializing those writes in io_uring (me) - Enable completion batching for uring commands (me) - Revert patch adding io_uring restriction to what can be GUP mapped or not. This does not belong in io_uring, as io_uring isn't really special in this regard. Since this is also getting in the way of cleanups and improvements to the GUP code, get rid of if (me) - A few series greatly reducing the complexity of registered resources, like buffers or files. Not only does this clean up the code a lot, the simplified code is also a LOT more efficient (Pavel) - Series optimizing how we wait for events and run task_work related to it (Pavel) - Fixes for file/buffer unregistration with DEFER_TASKRUN (Pavel) - Misc cleanups and improvements (Pavel, me) * tag 'for-6.4/io_uring-2023-04-21' of git://git.kernel.dk/linux: (71 commits) Revert "io_uring/rsrc: disallow multi-source reg buffers" io_uring: add support for multishot timeouts io_uring/rsrc: disassociate nodes and rsrc_data io_uring/rsrc: devirtualise rsrc put callbacks io_uring/rsrc: pass node to io_rsrc_put_work() io_uring/rsrc: inline io_rsrc_put_work() io_uring/rsrc: add empty flag in rsrc_node io_uring/rsrc: merge nodes and io_rsrc_put io_uring/rsrc: infer node from ctx on io_queue_rsrc_removal io_uring/rsrc: remove unused io_rsrc_node::llist io_uring/rsrc: refactor io_queue_rsrc_removal io_uring/rsrc: simplify single file node switching io_uring/rsrc: clean up __io_sqe_buffers_update() io_uring/rsrc: inline switch_start fast path io_uring/rsrc: remove rsrc_data refs io_uring/rsrc: fix DEFER_TASKRUN rsrc quiesce io_uring/rsrc: use wq for quiescing io_uring/rsrc: refactor io_rsrc_ref_quiesce io_uring/rsrc: remove io_rsrc_node::done io_uring/rsrc: use nospec'ed indexes ... |
||
|
|
11b32219cb |
legacy direct-io cleanup
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZEYDPAAKCRBZ7Krx/gZQ 65qvAQC60ak71gAqOSktR93GXYrQ5Q8hbpug7t3lzDQE9huGqgEAl0Zvr8d5ir3j y0X5U5Yl6bcUSQDd4VY76C+53yIKZQs= =rKXE -----END PGP SIGNATURE----- Merge tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull legacy dio cleanup from Al Viro. * tag 'pull-old-dio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: __blockdev_direct_IO(): get rid of submit_io callback |
||
|
|
f6c73a1113 |
fs.h: Add TRACE_IOCB_STRINGS for use in trace points
Add TRACE_IOCB_STRINGS macro which can be used in the trace point patch to print different flag values with meaningful string output. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: line up strings all prettylike] Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
||
|
|
d8aeb44a9a |
fs: add FMODE_DIO_PARALLEL_WRITE flag
Some filesystems support multiple threads writing to the same file with O_DIRECT without requiring exclusive access to it. io_uring can use this hint to avoid serializing dio writes to this inode, instead allowing them to run in parallel. XFS and ext4 both fall into this category, so set the flag for both of them. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
4f704d9a83
|
nfs: use vfs setgid helper
We've aligned setgid behavior over multiple kernel releases. The details can be found in the following two merge messages: |
||
|
|
0aaf08de84 |
__blockdev_direct_IO(): get rid of submit_io callback
always NULL... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
3822a7c409 |
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
|
||
|
|
307e14c039 |
46 fs/cifs (smb3 client) changesets, 37 in fs/cifs and 9 for related helper functions and cleanup outside from Dave Howells and Willy
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmP2kaAACgkQiiy9cAdy
T1Eergv9FHVs7hS0anJF0xgRghR4+g0m5UUo08iJazgJdDgcS5JY+ZasIpYpEsG3
QmsIT33XVYZypXoOzjMSsPlwo6esTCJQScVLz85e4ebedCbCBDks+wVQcbfTzD5/
KrwmUoTBLU0L/ppFhqRk9k53nrSf1SXCWPthjdfWa3mTHdIVM4kQJruTWwUDiJXp
mdYwTx6FnTNer3QWetNzYOwdUgLu3rk0zLcBwQNCo6g5LOpA44iFfEAO4zeiOuZT
LMDPbDj0nWQyWPLLdcbtsn2laYyEBDBLZevLirSaqPQ/KCtGcw0mBt6dCAzg8/CM
ONqHHxdEpvPON8Sxujcn4CxpXhl0nCLwwtKtWU4rt7IevI9U+PynNl57TtJJ16/s
b3XD2QVbFjlcdAMTmArvqnogdzoC3mZu1R1IRs+jukhLAOqZiLN6o/E2HAllt47i
krzXeXIzQr10w9fnJ7LtIc/7IUFgtUfrOkg4TKyNcnRVHQaSSxv+JLRgqMPOr/M0
I7zt0G0j
=4hIT
-----END PGP SIGNATURE-----
Merge tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs client updates from Steve French:
"The largest subset of this is from David Howells et al: making the
cifs/smb3 driver pass iov_iters down to the lowest layers, directly to
the network transport rather than passing lists of pages around,
helping multiple areas:
- Pin user pages, thereby fixing the race between concurrent DIO read
and fork, where the pages containing the DIO read buffer may end up
belonging to the child process and not the parent - with the result
that the parent might not see the retrieved data.
- cifs shouldn't take refs on pages extracted from non-user-backed
iterators (eg. KVEC). With these changes, cifs will apply the
appropriate cleanup.
- Making it easier to transition to using folios in cifs rather than
pages by dealing with them through BVEC and XARRAY iterators.
- Allowing cifs to use the new splice function
The remainder are:
- fixes for stable, including various fixes for uninitialized memory,
wrong length field causing mount issue to very old servers,
important directory lease fixes and reconnect fixes
- cleanups (unused code removal, change one element array usage, and
a change form strtobool to kstrtobool, and Kconfig cleanups)
- SMBDIRECT (RDMA) fixes including iov_iter integration and UAF fixes
- reconnect fixes
- multichannel fixes, including improving channel allocation (to
least used channel)
- remove the last use of lock_page_killable by moving to
folio_lock_killable"
* tag '6.3-rc-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: (46 commits)
update internal module version number for cifs.ko
cifs: update ip_addr for ses only for primary chan setup
cifs: use tcon allocation functions even for dummy tcon
cifs: use the least loaded channel for sending requests
cifs: DIO to/from KVEC-type iterators should now work
cifs: Remove unused code
cifs: Build the RDMA SGE list directly from an iterator
cifs: Change the I/O paths to use an iterator rather than a page list
cifs: Add a function to read into an iter from a socket
cifs: Add some helper functions
cifs: Add a function to Hash the contents of an iterator
cifs: Add a function to build an RDMA SGE list from an iterator
netfs: Add a function to extract an iterator into a scatterlist
netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
splice: Export filemap/direct_splice_read()
iov_iter: Add a function to extract a page list from an iterator
iov_iter: Define flags to qualify page extraction.
splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
splice: Add a func to do a splice from a buffered file without ITER_PIPE
...
|
||
|
|
33b3b04154 |
splice: Add a func to do a splice from an O_DIRECT file without ITER_PIPE
Implement a function, direct_file_splice(), that deals with this by using an ITER_BVEC iterator instead of an ITER_PIPE iterator as the former won't free its buffers when reverted. The function bulk allocates all the buffers it thinks it is going to use in advance, does the read synchronously and only then trims the buffer down. The pages we did use get pushed into the pipe. This fixes a problem with the upcoming iov_iter_extract_pages() function, whereby pages extracted from a non-user-backed iterator such as ITER_PIPE aren't pinned. __iomap_dio_rw(), however, calls iov_iter_revert() to shorten the iterator to just the bufferage it is going to use - which has the side-effect of freeing the excess pipe buffers, even though they're attached to a bio and may get written to by DMA (thanks to Hillf Danton for spotting this[1]). This then causes memory corruption that is particularly noticeable when the syzbot test[2] is run. The test boils down to: out = creat(argv[1], 0666); ftruncate(out, 0x800); lseek(out, 0x200, SEEK_SET); in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW); sendfile(out, in, NULL, 0x1dd00); run repeatedly in parallel. What I think is happening is that ftruncate() occasionally shortens the DIO read that's about to be made by sendfile's splice core by reducing i_size. This should be more efficient for DIO read by virtue of doing a bulk page allocation, but slightly less efficient by ignoring any partial page in the pipe. Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1] Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2] Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
07073eb01c |
splice: Add a func to do a splice from a buffered file without ITER_PIPE
Provide a function to do splice read from a buffered file, pulling the folios out of the pagecache directly by calling filemap_get_pages() to do any required reading and then pasting the returned folios into the pipe. A helper function is provided to do the actual folio pasting and will handle multipage folios by splicing as many of the relevant subpages as will fit into the pipe. The code is loosely based on filemap_read() and might belong in mm/filemap.c with that as it needs to use filemap_get_pages(). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: David Hildenbrand <david@redhat.com> cc: John Hubbard <jhubbard@nvidia.com> cc: linux-mm@kvack.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> |
||
|
|
05e6295f7b |
fs.idmapped.v6.3
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY+5NlQAKCRCRxhvAZXjc
orOaAP9i2h3OJy95nO2Fpde0Bt2UT+oulKCCcGlvXJ8/+TQpyQD/ZQq47gFQ0EAz
Br5NxeyGeecAb0lHpFz+CpLGsxMrMwQ=
=+BG5
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in
|
||
|
|
4d7ca40901
|
fs: port vfs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
c14329d39f
|
fs: port fs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
e67fe63341
|
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap.
Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in
|
||
|
|
0dbe12f2e4
|
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
9452e93e6d
|
fs: port privilege checking helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
01beba7957
|
fs: port inode_owner_or_capable() to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
f2d40141d5
|
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
4609e1f18e
|
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
8782a9aea3
|
fs: port ->fileattr_set() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
13e83a4923
|
fs: port ->set_acl() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
7743532277
|
fs: port ->get_acl() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
011e2b717b
|
fs: port ->tmpfile() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
e18275ae55
|
fs: port ->rename() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
5ebb29bee8
|
fs: port ->mknod() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
c54bd91e9e
|
fs: port ->mkdir() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
7a77db9551
|
fs: port ->symlink() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
6c960e68aa
|
fs: port ->create() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
b74d24f7a7
|
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
c1632a0f11
|
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
17e810229c |
mm: support POSIX_FADV_NOREUSE
This patch adds POSIX_FADV_NOREUSE to vma_has_recency() so that the LRU
algorithm can ignore access to mapped files marked by this flag.
The advantages of POSIX_FADV_NOREUSE are:
1. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not alter the
default readahead behavior.
2. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not split VMAs and
therefore does not take mmap_lock.
3. Unlike MADV_COLD, setting it has a negligible cost, regardless of
how many pages it affects.
Its limitations are:
1. Like POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL, it currently does
not support range. IOW, its scope is the entire file.
2. It currently does not ignore access through file descriptors.
Specifically, for the active/inactive LRU, given a file page shared
by two users and one of them having set POSIX_FADV_NOREUSE on the
file, this page will be activated upon the second user accessing
it. This corner case can be covered by checking POSIX_FADV_NOREUSE
before calling folio_mark_accessed() on the read path. But it is
considered not worth the effort.
There have been a few attempts to support POSIX_FADV_NOREUSE, e.g., [1].
This time the goal is to fill a niche: a few desktop applications, e.g.,
large file transferring and video encoding/decoding, want fast file
streaming with mmap() rather than direct IO. Among those applications, an
SVT-AV1 regression was reported when running with MGLRU [2]. The
following test can reproduce that regression.
kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
kb=$((kb - 8*1024*1024))
modprobe brd rd_nr=1 rd_size=$kb
dd if=/dev/zero of=/dev/ram0 bs=1M
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt/
swapoff -a
fallocate -l 8G /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile
wget http://ultravideo.cs.tut.fi/video/Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z
7z e -o/mnt/ Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z
SvtAv1EncApp --preset 12 -w 3840 -h 2160 \
-i /mnt/Bosphorus_3840x2160.y4m
For MGLRU, the following change showed a [9-11]% increase in FPS,
which makes it on par with the active/inactive LRU.
patch Source/App/EncApp/EbAppMain.c <<EOF
31a32
> #include <fcntl.h>
35d35
< #include <fcntl.h> /* _O_BINARY */
117a118
> posix_fadvise(config->mmap.fd, 0, 0, POSIX_FADV_NOREUSE);
EOF
[1] https://lore.kernel.org/r/1308923350-7932-1-git-send-email-andrea@betterlinux.com/
[2] https://openbenchmarking.org/result/2209259-PTS-MGLRU8GB57
Link: https://lkml.kernel.org/r/20221230215252.2628425-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Righi <andrea.righi@canonical.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
abf08576af
|
fs: port vfs_*() helpers to struct mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
||
|
|
5970e15dbc |
filelock: move file locking definitions to separate header file
The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it. Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Steve French <stfrench@microsoft.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> |
||
|
|
8e1858710d |
ceph: avoid use-after-free in ceph_fl_release_lock()
When ceph releasing the file_lock it will try to get the inode pointer from the fl->fl_file, which the memory could already be released by another thread in filp_close(). Because in VFS layer the fl->fl_file doesn't increase the file's reference counter. Will switch to use ceph dedicate lock info to track the inode. And in ceph_fl_release_lock() we should skip all the operations if the fl->fl_u.ceph.inode is not set, which should come from the request file_lock. And we will set fl->fl_u.ceph.inode when inserting it to the inode lock list, which is when copying the lock. Link: https://tracker.ceph.com/issues/57986 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> |
||
|
|
9b93f5069f |
fs.idmapped.mnt_idmap.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5by6AAKCRCRxhvAZXjc
onblAPsFzodV8/9UoCIkKxwn0aiclbiAITTWI9ZLulmKhm0I6wD/RUOLKjt12uZJ
m81UTfkWHopWKtQ+X3saZEcyYTNLugE=
=AtGb
-----END PGP SIGNATURE-----
Merge tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping updates from Christian Brauner:
"Last cycle we've already made the interaction with idmapped mounts
more robust and type safe by introducing the vfs{g,u}id_t type. This
cycle we concluded the conversion and removed the legacy helpers.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem - with
namespaces that are relevent on the mount level. Especially for
filesystem developers without detailed knowledge in this area this can
be a potential source for bugs.
Instead of passing the plain namespace we introduce a dedicated type
struct mnt_idmap and replace the pointer with a pointer to a struct
mnt_idmap. There are no semantic or size changes for the mount struct
caused by this.
We then start converting all places aware of idmapped mounts to rely
on struct mnt_idmap. Once the conversion is done all helpers down to
the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take
a struct mnt_idmap argument instead of two namespace arguments. This
way it becomes impossible to conflate the two removing and thus
eliminating the possibility of any bugs. Fwiw, I fixed some issues in
that area a while ago in ntfs3 and ksmbd in the past. Afterwards only
low-level code can ultimately use the associated namespace for any
permission checks. Even most of the vfs can be completely obivious
about this ultimately and filesystems will never interact with it in
any form in the future.
A struct mnt_idmap currently encompasses a simple refcount and pointer
to the relevant namespace the mount is idmapped to. If a mount isn't
idmapped then it will point to a static nop_mnt_idmap and if it
doesn't that it is idmapped. As usual there are no allocations or
anything happening for non-idmapped mounts. Everthing is carefully
written to be a nop for non-idmapped mounts as has always been the
case.
If an idmapped mount is created a struct mnt_idmap is allocated and a
reference taken on the relevant namespace. Each mount that gets
idmapped or inherits the idmap simply bumps the reference count on
struct mnt_idmap. Just a reminder that we only allow a mount to change
it's idmapping a single time and only if it hasn't already been
attached to the filesystems and has no active writers.
The actual changes are fairly straightforward but this will have huge
benefits for maintenance and security in the long run even if it
causes some churn.
Note that this also makes it possible to extend struct mount_idmap in
the future. For example, it would be possible to place the namespace
pointer in an anonymous union together with an idmapping struct. This
would allow us to expose an api to userspace that would let it specify
idmappings directly instead of having to go through the detour of
setting up namespaces at all"
* tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
acl: conver higher-level helpers to rely on mnt_idmap
fs: introduce dedicated idmap type for mounts
|
||
|
|
e1212e9b6f |
fs.vfsuid.conversion.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bspgAKCRCRxhvAZXjc
opEWAQDpF5rnZn1vv4/uOTij9ztcA4yLxu/Q19CdqBaoHlWZ9AD/d3eecee3bh5h
iPHtlUK5/VspfD9LPpdc5ZbPCdZ2pA4=
=t6NN
-----END PGP SIGNATURE-----
Merge tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfsuid updates from Christian Brauner:
"Last cycle we introduced the vfs{g,u}id_t types and associated helpers
to gain type safety when dealing with idmapped mounts. That initial
work already converted a lot of places over but there were still some
left,
This converts all remaining places that still make use of non-type
safe idmapping helpers to rely on the new type safe vfs{g,u}id based
helpers.
Afterwards it removes all the old non-type safe helpers"
* tag 'fs.vfsuid.conversion.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: remove unused idmapping helpers
ovl: port to vfs{g,u}id_t and associated helpers
fuse: port to vfs{g,u}id_t and associated helpers
ima: use type safe idmapping helpers
apparmor: use type safe idmapping helpers
caps: use type safe idmapping helpers
fs: use type safe idmapping helpers
mnt_idmapping: add missing helpers
|
||
|
|
cf619f8919 |
fs.ovl.setgid.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bt7AAKCRCRxhvAZXjc
ovAOAP9qcrUqs2MoyBDe6qUXThYY9w2rgX/ZI4ZZmbtsXEDGtQEA/LPddq8lD8o9
m17zpvMGbXXRwz4/zVGuyWsHgg0HsQ0=
=ioRq
-----END PGP SIGNATURE-----
Merge tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull setgid inheritance updates from Christian Brauner:
"This contains the work to make setgid inheritance consistent between
modifying a file and when changing ownership or mode as this has been
a repeated source of very subtle bugs. The gist is that we perform the
same permission checks in the write path as we do in the ownership and
mode changing paths after this series where we're currently doing
different things.
We've already made setgid inheritance a lot more consistent and
reliable in the last releases by moving setgid stripping from the
individual filesystems up into the vfs. This aims to make the logic
even more consistent and easier to understand and also to fix
long-standing overlayfs setgid inheritance bugs. Miklos was nice
enough to just let me carry the trivial overlayfs patches from Amir
too.
Below is a more detailed explanation how the current difference in
setgid handling lead to very subtle bugs exemplified via overlayfs
which is a victim of the current rules. I hope this explains why I
think taking the regression risk here is worth it.
A long while ago I found a few setgid inheritance bugs in overlayfs in
the write path in certain conditions. Amir recently picked this back
up in [1] and I jumped on board to fix this more generally.
On the surface all that overlayfs would need to fix setgid inheritance
would be to call file_remove_privs() or file_modified() but actually
that isn't enough because the setgid inheritance api is wildly
inconsistent in that area.
Before this pr setgid stripping in file_remove_privs()'s old
should_remove_suid() helper was inconsistent with other parts of the
vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is
S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups
and the caller isn't privileged over the inode although we require
this already in setattr_prepare() and setattr_copy() and so all
filesystem implement this requirement implicitly because they have to
use setattr_{prepare,copy}() anyway.
But the inconsistency shows up in setgid stripping bugs for overlayfs
in xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping
works correctly when performing various write-like operations as an
unprivileged user (fallocate, reflink, write, etc.):
echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k
The test basically creates a file with 6666 permissions. While the
file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP
set.
On a regular filesystem like xfs what will happen is:
sys_fallocate()
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> setattr_copy()
In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.
But we also see that ATTR_KILL_SGID won't be set because while the
file is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.
So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE.
Now notify_change() sees that ATTR_KILL_SUID is set and does:
ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);
which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.
Now we call into the filesystem's ->setattr() inode operation which
will end up calling setattr_copy(). Since ATTR_MODE is set we will
hit:
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
if (!vfsgid_in_group_p(vfsgid) &&
!capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
mode &= ~S_ISGID;
inode->i_mode = mode;
}
and since the caller in the test is neither capable nor in the group
of the inode the S_ISGID bit is stripped.
But assume the file isn't suid then ATTR_KILL_SUID won't be raised
which has the consequence that neither the setgid nor the suid bits
are stripped even though it should be stripped because the inode isn't
in the caller's groups and the caller isn't privileged over the inode.
If overlayfs is in the mix things become a bit more complicated and
the bug shows up more clearly.
When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to
file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be
raised but because the check in notify_change() is questioning the
ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped
the S_ISGID bit isn't removed even though it should be stripped:
sys_fallocate()
-> vfs_fallocate()
-> ovl_fallocate()
-> file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> ovl_setattr()
/* TAKE ON MOUNTER'S CREDS */
-> ovl_do_notify_change()
-> notify_change()
/* GIVE UP MOUNTER'S CREDS */
/* TAKE ON MOUNTER'S CREDS */
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = attr_force | kill;
-> notify_change()
The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have
notify_change() not pointlessly requiring S_IXGRP again. It doesn't
make any sense in the first place because the caller must calculate
the flags via should_remove_suid() anyway which would raise
ATTR_KILL_SGID
Note that some xfstests will now fail as these patches will cause the
setgid bit to be lost in certain conditions for unprivileged users
modifying a setgid file when they would've been kept otherwise. I
think this risk is worth taking and I explained and mentioned this
multiple times on the list [2].
Enforcing the rules consistently across write operations and
chmod/chown will lead to losing the setgid bit in cases were it
might've been retained before.
While I've mentioned this a few times but it's worth repeating just to
make sure that this is understood. For the sake of maintainability,
consistency, and security this is a risk worth taking.
If we really see regressions for workloads the fix is to have special
setgid handling in the write path again with different semantics from
chmod/chown and possibly additional duct tape for overlayfs. I'll
update the relevant xfstests with if you should decide to merge this
second setgid cleanup.
Before that people should be aware that there might be failures for
fstests where unprivileged users modify a setgid file"
Link: https://lore.kernel.org/linux-fsdevel/20221003123040.900827-1-amir73il@gmail.com [1]
Link: https://lore.kernel.org/linux-fsdevel/20221122142010.zchf2jz2oymx55qi@wittgenstein [2]
* tag 'fs.ovl.setgid.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
fs: use consistent setgid checks in is_sxid()
ovl: remove privs in ovl_fallocate()
ovl: remove privs in ovl_copyfile()
attr: use consistent sgid stripping checks
attr: add setattr_should_drop_sgid()
fs: move should_remove_suid()
attr: add in_group_or_capable()
|
||
|
|
6a518afcc2 |
fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
=27Wm
-----END PGP SIGNATURE-----
Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull VFS acl updates from Christian Brauner:
"This contains the work that builds a dedicated vfs posix acl api.
The origins of this work trace back to v5.19 but it took quite a while
to understand the various filesystem specific implementations in
sufficient detail and also come up with an acceptable solution.
As we discussed and seen multiple times the current state of how posix
acls are handled isn't nice and comes with a lot of problems: The
current way of handling posix acls via the generic xattr api is error
prone, hard to maintain, and type unsafe for the vfs until we call
into the filesystem's dedicated get and set inode operations.
It is already the case that posix acls are special-cased to death all
the way through the vfs. There are an uncounted number of hacks that
operate on the uapi posix acl struct instead of the dedicated vfs
struct posix_acl. And the vfs must be involved in order to interpret
and fixup posix acls before storing them to the backing store, caching
them, reporting them to userspace, or for permission checking.
Currently a range of hacks and duct tape exist to make this work. As
with most things this is really no ones fault it's just something that
happened over time. But the code is hard to understand and difficult
to maintain and one is constantly at risk of introducing bugs and
regressions when having to touch it.
Instead of continuing to hack posix acls through the xattr handlers
this series builds a dedicated posix acl api solely around the get and
set inode operations.
Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
helpers must be used in order to interact with posix acls. They
operate directly on the vfs internal struct posix_acl instead of
abusing the uapi posix acl struct as we currently do. In the end this
removes all of the hackiness, makes the codepaths easier to maintain,
and gets us type safety.
This series passes the LTP and xfstests suites without any
regressions. For xfstests the following combinations were tested:
- xfs
- ext4
- btrfs
- overlayfs
- overlayfs on top of idmapped mounts
- orangefs
- (limited) cifs
There's more simplifications for posix acls that we can make in the
future if the basic api has made it.
A few implementation details:
- The series makes sure to retain exactly the same security and
integrity module permission checks. Especially for the integrity
modules this api is a win because right now they convert the uapi
posix acl struct passed to them via a void pointer into the vfs
struct posix_acl format to perform permission checking on the mode.
There's a new dedicated security hook for setting posix acls which
passes the vfs struct posix_acl not a void pointer. Basing checking
on the posix acl stored in the uapi format is really unreliable.
The vfs currently hacks around directly in the uapi struct storing
values that frankly the security and integrity modules can't
correctly interpret as evidenced by bugs we reported and fixed in
this area. It's not necessarily even their fault it's just that the
format we provide to them is sub optimal.
- Some filesystems like 9p and cifs need access to the dentry in
order to get and set posix acls which is why they either only
partially or not even at all implement get and set inode
operations. For example, cifs allows setxattr() and getxattr()
operations but doesn't allow permission checking based on posix
acls because it can't implement a get acl inode operation.
Thus, this patch series updates the set acl inode operation to take
a dentry instead of an inode argument. However, for the get acl
inode operation we can't do this as the old get acl method is
called in e.g., generic_permission() and inode_permission(). These
helpers in turn are called in various filesystem's permission inode
operation. So passing a dentry argument to the old get acl inode
operation would amount to passing a dentry to the permission inode
operation which we shouldn't and probably can't do.
So instead of extending the existing inode operation Christoph
suggested to add a new one. He also requested to ensure that the
get and set acl inode operation taking a dentry are consistently
named. So for this version the old get acl operation is renamed to
->get_inode_acl() and a new ->get_acl() inode operation taking a
dentry is added. With this we can give both 9p and cifs get and set
acl inode operations and in turn remove their complex custom posix
xattr handlers.
In the future I hope to get rid of the inode method duplication but
it isn't like we have never had this situation. Readdir is just one
example. And frankly, the overall gain in type safety and the more
pleasant api wise are simply too big of a benefit to not accept
this duplication for a while.
- We've done a full audit of every codepaths using variant of the
current generic xattr api to get and set posix acls and
surprisingly it isn't that many places. There's of course always a
chance that we might have missed some and if so I'm sure we'll find
them soon enough.
The crucial codepaths to be converted are obviously stacking
filesystems such as ecryptfs and overlayfs.
For a list of all callers currently using generic xattr api helpers
see [2] including comments whether they support posix acls or not.
- The old vfs generic posix acl infrastructure doesn't obey the
create and replace semantics promised on the setxattr(2) manpage.
This patch series doesn't address this. It really is something we
should revisit later though.
The patches are roughly organized as follows:
(1) Change existing set acl inode operation to take a dentry
argument (Intended to be a non-functional change)
(2) Rename existing get acl method (Intended to be a non-functional
change)
(3) Implement get and set acl inode operations for filesystems that
couldn't implement one before because of the missing dentry.
That's mostly 9p and cifs (Intended to be a non-functional
change)
(4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
and vfs_set_acl() including security and integrity hooks
(Intended to be a non-functional change)
(5) Implement get and set acl inode operations for stacking
filesystems (Intended to be a non-functional change)
(6) Switch posix acl handling in stacking filesystems to new posix
acl api now that all filesystems it can stack upon support it.
(7) Switch vfs to new posix acl api (semantical change)
(8) Remove all now unused helpers
(9) Additional regression fixes reported after we merged this into
linux-next
Thanks to Seth for a lot of good discussion around this and
encouragement and input from Christoph"
* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
posix_acl: Fix the type of sentinel in get_acl
orangefs: fix mode handling
ovl: call posix_acl_release() after error checking
evm: remove dead code in evm_inode_set_acl()
cifs: check whether acl is valid early
acl: make vfs_posix_acl_to_xattr() static
acl: remove a slew of now unused helpers
9p: use stub posix acl handlers
cifs: use stub posix acl handlers
ovl: use stub posix acl handlers
ecryptfs: use stub posix acl handlers
evm: remove evm_xattr_acl_change()
xattr: use posix acl api
ovl: use posix acl api
ovl: implement set acl method
ovl: implement get acl method
ecryptfs: implement set acl method
ecryptfs: implement get acl method
ksmbd: use vfs_remove_acl()
acl: add vfs_remove_acl()
...
|
||
|
|
bd90741318 |
misc pile
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzrwAKCRBZ7Krx/gZQ 6+WrAP9QltAQopxexxpRxTdA3yq7Fy9ZakkS7b1udhRHgRA8GgEA7ZcrqX8IsyDW hLW4cQPVUkJD7MCR8P7lw5sLaararAg= =TchO -----END PGP SIGNATURE----- Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "misc pile" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: sysv: Fix sysv_nblocks() returns wrong value get rid of INT_LIMIT, use type_max() instead btrfs: replace INT_LIMIT(loff_t) with OFFSET_MAX fs: simplify vfs_get_super fs: drop useless condition from inode_needs_update_time |
||
|
|
8702f2c611 |
Non-MM patches for 6.2-rc1.
- A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line. - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files. - A series from Yang Yingliang to address rapido memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t(). - And a whole shower of singleton patches all over the place. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5efRgAKCRDdBJ7gKXxA jgvdAP0al6oFDtaSsshIdNhrzcMwfjt6PfVxxHdLmNhF1hX2dwD/SVluS1bPSP7y 0sZp7Ustu3YTb8aFkMl96Y9m9mY1Nwg= =ga5B -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - A ptrace API cleanup series from Sergey Shtylyov - Fixes and cleanups for kexec from ye xingchen - nilfs2 updates from Ryusuke Konishi - squashfs feature work from Xiaoming Ni: permit configuration of the filesystem's compression concurrency from the mount command line - A series from Akinobu Mita which addresses bound checking errors when writing to debugfs files - A series from Yang Yingliang to address rapidio memory leaks - A series from Zheng Yejian to address possible overflow errors in encode_comp_t() - And a whole shower of singleton patches all over the place * tag 'mm-nonmm-stable-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (79 commits) ipc: fix memory leak in init_mqueue_fs() hfsplus: fix bug causing custom uid and gid being unable to be assigned with mount rapidio: devices: fix missing put_device in mport_cdev_open kcov: fix spelling typos in comments hfs: Fix OOB Write in hfs_asc2mac hfs: fix OOB Read in __hfs_brec_find relay: fix type mismatch when allocating memory in relay_create_buf() ocfs2: always read both high and low parts of dinode link count io-mapping: move some code within the include guarded section kernel: kcsan: kcsan_test: build without structleak plugin mailmap: update email for Iskren Chernev eventfd: change int to __u64 in eventfd_signal() ifndef CONFIG_EVENTFD rapidio: fix possible UAF when kfifo_alloc() fails relay: use strscpy() is more robust and safer cpumask: limit visibility of FORCE_NR_CPUS acct: fix potential integer overflow in encode_comp_t() acct: fix accuracy loss for input value of encode_comp_t() linux/init.h: include <linux/build_bug.h> and <linux/stringify.h> rapidio: rio: fix possible name leak in rio_register_mport() rapidio: fix possible name leaks when rio_add_device() fails ... |
||
|
|
73fa58dca8 |
File locking changes for v6.2.
-----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmOPOjwTHGpsYXl0b25A a2VybmVsLm9yZwAKCRAADmhBGVaCFZ/jEADDZ1RlXCwuozDzAXFzzsR+kmKJJfXG ff3ejXHhyJdYH8kh1IldTCR4RGblTH7dM/gO/ApJlSLbEglQm9AIjZ2lpVstqtzQ lnZir+bA6uzOyYMRVXJ+0oDZuv3Gca3W8IhFHCqD7K9oQQbn+c/ZmEWrvNJJXN1j Ogi1SXHUNfrFgSbgBKjc2VqewuiTc2I8tZAQyezYoGXKn6LtAgMJhQIS4eWjqjju 38aageni9doKPnAmMOq+vBcw2bWV5mYijz/pObfsaDlAgFdr9rKjNP5+F4fBply1 SDW2T1ge8jWYegq39EcDKxd/raSOET/p9vQu6rHniXKfvMQ6Ywbr7qji1a7yTZ+i MkuOToNZy/+TTEvFQm48Fa25tcKjjl/uuk5Ugojf/hSWOsNkW1Cy4S33eUzDZiSO wox5EFVhFpf8Q8L3dUQY0sZazCyoEftw+bq2cKGHJYfUhBD7u6yLG7EKqYiqpepX SSPxuh3GC65xl33hYJL2V+5cgXAV23kSGCdNqDUvYZgJfjhDjQnyoSTcuBjh67kv chmSoeUaIkS4yFqsH9kRINMSef2M5LXYbfxTnftokX0cvV6RqQndZl43X5LEBgQL GRIxyxPkkKaqFjkqyFzBD0dkVGyjyUmkioy/1xON3pLWz3Sk77U38pEQ7NeUl2Lc bK5uysBuvDnCpg== =XMv7 -----END PGP SIGNATURE----- Merge tag 'locks-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: "The main change here is to add the new locks_inode_context helper, and convert all of the places that dereference inode->i_flctx directly to use that instead. There is a new helper to indicate whether any locks are held on an inode. This is mostly for Ceph but may be usable elsewhere too. Andi Kleen requested that we print the PID when the LOCK_MAND warning fires, to help track down applications trying to use it. Finally, we added some new warnings to some of the file locking functions that fire when the ->fl_file and filp arguments differ. This helped us find some long-standing bugs in lockd. Patches for those are in Chuck Lever's tree and should be in his v6.2 PR. After that patch, people using NFSv2/v3 locking may see some warnings fire until those go in. Happy Holidays!" * tag 'locks-v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: Add process name and pid to locks warning nfsd: use locks_inode_context helper nfs: use locks_inode_context helper lockd: use locks_inode_context helper ksmbd: use locks_inode_context helper cifs: use locks_inode_context helper ceph: use locks_inode_context helper filelock: add a new locks_inode_context accessor function filelock: new helper: vfs_inode_has_locks filelock: WARN_ON_ONCE when ->fl_file and filp don't match |
||
|
|
2e41f274f9 |
libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value
Patch series "fix error when writing negative value to simple attribute files". The simple attribute files do not accept a negative value since the commit |
||
|
|
401a8b8fd5 |
filelock: add a new locks_inode_context accessor function
There are a number of places in the kernel that are accessing the inode->i_flctx field without smp_load_acquire. This is required to ensure that the caller doesn't see a partially-initialized structure. Add a new accessor function for it to make this clear and convert all of the relevant accesses in locks.c to use it. Also, convert locks_free_lock_context to use the helper as well instead of just doing a "bare" assignment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> |
||
|
|
ab1ddef98a |
filelock: new helper: vfs_inode_has_locks
Ceph has a need to know whether a particular inode has any locks set on it. It's currently tracking that by a num_locks field in its filp->private_data, but that's problematic as it tries to decrement this field when releasing locks and that can race with the file being torn down. Add a new vfs_inode_has_locks helper that just returns whether any locks are currently held on the inode. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> |
||
|
|
ea258f159d |
get rid of INT_LIMIT, use type_max() instead
INT_LIMIT() tries to do what type_max() does, except that type_max() doesn't rely upon undefined behaviour[*], might as well use type_max() instead. [*] if T is an N-bit signed integer type, the maximal value in T is pow(2, N - 1) - 1, all right, but naive expression for that value ends up with a couple of wraparounds and as usual for wraparounds in signed types, that's an undefined behaviour. type_max() takes care to avoid those... Caught-by: UBSAN Suggested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
10bc8e4af6 |
vfs: fix copy_file_range() averts filesystem freeze protection
Commit |
||
|
|
8d84e39d76
|
fs: use consistent setgid checks in is_sxid()
Now that we made the VFS setgid checking consistent an inode can't be marked security irrelevant even if the setgid bit is still set. Make this function consistent with all other helpers. Note that enforcing consistent setgid stripping checks for file modification and mode- and ownership changes will cause the setgid bit to be lost in more cases than useed to be the case. If an unprivileged user wrote to a non-executable setgid file that they don't have privilege over the setgid bit will be dropped. This will lead to temporary failures in some xfstests until they have been updated. Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
||
|
|
256c8aed2b
|
fs: introduce dedicated idmap type for mounts
Last cycle we've already made the interaction with idmapped mounts more
robust and type safe by introducing the vfs{g,u}id_t type. This cycle we
concluded the conversion and removed the legacy helpers.
Currently we still pass around the plain namespace that was attached to
a mount. This is in general pretty convenient but it makes it easy to
conflate filesystem and mount namespaces and what different roles they
have to play. Especially for filesystem developers without much
experience in this area this is an easy source for bugs.
Instead of passing the plain namespace we introduce a dedicated type
struct mnt_idmap and replace the pointer with a pointer to a struct
mnt_idmap. There are no semantic or size changes for the mount struct
caused by this.
We then start converting all places aware of idmapped mounts to rely on
struct mnt_idmap. Once the conversion is done all helpers down to the
really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a
struct mnt_idmap argument instead of two namespace arguments. This way
it becomes impossible to conflate the two, removing and thus eliminating
the possibility of any bugs. Fwiw, I fixed some issues in that area a
while ago in ntfs3 and ksmbd in the past. Afterwards, only low-level
code can ultimately use the associated namespace for any permission
checks. Even most of the vfs can be ultimately completely oblivious
about this and filesystems will never interact with it directly in any
form in the future.
A struct mnt_idmap currently encompasses a simple refcount and a pointer
to the relevant namespace the mount is idmapped to. If a mount isn't
idmapped then it will point to a static nop_mnt_idmap. If it is an
idmapped mount it will point to a new struct mnt_idmap. As usual there
are no allocations or anything happening for non-idmapped mounts.
Everthing is carefully written to be a nop for non-idmapped mounts as
has always been the case.
If an idmapped mount or mount tree is created a new struct mnt_idmap is
allocated and a reference taken on the relevant namespace. For each
mount in a mount tree that gets idmapped or a mount that inherits the
idmap when it is cloned the reference count on the associated struct
mnt_idmap is bumped. Just a reminder that we only allow a mount to
change it's idmapping a single time and only if it hasn't already been
attached to the filesystems and has no active writers.
The actual changes are fairly straightforward. This will have huge
benefits for maintenance and security in the long run even if it causes
some churn. I'm aware that there's some cost for all of you. And I'll
commit to doing this work and make this as painless as I can.
Note that this also makes it possible to extend struct mount_idmap in
the future. For example, it would be possible to place the namespace
pointer in an anonymous union together with an idmapping struct. This
would allow us to expose an api to userspace that would let it specify
idmappings directly instead of having to go through the detour of
setting up namespaces at all.
This just adds the infrastructure and doesn't do any conversions.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
e4236f9768
|
Merge branch 'fs.vfsuid.conversion' into for-next | ||
|
|
eb7718cdb7
|
fs: remove unused idmapping helpers
Now that all places can deal with the new type safe helpers remove all of the old helpers. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
||
|
|
03fd1402bd
|
Merge branch 'fs.acl.rework' into for-next | ||
|
|
7420332a6f
|
fs: add new get acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on the old get acl inode operation to retrieve posix acl and need to implement their own custom handlers because of that. In a previous patch we renamed the old get acl inode operation to ->get_inode_acl(). We decided to rename it and implement a new one since ->get_inode_acl() is called generic_permission() and inode_permission() both of which can be called during an filesystem's ->permission() handler. So simply passing a dentry argument to ->get_acl() would have amounted to also having to pass a dentry argument to ->permission(). We avoided that change. This adds a new ->get_acl() inode operations which takes a dentry argument which filesystems such as 9p, cifs, and overlayfs can implement to get posix acls. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
||
|
|
cac2f8b8d8
|
fs: rename current get acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
||
|
|
138060ba92
|
fs: pass dentry to set acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. Since some filesystem rely on the dentry being available to them when setting posix acls (e.g., 9p and cifs) they cannot rely on set acl inode operation. But since ->set_acl() is required in order to use the generic posix acl xattr handlers filesystems that do not implement this inode operation cannot use the handler and need to implement their own dedicated posix acl handlers. Update the ->set_acl() inode method to take a dentry argument. This allows all filesystems to rely on ->set_acl(). As far as I can tell all codepaths can be switched to rely on the dentry instead of just the inode. Note that the original motivation for passing the dentry separate from the inode instead of just the dentry in the xattr handlers was because of security modules that call security_d_instantiate(). This hook is called during d_instantiate_new(), d_add(), __d_instantiate_anon(), and d_splice_alias() to initialize the inode's security context and possibly to set security.* xattrs. Since this only affects security.* xattrs this is completely irrelevant for posix acls. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
||
|
|
ed5a7047d2
|
attr: use consistent sgid stripping checks
Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.
But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):
echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k
The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:
sys_fallocate()
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> setattr_copy()
In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.
But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.
So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:
ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);
which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.
Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:
if (ia_valid & ATTR_MODE) {
umode_t mode = attr->ia_mode;
vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
if (!vfsgid_in_group_p(vfsgid) &&
!capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
mode &= ~S_ISGID;
inode->i_mode = mode;
}
and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.
But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.
If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:
sys_fallocate()
-> vfs_fallocate()
-> ovl_fallocate()
-> file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = ATTR_FORCE | kill;
-> notify_change()
-> ovl_setattr()
// TAKE ON MOUNTER'S CREDS
-> ovl_do_notify_change()
-> notify_change()
// GIVE UP MOUNTER'S CREDS
// TAKE ON MOUNTER'S CREDS
-> vfs_fallocate()
-> xfs_file_fallocate()
-> file_modified()
-> __file_remove_privs()
-> dentry_needs_remove_privs()
-> should_remove_suid()
-> __remove_privs()
newattrs.ia_valid = attr_force | kill;
-> notify_change()
The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.
While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.
Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
f721d24e5d |
tmpfile API change
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY0DP2AAKCRBZ7Krx/gZQ 6/+qAQCEGQWpcC5MB17zylaX7gqzhgAsDrwtpevlno3aIv/1pQD/YWr/E8tf7WTW ERXRXMRx1cAzBJhUhVgIY+3ANfU2Rg4= =cko4 -----END PGP SIGNATURE----- Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs tmpfile updates from Al Viro: "Miklos' ->tmpfile() signature change; pass an unopened struct file to it, let it open the damn thing. Allows to add tmpfile support to FUSE" * tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fuse: implement ->tmpfile() vfs: open inside ->tmpfile() vfs: move open right after ->tmpfile() vfs: make vfs_tmpfile() static ovl: use vfs_tmpfile_open() helper cachefiles: use vfs_tmpfile_open() helper cachefiles: only pass inode to *mark_inode_inuse() helpers cachefiles: tmpfile error handling cleanup hugetlbfs: cleanup mknod and tmpfile vfs: add vfs_tmpfile_open() helper |
||
|
|
0a78a376ef |
for-6.1/io_uring-2022-10-03
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmM67S0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppnPEACkBzilBLKwT9MWdUAITwyrMXsAa1R9gsR9 Tb3Xs+mNO2meuycLAUh4LIbb28NNr7/S5rwWet5NRZ71hgv4Q/WA/0EemAGGXYqd +3MEBAWU3FBFkC/cJXCnT8F5yCXYRkT5n/hzCSNEpNKjQ5JnAhHDlWAjgzZRuD/A A+YJjoBVJJuI1wY4I5XCpeQXEmg/Wc1MgXfyHgWVtGKnYrrxibiCnBZnqbAMZNvD hGn1Vl02ooamGTFm/nW/OAt71DtqsjWUCVOHKmlZ+zBUjbUj6FMXmPVV7vCV9o2w PT4Dx3CTc2iXwa8KfEFNPvXBzy0Qfu8edweP/MvZHWHVZREpEAh4cG6GhwW8whD+ 5mPisqmRjZKe0BBS4k/wKN1RXEypSQoTU4EdljfbQPU/usn35lmjMmEXXgs3IhqM fcTdO5ZUOp+CGyzI0Bc7UtS8vilJbX9ynN8G80MUUAZzuQg39MH7lNQYSJSSvJfU OlvzmL3lhRLYM1s/KKiZzdDBoMvC7R4oHmzCveOjQTMIHf6WNyqKFlrWScq2wzpN oRxqt0xiVQ3PFMmFj6N08f145qtbASuF3sKv7dbU3QXTsCAos3wdTdX+PejYApEZ W3dr0TDjNBicLNVPiSj132p0ZRtdZvLGuGVkBD4GPQeH2NwswxMHQAfz8e2lqmA4 9bWG6BM7Yw== =m9kX -----END PGP SIGNATURE----- Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: - Add supported for more directly managed task_work running. This is beneficial for real world applications that end up issuing lots of system calls as part of handling work. Normal task_work will always execute as we transition in and out of the kernel, even for "unrelated" system calls. It's more efficient to defer the handling of io_uring's deferred work until the application wants it to be run, generally in batches. As part of ongoing work to write an io_uring network backend for Thrift, this has been shown to greatly improve performance. (Dylan) - Add IOPOLL support for passthrough (Kanchan) - Improvements and fixes to the send zero-copy support (Pavel) - Partial IO handling fixes (Pavel) - CQE ordering fixes around CQ ring overflow (Pavel) - Support sendto() for non-zc as well (Pavel) - Support sendmsg for zerocopy (Pavel) - Networking iov_iter fix (Stefan) - Misc fixes and cleanups (Pavel, me) * tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ... |
||
|
|
bc32a6330f |
The first two changes that involve files outside of fs/ext4:
- submit_bh() can never return an error, so change it to return void,
and remove the unused checks from its callers
- fix I_DIRTY_TIME handling so it will be set even if the inode
already has I_DIRTY_INODE
Performance:
- Always enable i_version counter (as btrfs and xfs already do).
Remove some uneeded i_version bumps to avoid unnecessary nfs cache
invalidations.
- Wake up journal waters in FIFO order, to avoid some journal users
from not getting a journal handle for an unfairly long time.
- In ext4_write_begin() allocate any necessary buffer heads before
starting the journal handle.
- Don't try to prefetch the block allocation bitmaps for a read-only
file system.
Bug Fixes:
- Fix a number of fast commit bugs, including resources leaks and out
of bound references in various error handling paths and/or if the fast
commit log is corrupted.
- Avoid stopping the online resize early when expanding a file system
which is less than 16TiB to a size greater than 16TiB.
- Fix apparent metadata corruption caused by a race with a metadata
buffer head getting migrated while it was trying to be read.
- Mark the lazy initialization thread freezable to prevent suspend
failures.
- Other miscellaneous bug fixes.
Cleanups:
- Break up the incredibly long ext4_full_super() function by
refactoring to move code into more understandable, smaller
functions.
- Remove the deprecated (and ignored) noacl and nouser_attr mount
option.
- Factor out some common code in fast commit handling.
- Other miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmM8/2gACgkQ8vlZVpUN
gaPohAf9GDMUq3QIYoWLlJ+ygJhL0xQGPfC6sypMjHaUO5GSo+1+sAMU3JBftxUS
LrgTtmzSKzwp9PyOHNs+mswUzhLZivKVCLMmOznQUZS228GSVKProhN1LPL4UP2Q
Ks8i1M5XTWS+mtJ5J5Mw6jRHxcjfT6ynyJKPnIWKTwXyeru1WSJ2PWqtWQD4EZkE
lImECy0jX/zlK02s0jDYbNIbXIvI/TTYi7wT8o1ouLCAXMDv5gJRc5TXCVtX8i59
/Pl9rGG/+IWTnYT/aQ668S2g0Cz6Wyv2EkmiPUW0Y8NoLaaouBYZoC2hDujiv+l1
ucEI14TEQ+DojJTdChrtwKqgZfqDOw==
=xoLC
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The first two changes involve files outside of fs/ext4:
- submit_bh() can never return an error, so change it to return void,
and remove the unused checks from its callers
- fix I_DIRTY_TIME handling so it will be set even if the inode
already has I_DIRTY_INODE
Performance:
- Always enable i_version counter (as btrfs and xfs already do).
Remove some uneeded i_version bumps to avoid unnecessary nfs cache
invalidations
- Wake up journal waiters in FIFO order, to avoid some journal users
from not getting a journal handle for an unfairly long time
- In ext4_write_begin() allocate any necessary buffer heads before
starting the journal handle
- Don't try to prefetch the block allocation bitmaps for a read-only
file system
Bug Fixes:
- Fix a number of fast commit bugs, including resources leaks and out
of bound references in various error handling paths and/or if the
fast commit log is corrupted
- Avoid stopping the online resize early when expanding a file system
which is less than 16TiB to a size greater than 16TiB
- Fix apparent metadata corruption caused by a race with a metadata
buffer head getting migrated while it was trying to be read
- Mark the lazy initialization thread freezable to prevent suspend
failures
- Other miscellaneous bug fixes
Cleanups:
- Break up the incredibly long ext4_full_super() function by
refactoring to move code into more understandable, smaller
functions
- Remove the deprecated (and ignored) noacl and nouser_attr mount
option
- Factor out some common code in fast commit handling
- Other miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (53 commits)
ext4: fix potential out of bound read in ext4_fc_replay_scan()
ext4: factor out ext4_fc_get_tl()
ext4: introduce EXT4_FC_TAG_BASE_LEN helper
ext4: factor out ext4_free_ext_path()
ext4: remove unnecessary drop path references in mext_check_coverage()
ext4: update 'state->fc_regions_size' after successful memory allocation
ext4: fix potential memory leak in ext4_fc_record_regions()
ext4: fix potential memory leak in ext4_fc_record_modified_inode()
ext4: remove redundant checking in ext4_ioctl_checkpoint
jbd2: add miss release buffer head in fc_do_one_pass()
ext4: move DIOREAD_NOLOCK setting to ext4_set_def_opts()
ext4: remove useless local variable 'blocksize'
ext4: unify the ext4 super block loading operation
ext4: factor out ext4_journal_data_mode_check()
ext4: factor out ext4_load_and_init_journal()
ext4: factor out ext4_group_desc_init() and ext4_group_desc_free()
ext4: factor out ext4_geometry_check()
ext4: factor out ext4_check_feature_compatibility()
ext4: factor out ext4_init_metadata_csum()
ext4: factor out ext4_encoding_init()
...
|
||
|
|
7a3353c5c4 |
struct file-related stuff
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYzxjIQAKCRBZ7Krx/gZQ 6/FPAQCNCZygQzd+54//vo4kTwv5T2Bv3hS8J51rASPJT87/BQD/TfCLS5urt/Gt 81A1dFOfnTXseofuBKyGSXwQm0dWpgA= =PLre -----END PGP SIGNATURE----- Merge tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs file updates from Al Viro: "struct file-related stuff" * tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: dma_buf_getfile(): don't bother with ->f_flags reassignments Change calling conventions for filldir_t locks: fix TOCTOU race when granting write lease |
||
|
|
cbfecb927f |
fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
Currently the I_DIRTY_TIME will never get set if the inode already has I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME. That's true, however ext4 will only update the on-disk inode in ->dirty_inode(), not on actual writeback. As a result if the inode already has I_DIRTY_INODE state by the time we get to __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled into on-disk inode and will not get updated until the next I_DIRTY_INODE update, which might never come if we crash or get a power failure. The problem can be reproduced on ext4 by running xfstest generic/622 with -o iversion mount option. Fix it by allowing I_DIRTY_TIME to be set even if the inode already has I_DIRTY_INODE. Also make sure that the case is properly handled in writeback_single_inode() as well. Additionally changes in xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag. Thanks Jan Kara for suggestions on how to make this work properly. Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: stable@kernel.org Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
||
|
|
863f144f12 |
vfs: open inside ->tmpfile()
This is in preparation for adding tmpfile support to fuse, which requires that the tmpfile creation and opening are done as a single operation. Replace the 'struct dentry *' argument of i_op->tmpfile with 'struct file *'. Call finish_open_simple() as the last thing in ->tmpfile() instances (may be omitted in the error case). Change d_tmpfile() argument to 'struct file *' as well to make callers more readable. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
|
|
9751b33865 |
vfs: move open right after ->tmpfile()
Create a helper finish_open_simple() that opens the file with the original dentry. Handle the error case here as well to simplify callers. Call this helper right after ->tmpfile() is called. Next patch will change the tmpfile API and move this call into tmpfile instances. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
|
|
3e9d4c5935 |
vfs: make vfs_tmpfile() static
No callers outside of fs/namei.c anymore. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
|
|
22873deac9 |
vfs: add vfs_tmpfile_open() helper
This helper unifies tmpfile creation with opening. Existing vfs_tmpfile() callers outside of fs/namei.c will be converted to using this helper. There are two such callers: cachefile and overlayfs. The cachefiles code currently uses the open_with_fake_path() helper to open the tmpfile, presumably to disable accounting of the open file. Overlayfs uses tmpfile for copy_up, which means these struct file instances will be short lived, hence it doesn't really matter if they are accounted or not. Disable accounting in this helper too, which should be okay for both callers. Add MAY_OPEN permission checking for consistency. Like for create(2) read/write permissions are not checked. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
|
|
d7e7b9af10 |
fscrypt: stop using keyrings subsystem for fscrypt_master_key
The approach of fs/crypto/ internally managing the fscrypt_master_key
structs as the payloads of "struct key" objects contained in a
"struct key" keyring has outlived its usefulness. The original idea was
to simplify the code by reusing code from the keyrings subsystem.
However, several issues have arisen that can't easily be resolved:
- When a master key struct is destroyed, blk_crypto_evict_key() must be
called on any per-mode keys embedded in it. (This started being the
case when inline encryption support was added.) Yet, the keyrings
subsystem can arbitrarily delay the destruction of keys, even past the
time the filesystem was unmounted. Therefore, currently there is no
easy way to call blk_crypto_evict_key() when a master key is
destroyed. Currently, this is worked around by holding an extra
reference to the filesystem's request_queue(s). But it was overlooked
that the request_queue reference is *not* guaranteed to pin the
corresponding blk_crypto_profile too; for device-mapper devices that
support inline crypto, it doesn't. This can cause a use-after-free.
- When the last inode that was using an incompletely-removed master key
is evicted, the master key removal is completed by removing the key
struct from the keyring. Currently this is done via key_invalidate().
Yet, key_invalidate() takes the key semaphore. This can deadlock when
called from the shrinker, since in fscrypt_ioctl_add_key(), memory is
allocated with GFP_KERNEL under the same semaphore.
- More generally, the fact that the keyrings subsystem can arbitrarily
delay the destruction of keys (via garbage collection delay, or via
random processes getting temporary key references) is undesirable, as
it means we can't strictly guarantee that all secrets are ever wiped.
- Doing the master key lookups via the keyrings subsystem results in the
key_permission LSM hook being called. fscrypt doesn't want this, as
all access control for encrypted files is designed to happen via the
files themselves, like any other files. The workaround which SELinux
users are using is to change their SELinux policy to grant key search
access to all domains. This works, but it is an odd extra step that
shouldn't really have to be done.
The fix for all these issues is to change the implementation to what I
should have done originally: don't use the keyrings subsystem to keep
track of the filesystem's fscrypt_master_key structs. Instead, just
store them in a regular kernel data structure, and rework the reference
counting, locking, and lifetime accordingly. Retain support for
RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free()
with fscrypt_sb_delete(), which releases the keys synchronously and runs
a bit earlier during unmount, so that block devices are still available.
A side effect of this patch is that neither the master keys themselves
nor the filesystem keyrings will be listed in /proc/keys anymore.
("Master key users" and the master key users keyrings will still be
listed.) However, this was mostly an implementation detail, and it was
intended just for debugging purposes. I don't know of anyone using it.
This patch does *not* change how "master key users" (->mk_users) works;
that still uses the keyrings subsystem. That is still needed for key
quotas, and changing that isn't necessary to solve the issues listed
above. If we decide to change that too, it would be a separate patch.
I've marked this as fixing the original commit that added the fscrypt
keyring, but as noted above the most important issue that this patch
fixes wasn't introduced until the addition of inline encryption support.
Fixes:
|
||
|
|
de97fcb303 |
fs: add batch and poll flags to the uring_cmd_iopoll() handler
We need the poll_flags to know how to poll for the IO, and we should have the batch structure in preparation for supporting batched completions with iopoll. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
de27e18e86 |
fs: add file_operations->uring_cmd_iopoll
io_uring will invoke this to do completion polling on uring-cmd operations. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220823161443.49436-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
25885a35a7 |
Change calling conventions for filldir_t
filldir_t instances (directory iterators callbacks) used to return 0 for
"OK, keep going" or -E... for "stop". Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).
So let's just return bool ("should we keep going?") - it's less confusing
that way. The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
do something for everything in directory, until we run into problem
and
find an entry in directory and do something to it.
The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value did they get.
"true means stop" would be more natural for the second group; "true
means keep going" - for the first one. I tried both variants and
the things like
if allocation failed
something = -ENOMEM;
return true;
just looked unnatural and asking for trouble.
[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
||
|
|
1da8cf961b |
io_uring-6.0-2022-08-13
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmL3+fQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmXyEACfERdKYdZ/W3IvPoyK8CJ3p7f/6SOj2/p1 DTuaa3l7/kVq2HcRUGgZwvgeWpOCFghdBm5co/4hGqSw7bT8rERGDelo41ohhTfr xKIiwJflK/s280VXLJFA+o7Jeoj1oTFYCmdUmU3wcKFVnQdu1rz9s0L6bwsEqq93 y1uty96dxYZn2mENLbBah0x9yV0h2ZxRkguUm0sdnKl/tMkUVLSD1TPLHf2s6eAL o3Dbmo9jv4HFXoJj8YL50Oxl22zIKBHl9hZqHdLcKesFgyFTChckKUNijWyPL2vE zesbnd57sXgY6ghi4LDGeCOtN41WNjiVeAm/c4XK5oFhTag8Q2x0D1hTPUByHksl IV/116xs6pHTeZRhNlMOBVMZGLSz95zSuRUyTONAmKgc/b3if/w3zTi1W3CnJSlx 7O5GpqQDZTQuin0jldNKImbx1aPAATb+UWDkl7O5aXkjw4FUtxT5GrYcBNswVuKX iybx8NyVn8kFD1hix3U8huBOPSg1JMkR+sFml+NqYRd4i2CwV8KAPPuzsPw6MRBL U4DfkAkpsbKqSK+mri5aUrYxmpYkJ45mgyldiewiOso9+AYg9DDp3D2iGgAiRbKm i3pz1Gh/3iUow0UAI5ZFlDhjHgWPlIH7IBbemivhjhFV4GrXJqTwUzsA1iDKTe14 3lHKkAPVPA== =FfLf -----END PGP SIGNATURE----- Merge tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: - Regression fix for this merge window, fixing a wrong order of arguments for io_req_set_res() for passthru (Dylan) - Fix for the audit code leaking context memory (Peilin) - Ensure that provided buffers are memcg accounted (Pavel) - Correctly handle short zero-copy sends (Pavel) - Sparse warning fixes for the recvmsg multishot command (Dylan) - Error handling fix for passthru (Anuj) - Remove randomization of struct kiocb fields, to avoid it growing in size if re-arranged in such a fashion that it grows more holes or padding (Keith, Linus) - Small series improving type safety of the sqe fields (Stefan) * tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block: io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields io_uring: make io_kiocb_to_cmd() typesafe fs: don't randomize struct kiocb fields io_uring: consistently make use of io_notif_to_data() io_uring: fix error handling for io_uring_cmd io_uring: fix io_recvmsg_prep_multishot sparse warnings io_uring/net: send retry for zerocopy io_uring: mem-account pbuf buckets audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker() io_uring: pass correct parameters to io_req_set_res |
||
|
|
addebd9ac9 |
fs: don't randomize struct kiocb fields
This is a size sensitive structure and randomizing can introduce extra padding that breaks io_uring's fixed size expectations. There are few fields here as it is, half of which need a fixed order to optimally pack, so the randomization isn't providing much. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/io-uring/b6f508ca-b1b2-5f40-7998-e4cff1cf7212@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
426b4ca2d6 |
fs.setgid.v6.0
-----BEGIN PGP SIGNATURE-----
iHQEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYvIYIgAKCRCRxhvAZXjc
omE8AQDAZG2YjNJfMnUUaaWYO3+zTaHlQp7OQkQTXIHfcfViXQD4vPt3Wxh3rrF+
J8fwNcWmXhSNei8HP6cA06QmSajnDQ==
=GF9/
-----END PGP SIGNATURE-----
Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull setgid updates from Christian Brauner:
"This contains the work to move setgid stripping out of individual
filesystems and into the VFS itself.
Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires
additional privileges to avoid security issues.
When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.
However, there are several key issues with the current implementation:
- S_ISGID stripping logic is entangled with umask stripping.
For example, if the umask removes the S_IXGRP bit from the file
about to be created then the S_ISGID bit will be kept.
The inode_init_owner() helper is responsible for S_ISGID stripping
and is called before posix_acl_create(). So we can end up with two
different orderings:
1. FS without POSIX ACL support
First strip umask then strip S_ISGID in inode_init_owner().
In other words, if a filesystem doesn't support or enable POSIX
ACLs then umask stripping is done directly in the vfs before
calling into the filesystem:
2. FS with POSIX ACL support
First strip S_ISGID in inode_init_owner() then strip umask in
posix_acl_create().
In other words, if the filesystem does support POSIX ACLs then
unmask stripping may be done in the filesystem itself when
calling posix_acl_create().
Note that technically filesystems are free to impose their own
ordering between posix_acl_create() and inode_init_owner() meaning
that there's additional ordering issues that influence S_ISGID
inheritance.
(Note that the commit message of commit
|
||
|
|
2bd5d41e0e |
fuse update for 6.0
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYvD4IQAKCRDh3BK/laaZ PDHHAP93H+2E9c6biGd5pEaL2ABChRY+wsQURzD+SZ6AL3JaUQD+KHxA4q0MJvws L6CWcf2XptUDCLe3P6sgSTvv5Gk1OAM= =FoXF -----END PGP SIGNATURE----- Merge tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix an issue with reusing the bdi in case of block based filesystems - Allow root (in init namespace) to access fuse filesystems in user namespaces if expicitly enabled with a module param - Misc fixes * tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: retire block-device-based superblock on force unmount vfs: function to prevent re-use of block-device-based superblocks virtio_fs: Modify format for virtio_fs_direct_access virtiofs: delete unused parameter for virtio_fs_cleanup_vqs fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other fuse: Remove the control interface for virtio-fs fuse: ioctl: translate ENOSYS fuse: limit nsec fuse: avoid unnecessary spinlock bump fuse: fix deadlock between atomic O_TRUNC and page invalidation fuse: write inode in fuse_release() |
||
|
|
6614a3c316 |
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
||
|
|
5264406cdb |
iov_iter work, part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter()
and ->write_iter() instead of ->read()/->write(); new_sync_{read,write}()
has a surprising amount of overhead, in particular inside iocb_flags().
That's why the beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYurGOQAKCRBZ7Krx/gZQ
6ysyAP91lvBfMRepcxpd9kvtuzWkU8A3rfSziZZteEHANB9Q7QEAiPn2a2OjWkcZ
uAyUWfCkHCNx+dSMkEvUgR5okQ0exAM=
=9UCV
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
|
||
|
|
a782e86649 |
Saner handling of "lseek should fail with ESPIPE" - gets rid of
magical no_llseek thing and makes checks consistent. In particular, ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYug2xgAKCRBZ7Krx/gZQ 6wxWAQDqeg+xMq2FGPXmgjCa+Cp3PXH96Lp6f3hHzakIDx+t8gEAxvuiXAD22Mct 6S1SKuGj0iDIuM4L7hUiWTiY/bDXSAc= =3EC/ -----END PGP SIGNATURE----- Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs lseek updates from Al Viro: "Jason's lseek series. Saner handling of 'lseek should fail with ESPIPE' - this gets rid of the magical no_llseek thing and makes checks consistent. In particular, the ad-hoc "can we do splice via internal pipe" checks got saner (and somewhat more permissive, which is what Jason had been after, AFAICT)" * tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: remove no_llseek fs: check FMODE_LSEEK to control internal pipe splicing vfio: do not set FMODE_LSEEK flag dma-buf: remove useless FMODE_LSEEK flag fs: do not compare against ->llseek fs: clear or set FMODE_LSEEK based on llseek function |
||
|
|
f00654007f |
Folio changes for 6.0
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into their
own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
/58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
=DgUi
-----END PGP SIGNATURE-----
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
|
||
|
|
98e2474640 |
for-5.20/io_uring-buffered-writes-2022-07-29
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm7UQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpldTEADTg/96R+eq78UZBNZmdifY9/qwQD+kzNiK ACDoYZFSbWUjMeOWqRxbYr6mXBKHnGHyTGlraTTpLDzhpB1xwoWfgOK9uOYXW/Ik eWfgTujPW/8v/l/z86khE+GH9b/maGCRqNZgS6uLVLzhxG6oCkoYTyOh1iHaF1VM Rma4nbJ8GSEDtiXNDl0Bznnyks/pzwoz/9slwzZ7PxtFwZsBxKuxgMUR5HIXdRp7 5iUoFJhZrGWyi/dbQZUsK/9VYVVnKkcBCz2pb4GEmC+3dS/vlPEoeWUpPHInNyd1 9NB9v8c+KFmQaWnCxuxcdHvCfmRRQrX8Pr8/OBNZKO6McYrKWKA+lurp4EGClE3m cZdK+P/9FS/Eeua8hum9UnbPAqsJPqLTbpbrySeBdd4iFA6u7rRqDX2+nz3PNe9U 1b7V1bWBIEY/Rsw/PKo59oIeV0auD8v9OCHJ0lF2pv6dRln2/W0y1Qfd1DI18xFG +9bBnQzhF7R0O8UP5ApVayQCYrd906YsSVUOqAiLmUs/BoOgRq6g/0BqSOVVKE2u 5iq8zTsVMkxY0ZpExwZST/700JwkPIV4SVPEYRC6QssFTcylvlisIek6XYSS9HX4 Z6gzMwJW1H47bEfG4JolTI8uBjp0hQLCPX0O0XFLVnbHQwN0kjIBmv3axAwJO2NV qrrHXjf09w== =hV7G -----END PGP SIGNATURE----- Merge tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block Pull io_uring buffered writes support from Jens Axboe: "This contains support for buffered writes, specifically for XFS. btrfs is in progress, will be coming in the next release. io_uring does support buffered writes on any file type, but since the buffered write path just always -EAGAIN (or -EOPNOTSUPP) any attempt to do so if IOCB_NOWAIT is set, any buffered write will effectively be handled by io-wq offload. This isn't very efficient, and we even have specific code in io-wq to serialize buffered writes to the same inode to avoid further inefficiencies with thread offload. This is particularly sad since most buffered writes don't block, they simply copy data to a page and dirty it. With this pull request, we can handle buffered writes a lot more effiently. If balance_dirty_pages() needs to block, we back off on writes as indicated. This improves buffered write support by 2-3x. Jan Kara helped with the mm bits for this, and Stefan handled the fs/iomap/xfs/io_uring parts of it" * tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block: mm: honor FGP_NOWAIT for page cache page allocation xfs: Add async buffered write support xfs: Specify lockmode when calling xfs_ilock_for_iomap() io_uring: Add tracepoint for short writes io_uring: fix issue with io_write() not always undoing sb_start_write() io_uring: Add support for async buffered writes fs: Add async write file modification handling. fs: Split off inode_needs_update_time and __file_update_time fs: add __remove_file_privs() with flags parameter fs: add a FMODE_BUF_WASYNC flags for f_mode iomap: Return -EAGAIN from iomap_write_iter() iomap: Add async buffered write support iomap: Add flags parameter to iomap_page_create() mm: Add balance_dirty_pages_ratelimited_flags() function mm: Move updates of dirty_exceeded into one place mm: Move starting of background writeback into the main balancing loop |
||
|
|
9d0ddc0cb5 |
fs: Remove aops->migratepage()
With all users converted to migrate_folio(), remove this operation. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
||
|
|
67235182a4 |
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their declarations to buffer.h and switch all filesystems that have wired them up. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
||
|
|
5490da4f06 |
fs: Add aops->migrate_folio
Provide a folio-based replacement for aops->migratepage. Update the documentation to document migrate_folio instead of migratepage. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> |
||
|
|
68f2736a85 |
mm: Convert all PageMovable users to movable_operations
These drivers are rather uncomfortably hammered into the address_space_operations hole. They aren't filesystems and don't behave like filesystems. They just need their own movable_operations structure, which we can point to directly from page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> |
||
|
|
04b9407197 |
vfs: function to prevent re-use of block-device-based superblocks
The function is to be called from filesystem-specific code to mark a superblock to be ignored by superblock test and thus never re-used. The function also unregisters bdi if the bdi is per-superblock to avoid collision if a new superblock is created to represent the filesystem. generic_shutdown_super() skips unregistering bdi for a retired superlock as it assumes retire function has already done it. This patch adds the functionality only for the block-device-based supers, since the primary use case of the feature is to gracefully handle force unmount of external devices, mounted with FUSE. This can be further extended to cover all superblocks, if the need arises. Signed-off-by: Daniil Lunev <dlunev@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> |
||
|
|
66fa3cedf1 |
fs: Add async write file modification handling.
This adds a file_modified_async() function to return -EAGAIN if the request either requires to remove privileges or needs to update the file modification time. This is required for async buffered writes, so the request gets handled in the io worker of io-uring. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-11-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
8017553980 |
fs: add a FMODE_BUF_WASYNC flags for f_mode
This introduces the flag FMODE_BUF_WASYNC. If devices support async buffered writes, this flag can be set. It also modifies the check in generic_write_checks to take async buffered writes into consideration. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-8-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
2b3416ceff
|
fs: add mode_strip_sgid() helper
Add a dedicated helper to handle the setgid bit when creating a new file in a setgid directory. This is a preparatory patch for moving setgid stripping into the vfs. The patch contains no functional changes. Currently the setgid stripping logic is open-coded directly in inode_init_owner() and the individual filesystems are responsible for handling setgid inheritance. Since this has proven to be brittle as evidenced by old issues we uncovered over the last months (see [1] to [3] below) we will try to move this logic into the vfs. Link: |
||
|
|
6f7db3894a |
fsdax: dedup file range to use a compare function
With dax we cannot deal with readpage() etc. So, we create a dax comparison function which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
868941b144 |
fs: remove no_llseek
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
9adf24a409
|
fs: port HAS_UNMAPPED_ID() to vfs{g,u}id_t
The HAS_UNMAPPED_ID() helper is fully self contained so we can port it
to vfs{g,u}id_t without much effort.
Cc: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
b27c82e129
|
attr: port attribute changes to new types
Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.
This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.
Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.
The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.
We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.
Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.
The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.
Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
1f36146a5a
|
fs: introduce tiny iattr ownership update helpers
Nearly all fileystems currently open-code the same checks for
determining whether the i_{g,u}id fields of an inode need to be updated
and then updating the fields.
Introduce tiny helpers i_{g,u}id_needs_update() and i_{g,u}id_update()
that wrap this logic. This allows filesystems to not care about updating
inode->i_{g,u}id with the correct values themselves instead leaving this
to the helpers.
We also get rid of a lot of code duplication and make it easier to
change struct iattr in the future since changes can be localized to
these helpers.
And finally we make it hard to conflate k{g,u}id_t types with
vfs{g,u}id_t types for filesystems that support idmapped mounts.
In the following patch we will port all filesystems that raise
FS_ALLOW_IDMAP to use the new helpers. However, the ultimate goal is to
convert all filesystems to make use of these helpers.
All new helpers are nops on non-idmapped mounts.
Link: https://lore.kernel.org/r/20220621141454.2914719-5-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
45c311501c
|
fs: use mount types in iattr
Add ia_vfs{g,u}id members of type vfs{g,u}id_t to struct iattr. We use
an anonymous union (similar to what we do in struct file) around
ia_{g,u}id and ia_vfs{g,u}id.
At the end of this series ia_{g,u}id and ia_vfs{g,u}id will always
contain the same value independent of whether struct iattr is
initialized from an idmapped mount. This is a change from how this is
done today.
Wrapping this in a anonymous unions has a few advantages. It allows us
to avoid needlessly increasing struct iattr. Since the types for
ia_{g,u}id and ia_vfs{g,u}id are structures with overlapping/identical
members they are covered by 6.5.2.3/6 of the C standard and it is safe
to initialize and access them.
Filesystems that raise FS_ALLOW_IDMAP and thus support idmapped mounts
will have to use ia_vfs{g,u}id and the associated helpers. And will be
ported at the end of this series. They will immediately benefit from the
type safe new helpers.
Filesystems that do not support FS_ALLOW_IDMAP can continue to use
ia_{g,u}id for now. The aim is to convert every filesystem to always use
ia_vfs{g,u}id and thus ultimately remove the ia_{g,u}id members.
Link: https://lore.kernel.org/r/20220621141454.2914719-4-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
234a3113f2
|
fs: add two type safe mapping helpers
Introduce i_{g,u}id_into_vfs{g,u}id(). They return vfs{g,u}id_t. This
makes it way harder to confused idmapped mount {g,u}ids with filesystem
{g,u}ids.
The two helpers will eventually replace the old non type safe
i_{g,u}id_into_mnt() helpers once we finished converting all places. Add
a comment noting that they will be removed in the future.
All new helpers are nops on non-idmapped mounts.
Link: https://lore.kernel.org/r/20220621141454.2914719-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
||
|
|
164f4064ca |
keep iocb_flags() result cached in struct file
* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal opens, alloc_file() for pipe()/socket()/etc.) * update when handling F_SETFL * keep in a new field - file->f_iocb_flags; since that thing is needed only before the refcount reaches zero, we can put it into the same anon union where ->f_rcuhead and ->f_llist live - those are used only after refcount reaches zero. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
91b94c5d6a |
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
New helper to be used instead of direct checks for IOCB_DSYNC: iocb_is_dsync(iocb). Checks converted, which allows to avoid the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines) from iocb_flags() - it's checked in iocb_is_dsync() instead Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
e87f2c26c8 |
struct file: use anonymous union member for rcuhead and llist
Once upon a time we couldn't afford anon unions; these days minimal gcc version had been raised enough to take care of that. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
cbd76edeab |
Cleanups (and one fix) around struct mount handling.
The fix is usermode_driver.c one - once you've done kern_mount(), you must kern_unmount(); simple mntput() will end up with a leak. Several failure exits in there messed up that way... In practice you won't hit those particular failure exits without fault injection, though. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpvrWQAKCRBZ7Krx/gZQ 6z29AP9EZVSyIvnwXleehpa2mEZhsp+KAKgV/ENaKHMn7jiH0wD/bfgnhxIDNuc5 108E2R5RWEYTynW5k7nnP5PsTsMq5Qc= =b3Wc -----END PGP SIGNATURE----- Merge tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount handling updates from Al Viro: "Cleanups (and one fix) around struct mount handling. The fix is usermode_driver.c one - once you've done kern_mount(), you must kern_unmount(); simple mntput() will end up with a leak. Several failure exits in there messed up that way... In practice you won't hit those particular failure exits without fault injection, though" * tag 'pull-18-rc1-work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: move mount-related externs from fs.h to mount.h blob_to_mnt(): kern_unmount() is needed to undo kern_mount() m->mnt_root->d_inode->i_sb is a weird way to spell m->mnt_sb... linux/mount.h: trim includes uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) |
||
|
|
dbe0ee4661 |
Descriptor handling cleanups
-----BEGIN PGP SIGNATURE----- iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpwEZAAKCRBZ7Krx/gZQ 691uAP0QnwO0lOYOa41MfQ6QnzPbiYcffqtUuTJBWyfs8+WnugD2NNGQP7Zjtin9 q0wcv2KA6yY7qgu7RCCbxU/7J0YdCg== =ywuh -----END PGP SIGNATURE----- Merge tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull file descriptor updates from Al Viro. - Descriptor handling cleanups * tag 'pull-18-rc1-work.fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Unify the primitives for file descriptor closing fs: remove fget_many and fput_many interface io_uring_enter(): don't leave f.flags uninitialized |