mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-11 04:28:51 +00:00
f1f73651a0
19371 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ce8a79d560 |
for-6.2/block-2022-12-08
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOScsgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi5ID/9pLXFYOq1+uDjU0KO/MdjMjK8Ukr34lCnk
WkajRLheE8JBKOFDE54XJk56sQSZHX9bTWqziar0h1fioh7FlQR/tVvzsERCm2M9
2y9THJNJygC68wgybStyiKlshFjl7TD7Kv5N9Y3xP3mkQygT+D6o8fXZk5xQbYyH
YdFSoq4rJVHxRL03yzQiReGGIYdOUEQQh8l1FiLwLlKa3lXAey1KuxWIzksVN0KK
aZB4QhiBpOiPgDHUVisq2XtyQjpZ2byoCImPzgrcqk9Jo4esvm/e6esrg4xlsvII
LKFFkTmbVqjUZtFjqakFHmfuzVor4nU5f+xb90ZHExuuODYckkxWp5rWhf9QwqqI
0ik6WYgI1/5vnHnX8f2DYzOFQf9qa/rLgg0CshyUODlD6RfHa9vntqYvlIFkmOBd
Q7KblIoK8YTzUS1M+v7X8JQ7gDR2KwygH37Da2KJS+vgvfIb8kJGr1ZORuhJuJJ7
Bl69gaNkHTHrqufp7UI64YXfueeuNu2J9z3zwzGoxeaFaofF/phDn0/2gCQE1fQI
XBhsMw+ETqI6B2SPHMnzYDu2DM1S8ZTOYQlaD4G3uqgWnAM1tG707395uAy5yu4n
D5azU1fVG4UocoNIyPujpaoSRs2zWZycEFEeUQkhyDDww/j4hlHi6H33eOnk0zsr
wxzFGfvHfw==
=k/vv
-----END PGP SIGNATURE-----
Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull requests via Christoph:
- Support some passthrough commands without CAP_SYS_ADMIN (Kanchan
Joshi)
- Refactor PCIe probing and reset (Christoph Hellwig)
- Various fabrics authentication fixes and improvements (Sagi
Grimberg)
- Avoid fallback to sequential scan due to transient issues (Uday
Shankar)
- Implement support for the DEAC bit in Write Zeroes (Christoph
Hellwig)
- Allow overriding the IEEE OUI and firmware revision in configfs
for nvmet (Aleksandr Miloserdov)
- Force reconnect when number of queue changes in nvmet (Daniel
Wagner)
- Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi
Grimberg, Christoph Hellwig, Christophe JAILLET)
- Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
- Use the common tagset helpers in nvme-pci driver (Christoph
Hellwig)
- Cleanup the nvme-pci removal path (Christoph Hellwig)
- Use kstrtobool() instead of strtobool (Christophe JAILLET)
- Allow unprivileged passthrough of Identify Controller (Joel
Granados)
- Support io stats on the mpath device (Sagi Grimberg)
- Minor nvmet cleanup (Sagi Grimberg)
- MD pull requests via Song:
- Code cleanups (Christoph)
- Various fixes
- Floppy pull request from Denis:
- Fix a memory leak in the init error path (Yuan)
- Series fixing some batch wakeup issues with sbitmap (Gabriel)
- Removal of the pktcdvd driver that was deprecated more than 5 years
ago, and subsequent removal of the devnode callback in struct
block_device_operations as no users are now left (Greg)
- Fix for partition read on an exclusively opened bdev (Jan)
- Series of elevator API cleanups (Jinlong, Christoph)
- Series of fixes and cleanups for blk-iocost (Kemeng)
- Series of fixes and cleanups for blk-throttle (Kemeng)
- Series adding concurrent support for sync queues in BFQ (Yu)
- Series bringing drbd a bit closer to the out-of-tree maintained
version (Christian, Joel, Lars, Philipp)
- Misc drbd fixes (Wang)
- blk-wbt fixes and tweaks for enable/disable (Yu)
- Fixes for mq-deadline for zoned devices (Damien)
- Add support for read-only and offline zones for null_blk
(Shin'ichiro)
- Series fixing the delayed holder tracking, as used by DM (Yu,
Christoph)
- Series enabling bio alloc caching for IRQ based IO (Pavel)
- Series enabling userspace peer-to-peer DMA (Logan)
- BFQ waker fixes (Khazhismel)
- Series fixing elevator refcount issues (Christoph, Jinlong)
- Series cleaning up references around queue destruction (Christoph)
- Series doing quiesce by tagset, enabling cleanups in drivers
(Christoph, Chao)
- Series untangling the queue kobject and queue references (Christoph)
- Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye,
Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)
* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
blktrace: Fix output non-blktrace event when blk_classic option enabled
block: sed-opal: Don't include <linux/kernel.h>
sed-opal: allow using IOC_OPAL_SAVE for locking too
blk-cgroup: Fix typo in comment
block: remove bio_set_op_attrs
nvmet: don't open-code NVME_NS_ATTR_RO enumeration
nvme-pci: use the tagset alloc/free helpers
nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
nvme: consolidate setting the tagset flags
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
block: bio_copy_data_iter
nvme-pci: split out a nvme_pci_ctrl_is_dead helper
nvme-pci: return early on ctrl state mismatch in nvme_reset_work
nvme-pci: rename nvme_disable_io_queues
nvme-pci: cleanup nvme_suspend_queue
nvme-pci: remove nvme_pci_disable
nvme-pci: remove nvme_disable_admin_queue
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
nvme: use nvme_wait_ready in nvme_shutdown_ctrl
...
|
||
|
|
02bf43c7b7 |
fs.xattr.simple.rework.rbtree.rwlock.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bw/wAKCRCRxhvAZXjc
ol79AQCsHS9s78dLUvdasfQ1023dyF9zaQ8XGkDO6tRssJzGAAD7B8odxDsfQgjQ
Qzzn9YPZVUgHjd4xBg21UVPmRP5snwQ=
=wYgr
-----END PGP SIGNATURE-----
Merge tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull simple-xattr updates from Christian Brauner:
"This ports the simple xattr infrastucture to rely on a simple rbtree
protected by a read-write lock instead of a linked list protected by a
spinlock.
A while ago we received reports about scaling issues for filesystems
using the simple xattr infrastructure that also support setting a
larger number of xattrs. Specifically, cgroups and tmpfs.
Both cgroupfs and tmpfs can be mounted by unprivileged users in
unprivileged containers and root in an unprivileged container can set
an unrestricted number of security.* xattrs and privileged users can
also set unlimited trusted.* xattrs. A few more words on further that
below. Other xattrs such as user.* are restricted for kernfs-based
instances to a fairly limited number.
As there are apparently users that have a fairly large number of
xattrs we should scale a bit better. Using a simple linked list
protected by a spinlock used for set, get, and list operations doesn't
scale well if users use a lot of xattrs even if it's not a crazy
number.
Let's switch to a simple rbtree protected by a rwlock. It scales way
better and gets rid of the perf issues some people reported. We
originally had fancier solutions even using an rcu+seqlock protected
rbtree but we had concerns about being to clever and also that
deletion from an rbtree with rcu+seqlock isn't entirely safe.
The rbtree plus rwlock is perfectly fine. By far the most common
operation is getting an xattr. While setting an xattr is not and
should be comparatively rare. And listxattr() often only happens when
copying xattrs between files or together with the contents to a new
file.
Holding a lock across listxattr() is unproblematic because it doesn't
list the values of xattrs. It can only be used to list the names of
all xattrs set on a file. And the number of xattr names that can be
listed with listxattr() is limited to XATTR_LIST_MAX aka 65536 bytes.
If a larger buffer is passed then vfs_listxattr() caps it to
XATTR_LIST_MAX and if more xattr names are found it will return
-E2BIG. In short, the maximum amount of memory that can be retrieved
via listxattr() is limited and thus listxattr() bounded.
Of course, the API is broken as documented on xattr(7) already. While
I have no idea how the xattr api ended up in this state we should
probably try to come up with something here at some point. An iterator
pattern similar to readdir() as an alternative to listxattr() or
something else.
Right now it is extremly strange that users can set millions of xattrs
but then can't use listxattr() to know which xattrs are actually set.
And it's really trivial to do:
for i in {1..1000000}; do setfattr -n security.$i -v $i ./file1; done
And around 5000 xattrs it's impossible to use listxattr() to figure
out which xattrs are actually set. So I have suggested that we try to
limit the number of xattrs for simple xattrs at least. But that's a
future patch and I don't consider it very urgent.
A bonus of this port to rbtree+rwlock is that we shrink the memory
consumption for users of the simple xattr infrastructure.
This also adds kernel documentation to all the functions"
* tag 'fs.xattr.simple.rework.rbtree.rwlock.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
xattr: use rbtree for simple_xattrs
|
||
|
|
deb9acc122 |
A large number of cleanups and bug fixes, with many of the bug fixes
found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc.) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented.) -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmOWqrMACgkQ8vlZVpUN gaMvmgf+P2C6vzjn13ZdF+GwFTi4fx4TJ5BZT78LQqvTZqhkfk4k1q2SFfHI7nXT ZWdu1KUQ0SYLo64oaSU9W+2B2pmGi/KgUlrwNhy8DFeGStogPuDVfmGWB63p1UQL ld42mE9q7bjY6nCZSKYXPp2jfSwsHuliHBJ4UfzVNAIwjiUEJ7pGeIrMFdLAEkVm TVNzvlUZaHUnVxhpsP6hs+5WNhHQ2IhWz4rwX01ussNgHTijYac4iaL05wpTvF5e 6NtvfmpOEMAbYrmIkJX4RVss4JNsHNOC0E8fjEHlgXJxBiAI6w8GxTxrS52Y4ELH nHXl/pc0L+I8+yh9B9+s0LBaSuPuTg== =lezv -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with many of the bug fixes found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: fix reserved cluster accounting in __es_remove_extent() ext4: fix inode leak in ext4_xattr_inode_create() on an error path ext4: allocate extended attribute value in vmalloc area ext4: avoid unaccounted block allocation when expanding inode ext4: initialize quota before expanding inode in setproject ioctl ext4: stop providing .writepage hook mm: export buffer_migrate_folio_norefs() ext4: switch to using write_cache_pages() for data=journal writeout jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout ext4: switch to using ext4_do_writepages() for ordered data writeout ext4: move percpu_rwsem protection into ext4_writepages() ext4: provide ext4_do_writepages() ext4: add support for writepages calls that cannot map blocks ext4: drop pointless IO submission from ext4_bio_write_page() ext4: remove nr_submitted from ext4_bio_write_page() ext4: move keep_towrite handling to ext4_bio_write_page() ext4: handle redirtying in ext4_bio_write_page() ext4: fix kernel BUG in 'ext4_write_inline_data_end()' ext4: make ext4_mb_initialize_context return void ext4: fix deadlock due to mbcache entry corruption ... |
||
|
|
6a518afcc2 |
fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
=27Wm
-----END PGP SIGNATURE-----
Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull VFS acl updates from Christian Brauner:
"This contains the work that builds a dedicated vfs posix acl api.
The origins of this work trace back to v5.19 but it took quite a while
to understand the various filesystem specific implementations in
sufficient detail and also come up with an acceptable solution.
As we discussed and seen multiple times the current state of how posix
acls are handled isn't nice and comes with a lot of problems: The
current way of handling posix acls via the generic xattr api is error
prone, hard to maintain, and type unsafe for the vfs until we call
into the filesystem's dedicated get and set inode operations.
It is already the case that posix acls are special-cased to death all
the way through the vfs. There are an uncounted number of hacks that
operate on the uapi posix acl struct instead of the dedicated vfs
struct posix_acl. And the vfs must be involved in order to interpret
and fixup posix acls before storing them to the backing store, caching
them, reporting them to userspace, or for permission checking.
Currently a range of hacks and duct tape exist to make this work. As
with most things this is really no ones fault it's just something that
happened over time. But the code is hard to understand and difficult
to maintain and one is constantly at risk of introducing bugs and
regressions when having to touch it.
Instead of continuing to hack posix acls through the xattr handlers
this series builds a dedicated posix acl api solely around the get and
set inode operations.
Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
helpers must be used in order to interact with posix acls. They
operate directly on the vfs internal struct posix_acl instead of
abusing the uapi posix acl struct as we currently do. In the end this
removes all of the hackiness, makes the codepaths easier to maintain,
and gets us type safety.
This series passes the LTP and xfstests suites without any
regressions. For xfstests the following combinations were tested:
- xfs
- ext4
- btrfs
- overlayfs
- overlayfs on top of idmapped mounts
- orangefs
- (limited) cifs
There's more simplifications for posix acls that we can make in the
future if the basic api has made it.
A few implementation details:
- The series makes sure to retain exactly the same security and
integrity module permission checks. Especially for the integrity
modules this api is a win because right now they convert the uapi
posix acl struct passed to them via a void pointer into the vfs
struct posix_acl format to perform permission checking on the mode.
There's a new dedicated security hook for setting posix acls which
passes the vfs struct posix_acl not a void pointer. Basing checking
on the posix acl stored in the uapi format is really unreliable.
The vfs currently hacks around directly in the uapi struct storing
values that frankly the security and integrity modules can't
correctly interpret as evidenced by bugs we reported and fixed in
this area. It's not necessarily even their fault it's just that the
format we provide to them is sub optimal.
- Some filesystems like 9p and cifs need access to the dentry in
order to get and set posix acls which is why they either only
partially or not even at all implement get and set inode
operations. For example, cifs allows setxattr() and getxattr()
operations but doesn't allow permission checking based on posix
acls because it can't implement a get acl inode operation.
Thus, this patch series updates the set acl inode operation to take
a dentry instead of an inode argument. However, for the get acl
inode operation we can't do this as the old get acl method is
called in e.g., generic_permission() and inode_permission(). These
helpers in turn are called in various filesystem's permission inode
operation. So passing a dentry argument to the old get acl inode
operation would amount to passing a dentry to the permission inode
operation which we shouldn't and probably can't do.
So instead of extending the existing inode operation Christoph
suggested to add a new one. He also requested to ensure that the
get and set acl inode operation taking a dentry are consistently
named. So for this version the old get acl operation is renamed to
->get_inode_acl() and a new ->get_acl() inode operation taking a
dentry is added. With this we can give both 9p and cifs get and set
acl inode operations and in turn remove their complex custom posix
xattr handlers.
In the future I hope to get rid of the inode method duplication but
it isn't like we have never had this situation. Readdir is just one
example. And frankly, the overall gain in type safety and the more
pleasant api wise are simply too big of a benefit to not accept
this duplication for a while.
- We've done a full audit of every codepaths using variant of the
current generic xattr api to get and set posix acls and
surprisingly it isn't that many places. There's of course always a
chance that we might have missed some and if so I'm sure we'll find
them soon enough.
The crucial codepaths to be converted are obviously stacking
filesystems such as ecryptfs and overlayfs.
For a list of all callers currently using generic xattr api helpers
see [2] including comments whether they support posix acls or not.
- The old vfs generic posix acl infrastructure doesn't obey the
create and replace semantics promised on the setxattr(2) manpage.
This patch series doesn't address this. It really is something we
should revisit later though.
The patches are roughly organized as follows:
(1) Change existing set acl inode operation to take a dentry
argument (Intended to be a non-functional change)
(2) Rename existing get acl method (Intended to be a non-functional
change)
(3) Implement get and set acl inode operations for filesystems that
couldn't implement one before because of the missing dentry.
That's mostly 9p and cifs (Intended to be a non-functional
change)
(4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
and vfs_set_acl() including security and integrity hooks
(Intended to be a non-functional change)
(5) Implement get and set acl inode operations for stacking
filesystems (Intended to be a non-functional change)
(6) Switch posix acl handling in stacking filesystems to new posix
acl api now that all filesystems it can stack upon support it.
(7) Switch vfs to new posix acl api (semantical change)
(8) Remove all now unused helpers
(9) Additional regression fixes reported after we merged this into
linux-next
Thanks to Seth for a lot of good discussion around this and
encouragement and input from Christoph"
* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
posix_acl: Fix the type of sentinel in get_acl
orangefs: fix mode handling
ovl: call posix_acl_release() after error checking
evm: remove dead code in evm_inode_set_acl()
cifs: check whether acl is valid early
acl: make vfs_posix_acl_to_xattr() static
acl: remove a slew of now unused helpers
9p: use stub posix acl handlers
cifs: use stub posix acl handlers
ovl: use stub posix acl handlers
ecryptfs: use stub posix acl handlers
evm: remove evm_xattr_acl_change()
xattr: use posix acl api
ovl: use posix acl api
ovl: implement set acl method
ovl: implement get acl method
ecryptfs: implement set acl method
ecryptfs: implement get acl method
ksmbd: use vfs_remove_acl()
acl: add vfs_remove_acl()
...
|
||
|
|
75f4d9af8b |
iov_iter work; most of that is about getting rid of
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
|
||
|
|
e2ed78d5d9 |
linux-kselftest-kunit-next-6.2-rc1
This KUnit next update for Linux 6.2-rc1 consists of several enhancements,
fixes, clean-ups, documentation updates, improvements to logging and KTAP
compliance of KUnit test output:
- log numbers in decimal and hex
- parse KTAP compliant test output
- allow conditionally exposing static symbols to tests
when KUNIT is enabled
- make static symbols visible during kunit testing
- clean-ups to remove unused structure definition
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmOXnPYACgkQCwJExA0N
Qxwf9RAAwdBKxgPZuKZ40v69Jm8YhaO3vyKUkyYRH59/HQGFUHMA2f2ONez4krEX
iXPgBFQ+7pB63FdgQi2HSg2z/u3xY02AaGgZGXDuNJDmg2xYjNDfZ0GjN6tuavlN
Liz01DGZkjZoVVXM6oV2xT8woBg/0BbdkKNL1OBO9RBZFHzwDryRzfXmQb8cKlNr
S+tkeZTlCA/s7UW2LNj4VlTzn6wgni4Y9gSk4wbQmSGWn3OX3rHaqAb7GiZ/yPGb
1WjbMeE8FwyydLU40aOZZ8V6AJRiw5VGPJyFzWJyWZ21xOgN9Z95b+I36z8RXraA
i/wnazO/FJsrhzvKL83rQkrSW6bpmVY+jGvk+L6deFM6Ro/vEWHJ4DgyKsIdMiJy
gUM1Q69szptq+ZRHGrZWPlVONBkBXMOL+fePbCbGcMzlaEAS/zsFYW9IBKcvLzwP
uHzzMS/cMmSUq52ZIyl9jhHQFVSoErCpJwQjAaZBQpYXPmE7yLcZItxnCaSUQTay
bRwyps5ph5md0oJTTFJKZ4Zx5FJ2ItjbC4y9BIexb9gYRDdRq723ivDoVENZl/Zk
DFIV95AY+mSxadS5vFagwWwX0ZN0KFKxeM8Tw7VTimal/0Sbglqp+oflsuKFD6JQ
b5HUixYifKMbWxkH5xrUb8NdjmBj561TYa8U4N+j3oOiaPYu5Ss=
=UQNn
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
"Several enhancements, fixes, clean-ups, documentation updates,
improvements to logging and KTAP compliance of KUnit test output:
- log numbers in decimal and hex
- parse KTAP compliant test output
- allow conditionally exposing static symbols to tests when KUNIT is
enabled
- make static symbols visible during kunit testing
- clean-ups to remove unused structure definition"
* tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits)
Documentation: dev-tools: Clarify requirements for result description
apparmor: test: make static symbols visible during kunit testing
kunit: add macro to allow conditionally exposing static symbols to tests
kunit: tool: make parser preserve whitespace when printing test log
Documentation: kunit: Fix "How Do I Use This" / "Next Steps" sections
kunit: tool: don't include KTAP headers and the like in the test log
kunit: improve KTAP compliance of KUnit test output
kunit: tool: parse KTAP compliant test output
mm: slub: test: Use the kunit_get_current_test() function
kunit: Use the static key when retrieving the current test
kunit: Provide a static key to check if KUnit is actively running tests
kunit: tool: make --json do nothing if --raw_ouput is set
kunit: tool: tweak error message when no KTAP found
kunit: remove KUNIT_INIT_MEM_ASSERTION macro
Documentation: kunit: Remove redundant 'tips.rst' page
Documentation: KUnit: reword description of assertions
Documentation: KUnit: make usage.rst a superset of tips.rst, remove duplication
kunit: eliminate KUNIT_INIT_*_ASSERT_STRUCT macros
kunit: tool: remove redundant file.close() call in unit test
kunit: tool: unit tests all check parser errors, standardize formatting a bit
...
|
||
|
|
268325bda5 |
Random number generator updates for Linux 6.2-rc1.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
|
||
|
|
ca1443c7e7 |
Merge branch 'for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou: "Baoquan was nice enough to run some clean ups for percpu" * 'for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: mm/percpu: remove unused PERCPU_DYNAMIC_EARLY_SLOTS mm/percpu.c: remove the lcm code since block size is fixed at page size mm/percpu: replace the goto with break mm/percpu: add comment to state the empty populated pages accounting mm/percpu: Update the code comment when creating new chunk mm/percpu: use list_first_entry_or_null in pcpu_reclaim_populated() mm/percpu: remove unused pcpu_map_extend_chunks |
||
|
|
909c6475d5 |
mm: slub: test: Use the kunit_get_current_test() function
Use the newly-added function kunit_get_current_test() instead of accessing current->kunit_test directly. This function uses a static key to return more quickly when KUnit is enabled, but no tests are actively running. There should therefore be a negligible performance impact to enabling the slub KUnit tests. Other than the performance improvement, this should be a no-op. Cc: Oliver Glitta <glittao@gmail.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Gow <davidgow@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> |
||
|
|
7d62159919 |
hyperv-next for v6.2
-----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmORzR4THHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXkqCCACFwHz04iepLE7R8ZZ6BVUhD6uzfzDo s1j7ozOUGUe3vI6q0DElHWVQZgzIzLypVsfWkZToe6jeOU6R48b0tZSFyJCUNwGM ogmS7N8fBdHfY9SBFoUPoziBifXpf3kq4hhX/w+1Lge9CN5Ywc4KjuJb91EAInbs lm47O4KQY8w8A7BbPBHYBueUVWLvgwPRPOS032zqxN1787m2tCxpqkfnImK39kh6 IsBBIZfYsok0H5wldhZXnsARpEOeFF6BoFBXpFPlmnbv2VcK2AfZgTYdA3ESyEgd NyOFDfh6BO07gTR1xCH6gvOpkHwx6xKAkjE36RymdhXS6fhRCRsfahVB =m78g -----END PGP SIGNATURE----- Merge tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Drop unregister syscore from hyperv_cleanup to avoid hang (Gaurav Kohli) - Clean up panic path for Hyper-V framebuffer (Guilherme G. Piccoli) - Allow IRQ remapping to work without x2apic (Nuno Das Neves) - Fix comments (Olaf Hering) - Expand hv_vp_assist_page definition (Saurabh Sengar) - Improvement to page reporting (Shradha Gupta) - Make sure TSC clocksource works when Linux runs as the root partition (Stanislav Kinsburskiy) * tag 'hyperv-next-signed-20221208' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Remove unregister syscore call from Hyper-V cleanup iommu/hyper-v: Allow hyperv irq remapping without x2apic clocksource: hyper-v: Add TSC page support for root partition clocksource: hyper-v: Use TSC PFN getter to map vvar page clocksource: hyper-v: Introduce TSC PFN getter clocksource: hyper-v: Introduce a pointer to TSC page x86/hyperv: Expand definition of struct hv_vp_assist_page PCI: hv: update comment in x86 specific hv_arch_irq_unmask hv: fix comment typo in vmbus_channel/low_latency drivers: hv, hyperv_fb: Untangle and refactor Hyper-V panic notifiers video: hyperv_fb: Avoid taking busy spinlock on panic path hv_balloon: Add support for configurable order free page reporting mm/page_reporting: Add checks for page_reporting_order param |
||
|
|
893660b0e1 |
slab updates for 6.2-rc1
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmOTQvYACgkQ4CHKc/GJ
qRCaqQf/UjCDmj1vYKcsTzp5L4MDXdQPA7dKtytbnZtROtClVNUzB0jODsfeMI7C
SwbDJRoUU1y99GRFYIx9oGji1q7TYOWS/PsZxOGkv8ILommmQ1kJdZdxt9rOqYNg
3mjCZoQmZMIRipLDrN55C096Mi+mI89kkE4Lkyrigpmxvc0KyX6QBerr+VmaBMHw
DjmFC6Gj+ZH2AX6z7AzOF1gZ42gPBQUjWdHFRcY41dShOQZNl2FPT5ITAvotlJlH
9mj6woCqW936UOcpUl+Qqk7mekDJb1hqmYXV2VAlhprBi6Vcd9PU6GmPPb6w51bS
HkSNNYjkbuNxBXY13PUPcR0hEHv9zw==
=AlWx
-----END PGP SIGNATURE-----
Merge tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- SLOB deprecation and SLUB_TINY
The SLOB allocator adds maintenance burden and stands in the way of
API improvements [1]. Deprecate it by renaming the config option (to
make users notice) to CONFIG_SLOB_DEPRECATED with updated help text.
SLUB should be used instead as SLAB will be the next on the removal
list.
Based on reports from a riscv k210 board with 8MB RAM, add a
CONFIG_SLUB_TINY option to minimize SLUB's memory usage at the
expense of scalability. This has resolved the k210 regression [2] so
in case there are no others (that wouldn't be resolvable by further
tweaks to SLUB_TINY) plan is to remove SLOB in a few cycles.
Existing defconfigs with CONFIG_SLOB are converted to
CONFIG_SLUB_TINY.
- kmalloc() slub_debug redzone improvements
A series from Feng Tang that builds on the tracking or requested size
for kmalloc() allocations (for caches with debugging enabled) added
in 6.1, to make redzone checks consider the requested size and not
the rounded up one, in order to catch more subtle buffer overruns.
Includes new slub_kunit test.
- struct slab fields reordering to accomodate larger rcu_head
RCU folks would like to grow rcu_head with debugging options, which
breaks current struct slab layout's assumptions, so reorganize it to
make this possible.
- Miscellaneous improvements/fixes:
- __alloc_size checking compiler workaround (Kees Cook)
- Optimize and cleanup SLUB's sysfs init (Rasmus Villemoes)
- Make SLAB compatible with PROVE_RAW_LOCK_NESTING (Jiri Kosina)
- Correct SLUB's percpu allocation estimates (Baoquan He)
- Re-enableS LUB's run-time failslab sysfs control (Alexander Atanasov)
- Make tools/vm/slabinfo more user friendly when not run as root (Rong Tao)
- Dead code removal in SLUB (Hyeonggon Yoo)
* tag 'slab-for-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (31 commits)
mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED
mm, slub: don't aggressively inline with CONFIG_SLUB_TINY
mm, slub: remove percpu slabs with CONFIG_SLUB_TINY
mm, slub: split out allocations from pre/post hooks
mm/slub, kunit: Add a test case for kmalloc redzone check
mm/slub, kunit: add SLAB_SKIP_KFENCE flag for cache creation
mm, slub: refactor free debug processing
mm, slab: ignore SLAB_RECLAIM_ACCOUNT with CONFIG_SLUB_TINY
mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY
mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY
mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY
mm, slub: disable SYSFS support with CONFIG_SLUB_TINY
mm, slub: add CONFIG_SLUB_TINY
mm, slab: ignore hardened usercopy parameters when disabled
slab: Remove special-casing of const 0 size allocations
slab: Clean up SLOB vs kmalloc() definition
mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
mm/migrate: make isolate_movable_page() skip slab pages
mm/slab: move and adjust kernel-doc for kmem_cache_alloc
mm/slub, percpu: correct the calculation of early percpu allocation size
...
|
||
|
|
4cee37b3a4 |
9 hotfixes. 6 for MM, 3 for other areas. Four of these patches address
post-6.0 issues. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5Ur2AAKCRDdBJ7gKXxA jsGmAQDWSq6z9fVgk30XpMr/X7t5c6NTPw5GocVpdwG8iqch3gEAjEs5/Kcd/mx4 d1dLaJFu1u3syessp8nJrNr1HANIog8= =L8zu -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Six for MM, three for other areas. Four of these patches address post-6.0 issues" * tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: memcg: fix possible use-after-free in memcg_write_event_control() MAINTAINERS: update Muchun Song's email mm/gup: fix gup_pud_range() for dax mmap: fix do_brk_flags() modifying obviously incorrect VMAs mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit tmpfs: fix data loss from failed fallocate kselftests: cgroup: update kmem test precision tolerance mm: do not BUG_ON missing brk mapping, because userspace can unmap it mailmap: update Matti Vaittinen's email address |
||
|
|
4a7ba45b1a |
memcg: fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to |
||
|
|
fcd0ccd836 |
mm/gup: fix gup_pud_range() for dax
For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit |
||
|
|
6c28ca6485 |
mmap: fix do_brk_flags() modifying obviously incorrect VMAs
Add more sanity checks to the VMA that do_brk_flags() will expand. Ensure
the VMA matches basic merge requirements within the function before
calling can_vma_merge_after().
Drop the duplicate checks from vm_brk_flags() since they will be enforced
later.
The old code would expand file VMAs on brk(), which is functionally
wrong and also dangerous in terms of locking because the brk() path
isn't designed for file VMAs and therefore doesn't lock the file
mapping. Checking can_vma_merge_after() ensures that new anonymous
VMAs can't be merged into file VMAs.
See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com
Fixes:
|
||
|
|
44bcabd70c |
tmpfs: fix data loss from failed fallocate
Fix tmpfs data loss when the fallocate system call is interrupted by a
signal, or fails for some other reason. The partial folio handling in
shmem_undo_range() forgot to consider this unfalloc case, and was liable
to erase or truncate out data which had already been committed earlier.
It turns out that none of the partial folio handling there is appropriate
for the unfalloc case, which just wants to proceed to removal of whole
folios: which find_get_entries() provides, even when partially covered.
Original patch by Rui Wang.
Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/
Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com
Fixes:
|
||
|
|
f5ad508340 |
mm: do not BUG_ON missing brk mapping, because userspace can unmap it
The following program will trigger the BUG_ON that this patch removes,
because the user can munmap() mm->brk:
#include <sys/syscall.h>
#include <sys/mman.h>
#include <assert.h>
#include <unistd.h>
static void *brk_now(void)
{
return (void *)syscall(SYS_brk, 0);
}
static void brk_set(void *b)
{
assert(syscall(SYS_brk, b) != -1);
}
int main(int argc, char *argv[])
{
void *b = brk_now();
brk_set(b + 4096);
assert(munmap(b - 4096, 4096 * 2) == 0);
brk_set(b);
return 0;
}
Compile that with musl, since glibc actually uses brk(), and then
execute it, and it'll hit this splat:
kernel BUG at mm/mmap.c:229!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 12 PID: 1379 Comm: a.out Tainted: G S U 6.1.0-rc7+ #419
RIP: 0010:__do_sys_brk+0x2fc/0x340
Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20>
RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000
RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08
RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000
R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000
R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000
FS: 0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x400678
Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b>
RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c
RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678
RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000
RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000
R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978
R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Instead, just return the old brk value if the original mapping has been
removed.
[akpm@linux-foundation.org: fix changelog, per Liam]
Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com
Fixes:
|
||
|
|
e26355e215 |
mm: export buffer_migrate_folio_norefs()
Ext4 needs this function to allow safe migration for journalled data pages. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221207112722.22220-11-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
||
|
|
fbf8321238 |
memcg: Fix possible use-after-free in memcg_write_event_control()
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to |
||
|
|
0ba09b1733 |
Revert "mm: align larger anonymous mappings on THP boundaries"
This reverts commit
|
||
|
|
bdaa78c6aa |
15 hotfixes. 11 marked cc:stable. Only three or four of the latter
address post-6.0 issues, which is hopefully a sign that things are converging. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4pQpQAKCRDdBJ7gKXxA jquxAP9Lqif7CGDgdq8uWY2hHS/Ujc3k7Ohgyzs37olnCuU8KwEA6/J7SpjsBgtY OfzvnwxpCTh8Kfzu/oNckIHo/EEiIA8= =o6qT -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes, 11 marked cc:stable. Only three or four of the latter address post-6.0 issues, which is hopefully a sign that things are converging" * tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths mm/khugepaged: fix GUP-fast interaction by sending IPI mm/khugepaged: take the right locks for page table retraction mm: migrate: fix THP's mapcount on isolation mm: introduce arch_has_hw_nonleaf_pmd_young() mm: add dummy pmd_young() for architectures not having it mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing madvise: use zap_page_range_single for madvise dontneed mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE |
||
|
|
dc19745ad0 |
Merge branch 'slub-tiny-v1r6' into slab/for-next
Merge my series [1] to deprecate the SLOB allocator. - Renames CONFIG_SLOB to CONFIG_SLOB_DEPRECATED with deprecation notice. - The recommended replacement is CONFIG_SLUB, optionally with the new CONFIG_SLUB_TINY tweaks for systems with 16MB or less RAM. - Use cases that stopped working with CONFIG_SLUB_TINY instead of SLOB should be reported to linux-mm@kvack.org and slab maintainers, otherwise SLOB will be removed in few cycles. [1] https://lore.kernel.org/all/20221121171202.22080-1-vbabka@suse.cz/ |
||
|
|
6176665213 |
Merge branch 'slab/for-6.2/kmalloc_redzone' into slab/for-next
Add a new slub_kunit test for the extended kmalloc redzone check, by Feng Tang. Also prevent unwanted kfence interaction with all slub kunit tests. |
||
|
|
149b6fa228 |
mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED
As explained in [1], we would like to remove SLOB if possible. - There are no known users that need its somewhat lower memory footprint so much that they cannot handle SLUB (after some modifications by the previous patches) instead. - It is an extra maintenance burden, and a number of features are incompatible with it. - It blocks the API improvement of allowing kfree() on objects allocated via kmem_cache_alloc(). As the first step, rename the CONFIG_SLOB option in the slab allocator configuration choice to CONFIG_SLOB_DEPRECATED. Add CONFIG_SLOB depending on CONFIG_SLOB_DEPRECATED as an internal option to avoid code churn. This will cause existing .config files and defconfigs with CONFIG_SLOB=y to silently switch to the default (and recommended replacement) SLUB, while still allowing SLOB to be configured by anyone that notices and needs it. But those should contact the slab maintainers and linux-mm@kvack.org as explained in the updated help. With no valid objections, the plan is to update the existing defconfigs to SLUB and remove SLOB in a few cycles. To make SLUB more suitable replacement for SLOB, a CONFIG_SLUB_TINY option was introduced to limit SLUB's memory overhead. There is a number of defconfigs specifying CONFIG_SLOB=y. As part of this patch, update them to select CONFIG_SLUB and CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/b35c3f82-f67b-2103-7d82-7a7ba7521439@suse.cz/ Cc: Russell King <linux@armlinux.org.uk> Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: Janusz Krzysztofik <jmkrzyszt@gmail.com> Cc: Tony Lindgren <tony@atomide.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Conor Dooley <conor@kernel.org> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Aaro Koskinen <aaro.koskinen@iki.fi> # OMAP1 Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> # riscv k210 Acked-by: Arnd Bergmann <arnd@arndb.de> # arm Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
be784ba861 |
mm, slub: don't aggressively inline with CONFIG_SLUB_TINY
SLUB fastpaths use __always_inline to avoid function calls. With CONFIG_SLUB_TINY we would rather save the memory. Add a __fastpath_inline macro that's __always_inline normally but empty with CONFIG_SLUB_TINY. bloat-o-meter results on x86_64 mm/slub.o: add/remove: 3/1 grow/shrink: 1/8 up/down: 865/-1784 (-919) Function old new delta kmem_cache_free 20 281 +261 slab_alloc_node.isra - 245 +245 slab_free.constprop.isra - 231 +231 __kmem_cache_alloc_lru.isra - 128 +128 __kmem_cache_release 88 83 -5 __kmem_cache_create 1446 1436 -10 __kmem_cache_free 271 142 -129 kmem_cache_alloc_node 330 127 -203 kmem_cache_free_bulk.part 826 613 -213 __kmem_cache_alloc_node 230 10 -220 kmem_cache_alloc_lru 325 12 -313 kmem_cache_alloc 325 10 -315 kmem_cache_free.part 376 - -376 Total: Before=26103, After=25184, chg -3.52% Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> |
||
|
|
0af8489b02 |
mm, slub: remove percpu slabs with CONFIG_SLUB_TINY
SLUB gets most of its scalability by percpu slabs. However for
CONFIG_SLUB_TINY the goal is minimal memory overhead, not scalability.
Thus, #ifdef out the whole kmem_cache_cpu percpu structure and
associated code. Additionally to the slab page savings, this reduces
percpu allocator usage, and code size.
This change builds on recent commit
|
||
|
|
56d5a2b9ba |
mm, slub: split out allocations from pre/post hooks
In the following patch we want to introduce CONFIG_SLUB_TINY allocation paths that don't use the percpu slab. To prepare, refactor the allocation functions: Split out __slab_alloc_node() from slab_alloc_node() where the former does the actual allocation and the latter calls the pre/post hooks. Analogically, split out __kmem_cache_alloc_bulk() from kmem_cache_alloc_bulk(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> |
||
|
|
6cd6d33ca4 |
mm/slub, kunit: Add a test case for kmalloc redzone check
kmalloc redzone check for slub has been merged, and it's better to add
a kunit case for it, which is inspired by a real-world case as described
in commit
|
||
|
|
f268f6cf87 |
mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit |
||
|
|
2ba99c5e08 |
mm/khugepaged: fix GUP-fast interaction by sending IPI
Since commit |
||
|
|
8d3c106e19 |
mm/khugepaged: take the right locks for page table retraction
pagetable walks on address ranges mapped by VMAs can be done under the
mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the
VMA's address_space. Only one of these needs to be held, and it does not
need to be held in exclusive mode.
Under those circumstances, the rules for concurrent access to page table
entries are:
- Terminal page table entries (entries that don't point to another page
table) can be arbitrarily changed under the page table lock, with the
exception that they always need to be consistent for
hardware page table walks and lockless_pages_from_mm().
This includes that they can be changed into non-terminal entries.
- Non-terminal page table entries (which point to another page table)
can not be modified; readers are allowed to READ_ONCE() an entry, verify
that it is non-terminal, and then assume that its value will stay as-is.
Retracting a page table involves modifying a non-terminal entry, so
page-table-level locks are insufficient to protect against concurrent page
table traversal; it requires taking all the higher-level locks under which
it is possible to start a page walk in the relevant range in exclusive
mode.
The collapse_huge_page() path for anonymous THP already follows this rule,
but the shmem/file THP path was getting it wrong, making it possible for
concurrent rmap-based operations to cause corruption.
Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com
Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com
Fixes:
|
||
|
|
829ae0f81c |
mm: migrate: fix THP's mapcount on isolation
The issue is reported when removing memory through virtio_mem device. The
transparent huge page, experienced copy-on-write fault, is wrongly
regarded as pinned. The transparent huge page is escaped from being
isolated in isolate_migratepages_block(). The transparent huge page can't
be migrated and the corresponding memory block can't be put into offline
state.
Fix it by replacing page_mapcount() with total_mapcount(). With this, the
transparent huge page can be isolated and migrated, and the memory block
can be put into offline state. Besides, The page's refcount is increased
a bit earlier to avoid the page is released when the check is executed.
Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com
Fixes:
|
||
|
|
4aaf269c76 |
mm: introduce arch_has_hw_nonleaf_pmd_young()
When running as a Xen PV guests commit |
||
|
|
95bc35f9be |
mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes()
Commit |
||
|
|
04ada095dc |
hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
tables associated with the address range. For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final. However,
__unmap_hugepage_range_final assumes the passed vma is about to be removed
and deletes the vma_lock to prevent pmd sharing as the vma is on the way
out. In the case of madvise(MADV_DONTNEED) the vma remains, but the
missing vma_lock prevents pmd sharing and could potentially lead to issues
with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing. Subsequent faults on this vma were confused as
VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
not set in new pages added to the page table. This resulted in pages that
appeared anonymous in a VM_SHARED vma and triggered the BUG.
Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
call from unmap_vmas(). This is used to indicate the 'final' unmapping of
a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and
the vm_lock is not deleted.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
Fixes:
|
||
|
|
21b85b0952 |
madvise: use zap_page_range_single for madvise dontneed
This series addresses the issue first reported in [1], and fully described
in patch 2. Patches 1 and 2 address the user visible issue and are tagged
for stable backports.
While exploring solutions to this issue, related problems with mmu
notification calls were discovered. This is addressed in the patch
"hugetlb: remove duplicate mmu notifications:". Since there are no user
visible effects, this third is not tagged for stable backports.
Previous discussions suggested further cleanup by removing the
routine zap_page_range. This is possible because zap_page_range_single
is now exported, and all callers of zap_page_range pass ranges entirely
within a single vma. This work will be done in a later patch so as not
to distract from this bug fix.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
This patch (of 2):
Expose the routine zap_page_range_single to zap a range within a single
vma. The madvise routine madvise_dontneed_single_vma can use this routine
as it explicitly operates on a single vma. Also, update the mmu
notification range in zap_page_range_single to take hugetlb pmd sharing
into account. This is required as MADV_DONTNEED supports hugetlb vmas.
Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
Fixes:
|
||
|
|
aebb02ce8b |
mm/page_reporting: Add checks for page_reporting_order param
Current code allows the page_reporting_order parameter to be changed via sysfs to any integer value. The new value is used immediately in page reporting code with no validation, which could cause incorrect behavior. Fix this by adding validation of the new value. Export this parameter for use in the driver that is calling the page_reporting_register(). This is needed by drivers like hv_balloon to know the order of the pages reported. Traditionally the values provided in the kernel boot line or subsequently changed via sysfs take priority therefore, if page_reporting_order parameter's value is set, it takes precedence over the value passed while registering with the driver. Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/1664517699-1085-2-git-send-email-shradhagupta@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> |
||
|
|
fa9b88e459 |
mm, slub: refactor free debug processing
Since commit
|
||
|
|
2f7c1c1396 |
mm, slub: don't create kmalloc-rcl caches with CONFIG_SLUB_TINY
Distinguishing kmalloc(__GFP_RECLAIMABLE) can help against fragmentation by grouping pages by mobility, but on tiny systems the extra memory overhead of separate set of kmalloc-rcl caches will probably be worse, and mobility grouping likely disabled anyway. Thus with CONFIG_SLUB_TINY, don't create kmalloc-rcl caches and use the regular ones. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
90ce872c22 |
mm, slub: lower the default slub_max_order with CONFIG_SLUB_TINY
With CONFIG_SLUB_TINY we want to minimize memory overhead. By lowering the default slub_max_order we can make slab allocations use smaller pages. However depending on object sizes, order-0 might not be the best due to increased fragmentation. When testing on a 8MB RAM k210 system by Damien Le Moal [1], slub_max_order=1 had the best results, so use that as the default for CONFIG_SLUB_TINY. [1] https://lore.kernel.org/all/6a1883c4-4c3f-545a-90e8-2cd805bcf4ae@opensource.wdc.com/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
5a8a3c1f73 |
mm, slub: retain no free slabs on partial list with CONFIG_SLUB_TINY
SLUB will leave a number of slabs on the partial list even if they are empty, to avoid some slab freeing and reallocation. The goal of CONFIG_SLUB_TINY is to minimize memory overhead, so set the limits to 0 for immediate slab page freeing. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
b1a413a39a |
mm, slub: disable SYSFS support with CONFIG_SLUB_TINY
Currently SLUB enables its sysfs support depending unconditionally on the general CONFIG_SYSFS setting. To reduce the configuration combination space, make CONFIG_SLUB_TINY disable SLUB's sysfs support by reusing the existing SLAB_SUPPORTS_SYSFS define. It is unlikely that real tiny systems would combine CONFIG_SLUB_TINY with CONFIG_SYSFS, but a randconfig might. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
e240e53ae0 |
mm, slub: add CONFIG_SLUB_TINY
For tiny systems that have used SLOB until now, SLUB might be impractical due to its higher memory usage. To help with that, introduce an option CONFIG_SLUB_TINY that modifies SLUB to use less memory. This is done by sacrificing scalability, security and debugging features, therefore not recommended for any system with more than 16MB RAM. This commit introduces the option and uses it to set other related options in a way that reduces memory usage. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
346907ceb9 |
mm, slab: ignore hardened usercopy parameters when disabled
With CONFIG_HARDENED_USERCOPY not enabled, there are no __check_heap_object() checks happening that would use the struct kmem_cache useroffset and usersize fields. Yet the fields are still initialized, preventing merging of otherwise compatible caches. Also the fields contribute to struct kmem_cache size unnecessarily when unused. Thus #ifdef them out completely when CONFIG_HARDENED_USERCOPY is disabled. In kmem_dump_obj() print object_size instead of usersize, as that's actually the intention. In a quick virtme boot test, this has reduced the number of caches in /proc/slabinfo from 131 to 111. Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> |
||
|
|
0b1dcc2cf5 |
24 hotfixes. 8 marked cc:stable and 16 for post-6.0 issues.
There have been a lot of hotfixes this cycle, and this is quite a large
batch given how far we are into the -rc cycle. Presumably a reflection of
the unusually large amount of MM material which went into 6.1-rc1.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY4Bd6gAKCRDdBJ7gKXxA
jvX6AQCsG1ld24kMpdD+70XXUyC29g/6/jribgtZApHyDYjxSwD/WmLNpPlUPRax
WB071Y5w65vjSTUKvwU0OLGbHwyxgAw=
=swD5
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0
issues.
There have been a lot of hotfixes this cycle, and this is quite a
large batch given how far we are into the -rc cycle. Presumably a
reflection of the unusually large amount of MM material which went
into 6.1-rc1"
* tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
test_kprobes: fix implicit declaration error of test_kprobes
nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
swapfile: fix soft lockup in scan_swap_map_slots
hugetlb: fix __prep_compound_gigantic_page page flag setting
kfence: fix stack trace pruning
proc/meminfo: fix spacing in SecPageTables
mm: multi-gen LRU: retry folios written back while isolated
mailmap: update email address for Satya Priya
mm/migrate_device: return number of migrating pages in args->cpages
kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible
MAINTAINERS: update Alex Hung's email address
mailmap: update Alex Hung's email address
mm: mmap: fix documentation for vma_mas_szero
mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed
mm/memory: return vm_fault_t result from migrate_to_ram() callback
mm: correctly charge compressed memory to its memcg
ipc/shm: call underlying open/close vm_ops
gcov: clang: fix the buffer overflow issue
...
|
||
|
|
de4eda9de2 |
use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
||
|
|
81a70c21d9 |
mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1
balance_dirty_pages doesn't do the required dirty throttling on cgroupv1.
See commit
|
||
|
|
ea4452de2a |
mm: fix unexpected changes to {failslab|fail_page_alloc}.attr
When we specify __GFP_NOWARN, we only expect that no warnings will be
issued for current caller. But in the __should_failslab() and
__should_fail_alloc_page(), the local GFP flags alter the global
{failslab|fail_page_alloc}.attr, which is persistent and shared by all
tasks. This is not what we expected, let's fix it.
[akpm@linux-foundation.org: unexport should_fail_ex()]
Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com
Fixes:
|
||
|
|
de1ccfb648 |
swapfile: fix soft lockup in scan_swap_map_slots
A softlockup occurs in scan free swap slot under huge memory pressure.
The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the
disksize of each zram device is 50MB.
LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but
the real loop number would more than LATENCY_LIMIT because of "goto checks
and goto scan" repeatly without decreasing latency limit.
In order to fix it, decrease latency_ration in advance.
There is also a suspicious place that will cause softlockups in
get_swap_pages(). In this function, the "goto start_over" may result in
continuous scanning of the swap partition. If there is no cond_sched in
scan_swap_map_slots(), it would cause a softlockup (I am not sure about
this).
WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466]
CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G
dump backtrace+0x0/0x1le4
show stack+0x20/@x2c
dump_stack+0xd8/0x140
watchdog print_info+0x48/0x54
watchdog_process_before_softlockup+0x98/0xa0
watchdog_timer_fn+0xlac/0x2d0
hrtimer_rum_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handLe_percpu_devid_irq+0x90/0x1f4
handle domain irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
e11 ira+0xhB/Bx140
scan_swap_map_slots+0x678/0x890
get_swap_pages+0x29c/0x440
get_swap_page+0x120/0x2e0
add_to_swap+UX2U/0XyC
shrink_page_list+0x5d0/0x152c
shrink_inactive_list+0xl6c/Bx500
shrink_lruvec+0x270/0x304
WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915]
watchdog_timer_fn+0x1ac/0x2d0
__run_hrtimer+0x98/0x2a0
__hrtimer_run_queues+0xb0/0x130
hrtimer_interrupt+0x13c/0x3c0
arch_timer_handler_virt+0x3c/0x50
handle_percpu_devid_irq+0x90/0x1f4
__handle_domain_irq+0x84/0x100
gic_handle_irq+0x88/0x2b0
el1_irq+0xb8/0x140
get_swap_pages+0x1e8/0x440
get_swap_page+0x1c8/0x2e0
add_to_swap+0x20/0x9c
shrink_page_list+0x5d0/0x152c
reclaim_pages+0x160/0x310
madvise_cold_or_pageout_pte_range+0x7bc/0xe3c
walk_pmd_range.isra.0+0xac/0x22c
walk_pud_range+0xfc/0x1c0
walk_pgd_range+0x158/0x1b0
__walk_page_range+0x64/0x100
walk_page_range+0x104/0x150
Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com
Fixes:
|
||
|
|
7fb0728a9b |
hugetlb: fix __prep_compound_gigantic_page page flag setting
Commit |