mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-12-24 20:18:56 +00:00
a5492fe27f
125 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6e17c6de3d |
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing. - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability. - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning. - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface. - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree. - Johannes Weiner has done some cleanup work on the compaction code. - David Hildenbrand has contributed additional selftests for get_user_pages(). - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code. - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code. - Huang Ying has done some maintenance on the swap code's usage of device refcounting. - Christoph Hellwig has some cleanups for the filemap/directio code. - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses. - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings. - John Hubbard has a series of fixes to the MM selftesting code. - ZhangPeng continues the folio conversion campaign. - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock. - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8. - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management. - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code. - Vishal Moola also has done some folio conversion work. - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh J0qXCzqaaN8+BuEyLGDVPaXur9KirwY= =B7yQ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ... |
||
|
|
6e2332e0ab |
cgroup: Changes for v6.5
* Whenever cpuset needs to rebuild sched_domain, it walked all tasks looking for DEADLINE tasks as they need to be accounted on the new domain. Walking all tasks can be expensive and there may not be any DEADLINE tasks at all. Task iteration is now omitted if there are no DEADLINE tasks. * Fixes DEADLINE bandwidth misaccounting after task migration failures. * When no controller is enabled, -Wstringop-overflow warning is triggered. The fix patch added an early exit which is too eager and got reverted for now. Will fix later. * Everything else are minor cleanups. -----BEGIN PGP SIGNATURE----- iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZJoRHw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGZatAQCKTv8pb5HEgochph4n26laSdVZs6ce3Y+s7V1T rum+3QD/TyJFmCkZSMscolZGFuafpg41sjPbmc4SexeuAMYCMgY= =nioD -----END PGP SIGNATURE----- Merge tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Whenever cpuset needs to rebuild sched_domain, it walked all tasks looking for DEADLINE tasks as they need to be accounted on the new domain. Walking all tasks can be expensive and there may not be any DEADLINE tasks at all. Task iteration is now omitted if there are no DEADLINE tasks - Fixes DEADLINE bandwidth misaccounting after task migration failures - When no controller is enabled, -Wstringop-overflow warning is triggered. The fix patch added an early exit which is too eager and got reverted for now. Will fix later - Everything else is minor cleanups * tag 'cgroup-for-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: Revert "cgroup: Avoid -Wstringop-overflow warnings" cgroup/misc: Expose misc.current on cgroup v2 root cgroup: Avoid -Wstringop-overflow warnings cgroup: remove obsolete comment on cgroup_on_dfl() cgroup: remove unused task_cgroup_path() cgroup/cpuset: remove unneeded header files cgroup: make cgroup_is_threaded() and cgroup_is_thread_root() static rdmacg: fix kernel-doc warnings in rdmacg cgroup: Replace the css_set call with cgroup_get cgroup: remove unused macro for_each_e_css() cgroup: Update out-of-date comment in cgroup_migrate() cgroup: Replace all non-returning strlcpy with strscpy cgroup/cpuset: remove unneeded header files cgroup/cpuset: Free DL BW in case can_attach() fails sched/deadline: Create DL BW alloc, free & check overflow interface cgroup/cpuset: Iterate only if DEADLINE tasks are present sched/cpuset: Keep track of SCHED_DEADLINE task in cpusets sched/cpuset: Bring back cpuset_mutex cgroup/cpuset: Rename functions dealing with DEADLINE accounting |
||
|
|
a0433f8cae |
for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
1ADPteUOTA==
=zP4Y
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
|
||
|
|
e973dfe929 |
cgroup/misc: Expose misc.current on cgroup v2 root
Hello, This patch is to expose misc.current on cgroup v2 root for tracking how much of the resource has been consumed in total on the system. Most of the cloud infrastucture use cgroup to fetch the host information for scheduling purpose. Currently, the misc controller can be used by Intel TDX HKIDs and AMD SEV ASIDs, which are both used for creating encrypted VMs. Intel TDX and AMD SEV are mostly be used by the cloud providers for providing confidential VMs. In actual use of a server, these confidential VMs may be launched in different ways. For the cloud solution, there are kubvirt and coco (tracked by kubepods.slice); on host, they can be booted directly through qemu by end user (tracked by user.slice), etc. In this complex environment, when wanting to know how many resource is used in total it has to iterate through all existing slices to get the value of each misc.current and add them up to calculate the total number of consumed keys. So exposing misc.current to root cgroup tends to give much easier when calculates how much resource has been used in total, which helps to schedule and count resources for the cloud infrastucture. Signed-off-by: LeiZhou-97 <lei.zhou@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
e0e0b4126c |
mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit
|
||
|
|
ddf63516d8 |
blk-ioprio: Introduce promote-to-rt policy
Since commit
|
||
|
|
5647e53f78 |
cgroup: Documentation: Clarify usage of memory limits
The existing documentation refers to memory.high as the "main mechanism to control memory usage." This seems incorrect to me - memory.high can result in reclaim pressure which simply leads to stalls unless some external component observes and actions on it (e.g. systemd-oomd can be used for this purpose). While this is feasible, users are unaware of this interaction and are led to believe that memory.high alone is an effective mechanism for limiting memory. The documentation should recommend the use of memory.max as the effective way to enforce memory limits - it triggers reclaim and results in OOM kills by itself. Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
70756b49be |
It has been a moderately calm cycle for documentation; the significant
changes include: - Some significant additions to the memory-management documentation - Some improvements to navigation in the HTML-rendered docs - More Spanish and Chinese translations ...and the usual set of typo fixes and such. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmPzkQUPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YC0QH/09u10xV3N+RuveNE/tArVxKcQi7JZd/xugQ toSXygh64WY10lzwi7Ms1bHZzpPYB0fOrqTGNqNQuhrVTjQzaZB0BBJqm8lwt2w/ S/Z5wj+IicJTmQ7+0C2Hc/dcK5SCPfY3CgwqOUVdr3dEm1oU+4QaBy31fuIJJ0Hx NdbXBco8BZqJX9P67jwp9vbrFrSGBjPI0U4HNHVjrWlcBy8JT0aAnf0fyWFy3orA T86EzmEw8drA1mXsHa5pmVwuHDx2X+D+eRurG9llCBrlIG9EDSmnalY4BeGqR4LS oDrEH6M91I5+9iWoJ0rBheD8rPclXO2HpjXLApXzTjrORgEYZsM= =MCdX -----END PGP SIGNATURE----- Merge tag 'docs-6.3' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "It has been a moderately calm cycle for documentation; the significant changes include: - Some significant additions to the memory-management documentation - Some improvements to navigation in the HTML-rendered docs - More Spanish and Chinese translations ... and the usual set of typo fixes and such" * tag 'docs-6.3' of git://git.lwn.net/linux: (68 commits) Documentation/watchdog/hpwdt: Fix Format Documentation/watchdog/hpwdt: Fix Reference Documentation: core-api: padata: correct spelling docs/mm: Physical Memory: correct spelling in reference to CONFIG_PAGE_EXTENSION docs: Use HTML comments for the kernel-toc SPDX line docs: Add more information to the HTML sidebar Documentation: KVM: Update AMD memory encryption link printk: Document that CONFIG_BOOT_PRINTK_DELAY required for boot_delay= Documentation: userspace-api: correct spelling Documentation: sparc: correct spelling Documentation: driver-api: correct spelling Documentation: admin-guide: correct spelling docs: add workload-tracing document to admin-guide docs/admin-guide/mm: remove useless markup docs/mm: remove useless markup docs/mm: Physical Memory: remove useless markup docs/sp_SP: Add process magic-number translation docs: ftrace: always use canonical ftrace path Doc/damon: fix the data path error dma-buf: Add "dma-buf" to title of documentation ... |
||
|
|
1f2d9ffc7a |
Scheduler updates in this cycle are:
- Improve the scalability of the CFS bandwidth unthrottling logic
with large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with
the generic scheduler code. Add __cpuidle methods as noinstr to
objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
to query previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period,
to improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- ... Misc other cleanups, fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
skkQXoONNaM=
=l1nN
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Improve the scalability of the CFS bandwidth unthrottling logic with
large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with the
generic scheduler code. Add __cpuidle methods as noinstr to objtool's
noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period, to
improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- Misc other cleanups, fixes
* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
sched/rt: pick_next_rt_entity(): check list_entry
sched/deadline: Add more reschedule cases to prio_changed_dl()
sched/fair: sanitize vruntime of entity being placed
sched/fair: Remove capacity inversion detection
sched/fair: unlink misfit task from cpu overutilized
objtool: mem*() are not uaccess safe
cpuidle: Fix poll_idle() noinstr annotation
sched/clock: Make local_clock() noinstr
sched/clock/x86: Mark sched_clock() noinstr
x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
x86/atomics: Always inline arch_atomic64*()
cpuidle: tracing, preempt: Squash _rcuidle tracing
cpuidle: tracing: Warn about !rcu_is_watching()
cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
cpuidle: drivers: firmware: psci: Dont instrument suspend code
KVM: selftests: Fix build of rseq test
exit: Detect and fix irq disabled state in oops
cpuidle, arm64: Fix the ARM64 cpuidle logic
cpuidle: mvebu: Fix duplicate flags assignment
sched/fair: Limit sched slice duration
...
|
||
|
|
dbeb56fe80 |
Documentation: admin-guide: correct spelling
Correct spelling problems for Documentation/admin-guide/ as reported by codespell. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-media@vger.kernel.org Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230129231053.20863-2-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net> |
||
|
|
55ab834a86 |
Revert "mm: add nodes= arg to memory.reclaim"
This reverts commit |
||
|
|
acbee592f1 |
sched/documentation: Document the util clamp feature
Add a document explaining the util clamp feature: what it is and how to use it. The new document hopefully covers everything one needs to know about uclamp. Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20221216235716.201923-1-qyousef@layalina.io Cc: Jonathan Corbet <corbet@lwn.net> |
||
|
|
12a5d39552 |
mm: add nodes= arg to memory.reclaim
The nodes= arg instructs the kernel to only scan the given nodes for
proactive reclaim. For example use cases, consider a 2 tier memory
system:
nodes 0,1 -> top tier
nodes 2,3 -> second tier
$ echo "1m nodes=0" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory from node 0.
Since node 0 is a top tier node, demotion will be attempted first. This
is useful to direct proactive reclaim to specific nodes that are under
pressure.
$ echo "1m nodes=2,3" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory in the second
tier, since this tier of memory has no demotion targets the memory will be
reclaimed.
$ echo "1m nodes=0,1" > memory.reclaim
Instructs the kernel to reclaim memory from the top tier nodes, which can
be desirable according to the userspace policy if there is pressure on the
top tiers. Since these nodes have demotion targets, the kernel will
attempt demotion first.
Since commit
|
||
|
|
57e9cc50f4 |
mm: vmscan: split khugepaged stats from direct reclaim stats
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. However, khugepaged currently distorts the picture: as a kernel thread it doesn't impose allocation latencies on userspace, and it explicitly opts out of kswapd reclaim. Its activity showing up in the direct reclaim stats is misleading. Counting it as kswapd reclaim could also cause confusion when trying to understand actual kswapd behavior. Break out khugepaged from the direct reclaim counters into new pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters. Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS): pgsteal_kswapd 1342185 pgsteal_direct 0 pgsteal_khugepaged 3623 pgscan_kswapd 1345025 pgscan_direct 0 pgscan_khugepaged 3623 Link: https://lkml.kernel.org/r/20221026180133.377671-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Eric Bergen <ebergen@meta.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
bd9a3dba18 |
PSI updates for v6.1:
- Various performance optimizations, resulting in a 4%-9% speedup
in the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmNJKPsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iPmg//aovCitAQX2lLoHJDIgdQibU40oaEpKTX
wM549EGz3Dr6qmwF8+qT1U2Ge6af/hHQc5G/ZqDpKbuTjUIc3RmBkqX80dNKFLuH
uyi9UtfsSriw+ks8fWuDdjr+S4oppwW9ZoIXvK8v4bisd3F31DNGvKPTayNxt73m
lExfzJiD1oJixDxGX8MGO9QpcoywmjWjzjrB2P+J8hnTpArouHx/HOKdQOpG6wXq
ZRr9kZvju6ucDpXCTa1HJrfVRxNAh35tx/b4cDtXbBFifVAeKaPOrHapMTVsqfel
Z7T+2DymhidNYK0hrRJoGUwa/vkz+2Sm1ZLG9LlgUCXVco/9S1zw1ZuQakVvzPen
wriuxRaAkR+szCP0L8js5+/DAkGa43MjKsvQHmDVnetQtlsAD4eYnn+alQ837SXv
MP3jwFqF+e4mcWdoQcfh0OWUgGec5XZzdgRYrFkBKyTWGLB2iPivcAMNf0X/h82Q
xxv4DQJIIJ017GOQ/ho2saq+GbtFCvX8YnGYas9T47Bjjluhjo7jgTVtPTo+mhtN
RfwMdG718Ap/gvnAX7wMe/t+L/4AP8AIgDRi5L35dTRqETwOjH+LAvOYjleQFYgu
kMVtLMyzU+TGwHscuzPFRh7TnvSJ4sD48Ll1BPnyZsh3SS9u0gAs1bml7Cu7JbmW
SIZD/S/hzdI=
=91tB
-----END PGP SIGNATURE-----
Merge tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull PSI updates from Ingo Molnar:
- Various performance optimizations, resulting in a 4%-9% speedup in
the mmtests/config-scheduler-perfpipe micro-benchmark.
- New interface to turn PSI on/off on a per cgroup level.
* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
sched/psi: Cache parent psi_group to speed up group iteration
sched/psi: Consolidate cgroup_psi()
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
sched/psi: Remove NR_ONCPU task accounting
sched/psi: Optimize task switch inside shared cgroups again
sched/psi: Move private helpers to sched/stats.h
sched/psi: Save percpu memory when !psi_cgroups_enabled
sched/psi: Don't create cgroup PSI files when psi_disabled
sched/psi: Fix periodic aggregation shut off
|
||
|
|
adf4bfc4a9 |
cgroup changes for v6.1-rc1.
* cpuset now support isolated cpus.partition type, which will enable dynamic
CPU isolation.
* pids.peak added to remember the max number of pids used.
* Holes in cgroup namespace plugged.
* Internal cleanups.
Note that for-6.1-fixes was pulled into for-6.1 twice. Both were for
follow-up cleanups and each merge commit has details.
Also,
|
||
|
|
34f26a1561 |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit
|
||
|
|
52b1364ba0 |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload. When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status. Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com |
||
|
|
8cbfdc24fc |
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced "isolated" cpuset partition type as well as other changes made in other cpuset patches. Signed-off-by: Waiman Long <longman@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
ebc97a52b5 |
mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
We keep track of several kernel memory stats (total kernel memory, page
tables, stack, vmalloc, etc) on multiple levels (global, per-node,
per-memcg, etc). These stats give insights to users to how much memory
is used by the kernel and for what purposes.
Currently, memory used by KVM mmu is not accounted in any of those
kernel memory stats. This patch series accounts the memory pages
used by KVM for page tables in those stats in a new
NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account
for other types of secondary pages tables (e.g. iommu page tables).
KVM has a decent number of large allocations that aren't for page
tables, but for most of them, the number/size of those allocations
scales linearly with either the number of vCPUs or the amount of memory
assigned to the VM. KVM's secondary page table allocations do not scale
linearly, especially when nested virtualization is in use.
From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's
per-VM pages_{4k,2m,1g} stats unless the guest is doing something
bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is
forced to allocate a large number of page tables even though the guest
isn't accessing that much memory). However, someone would need to either
understand how KVM works to make that connection, or know (or be told) to
go look at KVM's stats if they're running VMs to better decipher the stats.
Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE
is informative. For example, when backing a VM with THP vs. HugeTLB,
NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order
of magnitude higher with THP. So having this stat will at the very least
prove to be useful for understanding tradeoffs between VM backing types,
and likely even steer folks towards potential optimizations.
The original discussion with more details about the rationale:
https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org
This stat will be used by subsequent patches to count KVM mmu
memory usage.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
||
|
|
6614a3c316 |
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
||
|
|
73b73bac90 |
mm: vmpressure: don't count proactive reclaim in vmpressure
memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()). For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature. For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path. Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure. [1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/ [yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c808f46323 |
cgroup: remove "no" prefixed mount options
|
||
|
|
6a010a49b6 |
cgroup: Make !percpu threadgroup_rwsem operations optional
|
||
|
|
30312730bd |
cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed variants so that they can be turned off too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> |
||
|
|
673520f8da |
mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of cgroup v2
There are already statistics of {pgscan,pgsteal}_kswapd and
{pgscan,pgsteal}_direct of memcg event here, but now only the sum of the
two is displayed in memory.stat of cgroup v2.
In order to obtain more accurate information during monitoring and
debugging, and to align with the display in /proc/vmstat, it better to
display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately.
Also, for forward compatibility, we still display pgscan and pgsteal items
so that it won't break existing applications.
[zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)]
Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com
[zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes]
Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
98931dd95f |
Yang Shi has improved the behaviour of khugepaged collapsing of readonly
file-backed transparent hugepages.
Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
enablement of the recent huge page vmemmap optimization feature.
Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
David Vernet has done some fixup work on the memcg selftests.
Peter Xu has taught userfaultfd to handle write protection faults against
shmem- and hugetlbfs-backed files.
More DAMON development from SeongJae Park - adding online tuning of the
feature and support for monitoring of fixed virtual address ranges. Also
easier discovery of which monitoring operations are available.
Nadav Amit has done some optimization of TLB flushing during mprotect().
Neil Brown continues to labor away at improving our swap-over-NFS support.
David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
Peng Liu fixed some errors in the core hugetlb code.
Joao Martins has reduced the amount of memory consumed by device-dax's
compound devmaps.
Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
Roman Gushchin has done more work on the memcg selftests.
And, of course, many smaller fixes and cleanups. Notably, the customary
million cleanup serieses from Miaohe Lin.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
=nFe6
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Almost all of MM here. A few things are still getting finished off,
reviewed, etc.
- Yang Shi has improved the behaviour of khugepaged collapsing of
readonly file-backed transparent hugepages.
- Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
- Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
runtime enablement of the recent huge page vmemmap optimization
feature.
- Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
- Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
- Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
- David Vernet has done some fixup work on the memcg selftests.
- Peter Xu has taught userfaultfd to handle write protection faults
against shmem- and hugetlbfs-backed files.
- More DAMON development from SeongJae Park - adding online tuning of
the feature and support for monitoring of fixed virtual address
ranges. Also easier discovery of which monitoring operations are
available.
- Nadav Amit has done some optimization of TLB flushing during
mprotect().
- Neil Brown continues to labor away at improving our swap-over-NFS
support.
- David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
- Peng Liu fixed some errors in the core hugetlb code.
- Joao Martins has reduced the amount of memory consumed by
device-dax's compound devmaps.
- Some cleanups of the arch-specific pagemap code from Anshuman
Khandual.
- Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
- Roman Gushchin has done more work on the memcg selftests.
... and, of course, many smaller fixes and cleanups. Notably, the
customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
mm: kfence: use PAGE_ALIGNED helper
selftests: vm: add the "settings" file with timeout variable
selftests: vm: add "test_hmm.sh" to TEST_FILES
selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
selftests: vm: add migration to the .gitignore
selftests/vm/pkeys: fix typo in comment
ksm: fix typo in comment
selftests: vm: add process_mrelease tests
Revert "mm/vmscan: never demote for memcg reclaim"
mm/kfence: print disabling or re-enabling message
include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
mm: fix a potential infinite loop in start_isolate_page_range()
MAINTAINERS: add Muchun as co-maintainer for HugeTLB
zram: fix Kconfig dependency warning
mm/shmem: fix shmem folio swapoff hang
cgroup: fix an error handling path in alloc_pagecache_max_30M()
mm: damon: use HPAGE_PMD_SIZE
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
nodemask.h: fix compilation error with GCC12
...
|
||
|
|
f4840ccfca |
zswap: memcg accounting
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this. The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depends on the size of the workload's warm set. The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zwap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice. [hannes@cmpxchg.org: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org [hannes@cmpxchg.org: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8e20d4b332 |
mm/memcontrol: export memcg->watermark via sysfs for v2 memcg
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data. This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups. Hence, expose memcg->watermark as "memory.peak" for v2 memcg. Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com Signed-off-by: Ganesan Rajagopal <rganesan@arista.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
94968384dd |
memcg: introduce per-memcg reclaim interface
This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch. This patch (of 4): Introduce a memcg interface to trigger memory reclaim on a memory cgroup. Use case: Proactive Reclaim --------------------------- A userspace proactive reclaimer can continuously probe the memcg to reclaim a small amount of memory. This gives more accurate and up-to-date workingset estimation as the LRUs are continuously sorted and can potentially provide more deterministic memory overcommit behavior. The memory overcommit controller can provide more proactive response to the changing behavior of the running applications instead of being reactive. A userspace reclaimer's purpose in this case is not a complete replacement for kswapd or direct reclaim, it is to proactively identify memory savings opportunities and reclaim some amount of cold pages set by the policy to free up the memory for more demanding jobs or scheduling new jobs. A user space proactive reclaimer is used in Google data centers. Additionally, Meta's TMO paper recently referenced a very similar interface used for user space proactive reclaim: https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 Benefits of a user space reclaimer: ----------------------------------- 1) More flexible on who should be charged for the cpu of the memory reclaim. For proactive reclaim, it makes more sense to be centralized. 2) More flexible on dedicating the resources (like cpu). The memory overcommit controller can balance the cost between the cpu usage and the memory reclaimed. 3) Provides a way to the applications to keep their LRUs sorted, so, under memory pressure better reclaim candidates are selected. This also gives more accurate and uptodate notion of working set for an application. Why memory.high is not enough? ------------------------------ - memory.high can be used to trigger reclaim in a memcg and can potentially be used for proactive reclaim. However there is a big downside in using memory.high. It can potentially introduce high reclaim stalls in the target application as the allocations from the processes or the threads of the application can hit the temporary memory.high limit. - Userspace proactive reclaimers usually use feedback loops to decide how much memory to proactively reclaim from a workload. The metrics used for this are usually either refaults or PSI, and these metrics will become messy if the application gets throttled by hitting the high limit. - memory.high is a stateful interface, if the userspace proactive reclaimer crashes for any reason while triggering reclaim it can leave the application in a bad state. - If a workload is rapidly expanding, setting memory.high to proactively reclaim memory can result in actually reclaiming more memory than intended. The benefits of such interface and shortcomings of existing interface were further discussed in this RFC thread: https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ Interface: ---------- Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to trigger reclaim in the target memory cgroup. The interface is introduced as a nested-keyed file to allow for future optional arguments to be easily added to configure the behavior of reclaim. Possible Extensions: -------------------- - This interface can be extended with an additional parameter or flags to allow specifying one or more types of memory to reclaim from (e.g. file, anon, ..). - The interface can also be extended with a node mask to reclaim from specific nodes. This has use cases for reclaim-based demotion in memory tiering systens. - A similar per-node interface can also be added to support proactive reclaim and reclaim-based demotion in systems without memcg. - Add a timeout parameter to make it easier for user space to call the interface without worrying about being blocked for an undefined amount of time. For now, let's keep things simple by adding the basic functionality. [yosryahmed@google.com: worked on versions v2 onwards, refreshed to current master, updated commit message based on recent discussions and use cases] Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Co-developed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Wei Xu <weixugc@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a477b94d65 |
Documentation: add missing angle bracket in cgroup-v2 doc
Trivial addition of missing closing angle backet. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/r/20220422164526.3464306-1-jsavitz@redhat.com Signed-off-by: Jonathan Corbet <corbet@lwn.net> |
||
|
|
a8c49af3be |
memcg: add per-memcg total kernel memory stat
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted in any of those stats, a few examples are: - various kvm allocations (e.g. allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets Keeping track of the total kernel memory is essential for the ease of migration from cgroup v1 to v2 as there are large discrepancies between v1's kmem.usage_in_bytes and the sum of the available kernel memory stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel allocations is an impractical maintenance burden as there a lot of those all over the kernel code, with more use cases likely to show up in the future. Therefore, add a "kernel" memcg stat that is analogous to kmem page counter, with added benefits such as using rstat infrastructure which aggregates stats more efficiently. Additionally, this provides a lighter alternative in case the legacy kmem is deprecated in the future [yosryahmed@google.com: v2] Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
f477619990 |
hugetlb: add hugetlb.*.numa_stat file
For hugetlb backed jobs/VMs it's critical to understand the numa information for the memory backing these jobs to deliver optimal performance. Currently this technically can be queried from /proc/self/numa_maps, but there are significant issues with that. Namely: 1. Memory can be mapped or unmapped. 2. numa_maps are per process and need to be aggregated across all processes in the cgroup. For shared memory this is more involved as the userspace needs to make sure it doesn't double count shared mappings. 3. I believe querying numa_maps needs to hold the mmap_lock which adds to the contention on this lock. For these reasons I propose simply adding hugetlb.*.numa_stat file, which shows the numa information of the cgroup similarly to memory.numa_stat. On cgroup-v2: cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 On cgroup-v1: cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 hierarichal_total=2097152 N0=2097152 N1=0 This patch was tested manually by allocating hugetlb memory and querying the hugetlb.*.numa_stat file of the cgroup and its parents. [colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"] Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com [keescook@chromium.org: fix copy/paste array assignment] Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Jue Wang <juew@google.com> Cc: Yang Yao <ygyao@google.com> Cc: Joanna Li <joannali@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
4e5aa1f4c2 |
memcg: add per-memcg vmalloc stat
The kvmalloc* allocation functions can fallback to vmalloc allocations and more often on long running machines. In addition the kernel does have __GFP_ACCOUNT kvmalloc* calls. So, often on long running machines, the memory.stat does not tell the complete picture which type of memory is charged to the memcg. So add a per-memcg vmalloc stat. [shakeelb@google.com: page_memcg() within rcu lock, per Muchun] Link: https://lkml.kernel.org/r/20211222052457.1960701-1-shakeelb@google.com [akpm@linux-foundation.org: remove cast, per Muchun] [shakeelb@google.com: remove area->page[0] checks and move to page by page accounting per Michal] Link: https://lkml.kernel.org/r/20220104222341.3972772-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20211221215336.1922823-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b6bf9abb0a |
mm/memcg: add oom_group_kill memory event
Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered: 1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container. 2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup. This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed. [schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a85373fe44 |
Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - The misc controller now reports allocation rejections through misc.events instead of printking - cgroup_mutex usage is reduced to improve scalability of some operations - vhost helper threads are now assigned to the right cgroup on cgroup2 - Bug fixes * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c cgroup: Fix rootcg cpu.stat guest double counting cgroup: no need for cgroup_mutex for /proc/cgroups cgroup: remove cgroup_mutex from cgroupstats_build cgroup: reduce dependency on cgroup_mutex cgroup: cgroup-v1: do not exclude cgrp_dfl_root cgroup: Make rebind_subsystems() disable v2 controllers all at once docs/cgroup: add entry for misc.events misc_cgroup: remove error log to avoid log flood misc_cgroup: introduce misc.events to count failures |
||
|
|
9a7e0a90a4 |
Scheduler updates:
- Revert the printk format based wchan() symbol resolution as it can leak
the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset and
__sched_setscheduler() introduced a new lock dependency which is now
triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
/1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
d2qWG7UvoseT+g==
=fgtS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
|
||
|
|
d73df887b6 |
sched/fair: Add document for burstable CFS bandwidth
Basic description of usage and effect for CFS Bandwidth Control Burst. Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210830032215.16302-3-changhuaixin@linux.alibaba.com |
||
|
|
4b53bb873f |
docs/cgroup: add entry for misc.events
Added descriptions of misc.events. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
c0002d11d7 |
cgroupv2, docs: fix misinformation in "device controller" section
Hotmail was rejected by the mailing list, switched to gmail to resend. 1. Clarify cgroup BPF program type and attach type; 2. Fix file path broken. Signed-off-by: ArthurChiao <arthurchiao@hotmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
22b1255792 |
docs/cgroup: remove some duplicate words
When I tried to add some new entries to cgroup-v2.rst, I found that the description of memory.events had some repetitive words, so I tried to delete them. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
ee9707e859 |
cgroup/cpuset: Enable memory migration for cpuset v2
When a user changes cpuset.cpus, each task in a v2 cpuset will be moved to one of the new cpus if it is not there already. For memory, however, they won't be migrated to the new nodes when cpuset.mems changes. This is an inconsistency in behavior. In cpuset v1, there is a memory_migrate control file to enable such behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default for cpuset v2 so that we have a consistent set of behavior for both cpus and memory. There is certainly a cost to make memory migration the default, but it is a one time cost that shouldn't really matter as long as cpuset.mems isn't changed frequenty. Update the cgroup-v2.rst file to document the new behavior and recommend against changing cpuset.mems frequently. Since there won't be any concurrent access to the newly allocated cpuset structure in cpuset_css_alloc(), we can use the cheaper non-atomic __set_bit() instead of the more expensive atomic set_bit(). Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
3dbdb38e28 |
Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: - cgroup.kill is added which implements atomic killing of the whole subtree. Down the line, this should be able to replace the multiple userland implementations of "keep killing till empty". - PSI can now be turned off at boot time to avoid overhead for configurations which don't care about PSI. * 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: make per-cgroup pressure stall tracking configurable cgroup: Fix kernel-doc cgroup: inline cgroup_task_freeze() tests/cgroup: test cgroup.kill tests/cgroup: move cg_wait_for(), cg_prepare_for_wait() tests/cgroup: use cgroup.kill in cg_killall() docs/cgroup: add entry for cgroup.kill cgroup: introduce cgroup.kill |
||
|
|
556910e392 |
block: Introduce the ioprio rq-qos policy
Introduce an rq-qos policy that assigns an I/O priority to requests based on blk-cgroup configuration settings. This policy has the following advantages over the ioprio_set() system call: - This policy is cgroup based so it has all the advantages of cgroups. - While ioprio_set() does not affect page cache writeback I/O, this rq-qos controller affects page cache writeback I/O for filesystems that support assiociating a cgroup with writeback I/O. See also Documentation/admin-guide/cgroup-v2.rst. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
340272b040 |
docs/cgroup: add entry for cgroup.kill
Give a brief overview of the cgroup.kill functionality. Link: https://lore.kernel.org/r/20210503143922.3093755-2-brauner@kernel.org Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: cgroups@vger.kernel.org Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
25259fc914 |
cgroup: Miscellaneous cgroup documentation.
Documentation of miscellaneous cgroup controller. This new controller is used to track and limit the usage of scalar resources. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Rientjes <rientjes@google.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
3fb6d0e00e |
A handful of late-arriving documentation fixes, nothing all that notable.
-----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmA5Qu8PHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5Yiz0H/jTF+JcYstvGINx7jLZH4j9Pa4b/IZ3RO5uR OpjkzhTNangk2pSS4nuoQGjDRz1miBRaY1yE923Wxk1T1Nk+DA6aYJbVTqpn962S Z5IyQWzMIHFTAhSle0GeuTBk9Qx46ONhBJH1qsHCraAUtsQrxSUoF95ZftKD54gz Eg+eFQscHen9on2ZlqypauZebVbAa3zq1JCyohK5URiXLXpNq7ASCcOZ6v1OJb76 thgxOQgb1/TQ+ZNEeRs8Bv5g6kcTlWhapIrnsYPrmCEYaj2ghvGbbSlWyAmJRPqT PH+ucFCyjZqGcPmM5zerhVI+scQOLAJigAQa/B6HhRfmCyI1kkE= =ZKHc -----END PGP SIGNATURE----- Merge tag 'docs-5.12-2' of git://git.lwn.net/linux Pull documentation fixes from Jonathan Corbet: "A handful of late-arriving documentation fixes, nothing all that notable" * tag 'docs-5.12-2' of git://git.lwn.net/linux: docs: proc.rst: fix indentation warning Documentation: cgroup-v2: fix path to example BPF program docs: powerpc: Fix tables in syscall64-abi.rst Documentation: features: refresh feature list Documentation: features: remove c6x references docs: ABI: testing: ima_policy: Fixed missing bracket Fix unaesthetic indentation scripts: kernel-doc: fix array element capture in pointer-to-func parsing doc: use KCFLAGS instead of EXTRA_CFLAGS to pass flags from command line Documentation: proc.rst: add more about the 6 fields in loadavg |
||
|
|
43c4f657c8 |
Documentation: cgroup-v2: fix path to example BPF program
This file has been moved into the "progs" subdirectory, together with all
test BPF programs.
Fixes:
|
||
|
|
b603894248 |
mm: memcg: add swapcache stat for memcg v2
This patch adds swapcache stat for the cgroup v2. The swapcache represents the memory that is accounted against both the memory and the swap limit of the cgroup. The main motivation behind exposing the swapcache stat is for enabling users to gracefully migrate from cgroup v1's memsw counter to cgroup v2's memory and swap counters. Cgroup v1's memsw limit allows users to limit the memory+swap usage of a workload but without control on the exact proportion of memory and swap. Cgroup v2 provides separate limits for memory and swap which enables more control on the exact usage of memory and swap individually for the workload. With some little subtleties, the v1's memsw limit can be switched with the sum of the v2's memory and swap limits. However the alternative for memsw usage is not yet available in cgroup v2. Exposing per-cgroup swapcache stat enables that alternative. Adding the memory usage and swap usage and subtracting the swapcache will approximate the memsw usage. This will help in the transparent migration of the workloads depending on memsw usage and limit to v2' memory and swap counters. The reasons these applications are still interested in this approximate memsw usage are: (1) these applications are not really interested in two separate memory and swap usage metrics. A single usage metric is more simple to use and reason about for them. (2) The memsw usage metric hides the underlying system's swap setup from the applications. Applications with multiple instances running in a datacenter with heterogeneous systems (some have swap and some don't) will keep seeing a consistent view of their usage. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lkml.kernel.org/r/20210108155813.2914586-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0e63a5c6ba |
It has been a relatively quiet cycle in docsland.
- As promised, the minimum Sphinx version to build the docs is now 1.7,
and we have dropped support for Python 2 entirely. That allowed the
removal of a bunch of compatibility code.
- A set of treewide warning fixups from Mauro that I applied after it
became clear nobody else was going to deal with them.
- The automarkup mechanism can now create cross-references from relative
paths to RST files.
- More translations, typo fixes, and warning fixes.
No conflicts with any other tree as far as I know.
-----BEGIN PGP SIGNATURE-----
iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmAq4EUPHGNvcmJldEBs
d24ubmV0AAoJEBdDWhNsDH5YTIAH/1I5MlVQwuvNKjwCAEdmltQgHv6SmXSpDkrp
SGuviWVXxqz8dTyo7C2R12qE/7nP8zGAmclNdX78ynl5qWaj05lQsjBgMYSoQO/F
+akyLQSL8/8SQrtDPPBcboPuIz9DzkX51kkQthvCf0puJi0ScKVHO9Sk9SKUgDoK
cnCE9VwpGL7YX/ee2wt91UYREijgJ9P7eQ6rqKvUZ5Itu9ikfu9vQU41GR9tOXDK
MQK+k38pWdl8wRgTgA0pkVhMf1G732bxTTicvFHXcyqmCkh7++m2+ysT8O+SBBMX
e5BbP0yysSqThjwFHOW5PWM1AWD5iVz+pnwJwEaJ4K76tJJOw9M=
=bcDk
-----END PGP SIGNATURE-----
Merge tag 'docs-5.12' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"It has been a relatively quiet cycle in docsland.
- As promised, the minimum Sphinx version to build the docs is now
1.7, and we have dropped support for Python 2 entirely. That
allowed the removal of a bunch of compatibility code.
- A set of treewide warning fixups from Mauro that I applied after it
became clear nobody else was going to deal with them.
- The automarkup mechanism can now create cross-references from
relative paths to RST files.
- More translations, typo fixes, and warning fixes"
* tag 'docs-5.12' of git://git.lwn.net/linux: (75 commits)
docs: kernel-hacking: be more civil
docs: Remove the Microsoft rhetoric
Documentation/admin-guide: kernel-parameters: Update nohlt section
doc/admin-guide: fix spelling mistake: "perfomance" -> "performance"
docs: Document cross-referencing using relative path
docs: Enable usage of relative paths to docs on automarkup
docs: thermal: fix spelling mistakes
Documentation: admin-guide: Update kvm/xen config option
docs: Make syscalls' helpers naming consistent
coding-style.rst: Avoid comma statements
Documentation: /proc/loadavg: add 3 more field descriptions
Documentation/submitting-patches: Add blurb about backtraces in commit messages
Docs: drop Python 2 support
Move our minimum Sphinx version to 1.7
Documentation: input: define ABS_PRESSURE/ABS_MT_PRESSURE resolution as grams
scripts/kernel-doc: add internal hyperlink to DOC: sections
Update Documentation/admin-guide/sysctl/fs.rst
docs: Update DTB format references
docs: zh_CN: add iio index.rst translation
docs/zh_CN: add iio ep93xx_adc.rst translation
...
|