mirror of
https://git.proxmox.com/git/mirror_ubuntu-kernels.git
synced 2025-11-20 08:37:24 +00:00
d05374dee2
22462 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
845982eb26 |
mm: swap: allow storage of all mTHP orders
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9faaa0f816 |
mm: swap: update get_swap_pages() to take folio order
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/20240408183946.2991168-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
14c62da21b |
mm: swap: simplify struct percpu_cluster
struct percpu_cluster stores the index of cpu's current cluster and the offset of the next entry that will be allocated for the cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is because the structure used for it also has a flag to indicate "no cluster". However this data structure also contains a spin lock, which is never used in this context, as a side effect the code copies the spinlock_t structure, which is questionable coding practice in my view. So let's clean this up and store only the next offset, and use a sentinal value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately; The first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs being marked free, so it will never appear on the free list. This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system. Link: https://lkml.kernel.org/r/20240408183946.2991168-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a62fb92ac1 |
mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d7d0d389ff |
mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
Patch series "Swap-out mTHP without splitting", v7.
This series adds support for swapping out multi-size THP (mTHP) without
needing to first split the large folio via
split_huge_page_to_list_to_order(). It closely follows the approach
already used to swap-out PMD-sized THP.
There are a couple of reasons for swapping out mTHP without splitting:
- Performance: It is expensive to split a large folio and under
extreme memory pressure some workloads regressed performance when
using 64K mTHP vs 4K small folios because of this extra cost in the
swap-out path. This series not only eliminates the regression but
makes it faster to swap out 64K mTHP vs 4K small folios.
- Memory fragmentation avoidance: If we can avoid splitting a large
folio memory is less likely to become fragmented, making it easier to
re-allocate a large folio in future.
- Performance: Enables a separate series [7] to swap-in whole mTHPs,
which means we won't lose the TLB-efficiency benefits of mTHP once the
memory has been through a swap cycle.
I've done what I thought was the smallest change possible, and as a
result, this approach is only employed when the swap is backed by a
non-rotating block device (just as PMD-sized THP is supported today).
Discussion against the RFC concluded that this is sufficient.
Performance Testing
===================
I've run some swap performance tests on Ampere Altra VM (arm64) with 8
CPUs. The VM is set up with a 35G block ram device as the swap device and
the test is run from inside a memcg limited to 40G memory. I've then run
`usemem` from vm-scalability with 70 processes, each allocating and
writing 1G of memory. I've repeated everything 6 times and taken the mean
performance improvement relative to 4K page baseline:
| alloc size | baseline | + this series |
| | mm-unstable (~v6.9-rc1) | |
|:-----------|------------------------:|------------------------:|
| 4K Page | 0.0% | 1.3% |
| 64K THP | -13.6% | 46.3% |
| 2M THP | 91.4% | 89.6% |
So with this change, the 64K swap performance goes from a 14% regression to a
46% improvement. While 2M shows a small regression I'm confident that this is
just noise.
[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/
[5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/
[6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/
[7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
[9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/
This patch (of 7):
As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE,
which, when present, always implies PMD-sized THP, which is the same as
the cluster size.
The only use of the flag was to determine whether a swap entry refers to a
single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead
of relying on the flag, we now pass in order, which originates from the
folio's order. This allows the logic to work for folios of any order.
The one snag is that one of the swap_page_trans_huge_swapped() call sites
does not have the folio. But it was only being called there to shortcut a
call __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets
the folio and (via some other functions) calls
swap_page_trans_huge_swapped(). So I've removed the problematic call site
and believe the new logic should be functionally equivalent.
That said, removing the fast path means that we will take a reference and
trylock a large folio much more often, which we would like to avoid. The
next patch will solve this.
Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster()
which used to be called during folio splitting, since
split_swap_cluster()'s only job was to remove the flag.
Link: https://lkml.kernel.org/r/20240408183946.2991168-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20240408183946.2991168-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
6303d1c553 |
mm: page_alloc: use the correct THP order for THP PCP
Commit |
||
|
|
43849758fd |
khugepaged: use a folio throughout hpage_collapse_scan_file()
Replace the use of pages with folios. Saves a few calls to compound_head() and removes some uses of obsolete functions. Link: https://lkml.kernel.org/r/20240403171838.1445826-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8d1e24c0b8 |
khugepaged: use a folio throughout collapse_file()
Pull folios from the page cache instead of pages. Half of this work had been done already, but we were still operating on pages for a large chunk of this function. There is no attempt in this patch to handle large folios that are smaller than a THP; that will have to wait for a future patch. [willy@infradead.org: the unlikely() is embedded in IS_ERR()] Link: https://lkml.kernel.org/r/ZhIWX8K0E2tSyMSr@casper.infradead.org Link: https://lkml.kernel.org/r/20240403171838.1445826-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
610ff817b9 |
khugepaged: remove hpage from collapse_file()
Use new_folio throughout where we had been using hpage. Link: https://lkml.kernel.org/r/20240403171838.1445826-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8eca68e2cf |
khugepaged: pass a folio to __collapse_huge_page_copy()
Simplify the body of __collapse_huge_page_copy() while I'm looking at it. Link: https://lkml.kernel.org/r/20240403171838.1445826-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0234779276 |
khugepaged: remove hpage from collapse_huge_page()
Work purely in terms of the folio. Removes a call to compound_head() in put_page(). Link: https://lkml.kernel.org/r/20240403171838.1445826-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d5ab50b941 |
khugepaged: convert alloc_charge_hpage to alloc_charge_folio
Both callers want to deal with a folio, so return a folio from this function. Link: https://lkml.kernel.org/r/20240403171838.1445826-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4746f5ce0f |
khugepaged: inline hpage_collapse_alloc_folio()
Patch series "khugepaged folio conversions". We've been kind of hacking piecemeal at converting khugepaged to use folios instead of compound pages, and so this patchset is a little larger than it should be as I undo some of our wrong moves in the past. In particular, collapse_file() now consistently uses 'new_folio' for the freshly allocated folio and 'folio' for the one that's currently in use. This patch (of 7): This function has one caller, and the combined function is simpler to read, reason about and modify. Link: https://lkml.kernel.org/r/20240403171838.1445826-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240403171838.1445826-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7998df0b64 |
memory: remove the now superfluous sentinel element from ctl_table array
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from all files under mm/ that register a sysctl table. Link: https://lkml.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-1-47c1463b3af2@samsung.com Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e0abfbb671 |
mm: rename vma_pgoff_address back to vma_address
With all callers converted, we can use the nice shorter name. Take this opportunity to reorder the arguments to the logical order (larger object first). Link: https://lkml.kernel.org/r/20240328225831.1765286-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
412ad5fbe9 |
mm: remove vma_address()
Convert the three remaining callers to call vma_pgoff_address() directly. This removes an ambiguity where we'd check just one page if passed a tail page and all N pages if passed a head page. Also add better kernel-doc for vma_pgoff_address(). Link: https://lkml.kernel.org/r/20240328225831.1765286-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7e8347413e |
mm: correct page_mapped_in_vma() for large folios
Patch series "Unify vma_address and vma_pgoff_address". The current vma_address() pretends that the ambiguity between head & tail page is an advantage. If you pass a head page to vma_address(), it will operate on all pages in the folio, while if you pass a tail page, it will operate on a single page. That's not what any of the callers actually want, so first convert all callers to use vma_pgoff_address() and then rename vma_pgoff_address() to vma_address(). This patch (of 3): If 'page' is the first page of a large folio then vma_address() will scan for any page in the entire folio. This can lead to page_mapped_in_vma() returning true if some of the tail pages are mapped and the head page is not. This could lead to memory failure choosing to kill a task unnecessarily. Link: https://lkml.kernel.org/r/20240328225831.1765286-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240328225831.1765286-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
835c3a25aa |
mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics
Now the mTHP can also be split or added into the deferred list, so add folio_test_pmd_mappable() validation for PMD mapped THP, to avoid confusion with PMD mapped THP related statistics. [baolin.wang@linux.alibaba.com: check THP earlier in case folio is split, per Lance] Link: https://lkml.kernel.org/r/b99f8cb14bc85fdb6ab43721d1331cb5ebed2581.1713771041.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d2136d749d |
mm: support multi-size THP numa balancing
Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable. Allow scanning mTHP: Commit |
||
|
|
6b0ed7b3c7 |
mm: factor out the numa mapping rebuilding into a new helper
Patch series "support multi-size THP numa balancing", v2. This patchset tries to support mTHP numa balancing, as a simple solution to start, the NUMA balancing algorithm for mTHP will follow the THP strategy as the basic support. Please find details in each patch. This patch (of 2): To support large folio's numa balancing, factor out the numa mapping rebuilding into a new helper as a preparation. Link: https://lkml.kernel.org/r/cover.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/cover.1711683069.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
68dbcf4899 |
mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases
Fallback rates surpassing 90% have been observed on phones utilizing 64KiB CONT-PTE mTHP. In these scenarios, when one out of every 16 PTEs fails to allocate large folios, the remaining 15 PTEs fallback. Consequently, invoking vma_thp_gfp_mask seems redundant in such cases. Furthermore, abstaining from its use can also contribute to improved code readability. Link: https://lkml.kernel.org/r/20240329073750.20012-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Itaru Kitayama <itaru.kitayama@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ba42b524a0 |
mm: init_mlocked_on_free_v3
Implements the "init_mlocked_on_free" boot option. When this boot option
is enabled, any mlock'ed pages are zeroed on free. If
the pages are munlock'ed beforehand, no initialization takes place.
This boot option is meant to combat the performance hit of
"init_on_free" as reported in commit
|
||
|
|
44bd7ace9f |
mm: take placement mappings gap into account
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap()
needs to consider two things:
1. That the new mapping isn't placed in an any existing mappings guard
gaps.
2. That the new mapping isn't placed such that any existing mappings
are not in *its* guard gaps.
The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2. So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area. Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.
For MAP_GROWSDOWN/VM_GROWSDOWN and MAP_GROWSUP/VM_GROWSUP this has not
been a problem in practice because applications place these kinds of
mappings very early, when there is not many mappings to find a space
between. But for shadow stacks, they may be placed throughout the
lifetime of the application.
Use the start_gap field to find a space that includes the guard gap for
the new mapping. Take care to not interfere with the alignment.
Link: https://lkml.kernel.org/r/20240326021656.202649-12-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
b80fa3cbb7 |
treewide: use initializer for struct vm_unmapped_area_info
Future changes will need to add a new member to struct
vm_unmapped_area_info. This would cause trouble for any call site that
doesn't initialize the struct. Currently every caller sets each member
manually, so if new ones are added they will be uninitialized and the core
code parsing the struct will see garbage in the new member.
It could be possible to initialize the new member manually to 0 at each
call site. This and a couple other options were discussed. Having some
struct vm_unmapped_area_info instances not zero initialized will put those
sites at risk of feeding garbage into vm_unmapped_area(), if the
convention is to zero initialize the struct and any new field addition
missed a call site that initializes each field manually. So it is useful
to do things similar across the kernel.
The consensus (see links) was that in general the best way to accomplish
taking into account both code cleanliness and minimizing the chance of
introducing bugs, was to do C99 static initialization. As in: struct
vm_unmapped_area_info info = {};
With this method of initialization, the whole struct will be zero
initialized, and any statements setting fields to zero will be unneeded.
The change should not leave cleanup at the call sides.
While iterating though the possible solutions a few archs kindly acked
other variations that still zero initialized the struct. These sites have
been modified in previous changes using the pattern acked by the
respective arch.
So to be reduce the chance of bugs via uninitialized fields, perform a
tree wide change using the consensus for the best general way to do this
change. Use C99 static initializing to zero the struct and remove and
statements that simply set members to zero.
Link: https://lkml.kernel.org/r/20240326021656.202649-11-rick.p.edgecombe@intel.com
Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t
Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/
Link: https://lore.kernel.org/lkml/ec3e377a-c0a0-4dd3-9cb9-96517e54d17e@csgroup.eu/
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
ed48e87c7d |
thp: add thp_get_unmapped_area_vmflags()
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap()
needs to consider two things:
1. That the new mapping isn't placed in an any existing mappings guard
gaps.
2. That the new mapping isn't placed such that any existing mappings
are not in *its* guard gaps.
The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2. So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area. Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.
Add a THP implementations of the vm_flags variant of get_unmapped_area().
Future changes will call this from mmap.c in the do_mmap() path to allow
shadow stacks to be placed with consideration taken for the start guard
gap. Shadow stack memory is always private and anonymous and so special
guard gap logic is not needed in a lot of caseis, but it can be mapped by
THP, so needs to be handled.
Link: https://lkml.kernel.org/r/20240326021656.202649-7-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
8a0fe564bb |
mm: use get_unmapped_area_vmflags()
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap()
needs to consider two things:
1. That the new mapping isn't placed in an any existing mappings guard
gaps.
2. That the new mapping isn't placed such that any existing mappings
are not in *its* guard gaps.
The long standing behavior of mmap() is to ensure 1, but not take any care
around 2. So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area. Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.
Use mm_get_unmapped_area_vmflags() in the do_mmap() so future changes can
cause shadow stack mappings to be placed with a guard gap. Also use the
THP variant that takes vm_flags, such that THP shadow stack can get the
same treatment. Adjust the vm_flags calculation to happen earlier so that
the vm_flags can be passed into __get_unmapped_area().
Link: https://lkml.kernel.org/r/20240326021656.202649-6-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
529781b24b |
mm: remove export for get_unmapped_area()
The mm/mmap.c function get_unmapped_area() is not used from any modules, so it doesn't need to be exported. Remove the export. Link: https://lkml.kernel.org/r/20240326021656.202649-5-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
961148704a |
mm: introduce arch_get_unmapped_area_vmflags()
When memory is being placed, mmap() will take care to respect the guard
gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and
VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap()
needs to consider two things:
1. That the new mapping isn't placed in an any existing mappings guard
gaps.
2. That the new mapping isn't placed such that any existing mappings
are not in *its* guard gaps.
The longstanding behavior of mmap() is to ensure 1, but not take any care
around 2. So for example, if there is a PAGE_SIZE free area, and a mmap()
with a PAGE_SIZE size, and a type that has a guard gap is being placed,
mmap() may place the shadow stack in the PAGE_SIZE free area. Then the
mapping that is supposed to have a guard gap will not have a gap to the
adjacent VMA.
In order to take the start gap into account, the maple tree search needs
to know the size of start gap the new mapping will need. The call chain
from do_mmap() to the actual maple tree search looks like this:
do_mmap(size, vm_flags, map_flags, ..)
mm/mmap.c:get_unmapped_area(size, map_flags, ...)
arch_get_unmapped_area(size, map_flags, ...)
vm_unmapped_area(struct vm_unmapped_area_info)
One option would be to add another MAP_ flag to mean a one page start gap
(as is for shadow stack), but this consumes a flag unnecessarily. Another
option could be to simply increase the size passed in do_mmap() by the
start gap size, and adjust after the fact, but this will interfere with
the alignment requirements passed in struct vm_unmapped_area_info, and
unknown to mmap.c. Instead, introduce variants of
arch_get_unmapped_area/_topdown() that take vm_flags. In future changes,
these variants can be used in mmap.c:get_unmapped_area() to allow the
vm_flags to be passed through to vm_unmapped_area(), while preserving the
normal arch_get_unmapped_area/_topdown() for the existing callers.
Link: https://lkml.kernel.org/r/20240326021656.202649-4-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin (Intel) <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
529ce23a76 |
mm: switch mm->get_unmapped_area() to a flag
The mm_struct contains a function pointer *get_unmapped_area(), which is set to either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() during the initialization of the mm. Since the function pointer only ever points to two functions that are named the same across all arch's, a function pointer is not really required. In addition future changes will want to add versions of the functions that take additional arguments. So to save a pointers worth of bytes in mm_struct, and prevent adding additional function pointers to mm_struct in future changes, remove it and keep the information about which get_unmapped_area() to use in a flag. Add the new flag to MMF_INIT_MASK so it doesn't get clobbered on fork by mmf_init_flags(). Most MM flags get clobbered on fork. In the pre-existing behavior mm->get_unmapped_area() would get copied to the new mm in dup_mm(), so not clobbering the flag preserves the existing behavior around inheriting the topdown-ness. Introduce a helper, mm_get_unmapped_area(), to easily convert code that refers to the old function pointer to instead select and call either arch_get_unmapped_area() or arch_get_unmapped_area_topdown() based on the flag. Then drop the mm->get_unmapped_area() function pointer. Leave the get_unmapped_area() pointer in struct file_operations alone. The main purpose of this change is to reorganize in preparation for future changes, but it also converts the calls of mm->get_unmapped_area() from indirect branches into a direct ones. The stress-ng bigheap benchmark calls realloc a lot, which calls through get_unmapped_area() in the kernel. On x86, the change yielded a ~1% improvement there on a retpoline config. In testing a few x86 configs, removing the pointer unfortunately didn't result in any actual size reductions in the compiled layout of mm_struct. But depending on compiler or arch alignment requirements, the change could shrink the size of mm_struct. Link: https://lkml.kernel.org/r/20240326021656.202649-3-rick.p.edgecombe@intel.com Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
08b8247ebd |
mm: remove __set_page_dirty_nobuffers()
There are no more callers of __set_page_dirty_nobuffers(), remove it. Link: https://lkml.kernel.org/r/20240327143008.3739435-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
82a616d0f3 |
mm: remove "prot" parameter from move_pte()
The "prot" parameter is unused, and using it instead of what's stored in that particular PTE would very likely be wrong. Let's simply remove it. Link: https://lkml.kernel.org/r/20240327143301.741807-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andreas Larsson <andreas@gaisler.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c0bff412e6 |
mm: allow anon exclusive check over hugetlb tail pages
PageAnonExclusive() used to forbid tail pages for hugetlbfs, as that used
to be called mostly in hugetlb specific paths and the head page was
guaranteed.
As we move forward towards merging hugetlb paths into generic mm, we may
start to pass in tail hugetlb pages (when with cont-pte/cont-pmd huge
pages) for such check. Allow it to properly fetch the head, in which case
the anon-exclusiveness of the head will always represents the tail page.
There's already a sign of it when we look at the GUP-fast which already
contain the hugetlb processing altogether: we used to have a specific
commit
|
||
|
|
9cb28da546 |
mm/gup: handle hugetlb in the generic follow_page_mask code
Now follow_page() is ready to handle hugetlb pages in whatever form, and over all architectures. Switch to the generic code path. Time to retire hugetlb_follow_page_mask(), following the previous retirement of follow_hugetlb_page() in |
||
|
|
a12083d721 |
mm/gup: handle hugepd for follow_page()
Hugepd is only used in PowerPC so far on 4K page size kernels where hash mmu is used. follow_page_mask() used to leverage hugetlb APIs to access hugepd entries. Teach follow_page_mask() itself on hugepd. With previous refactors on fast-gup gup_huge_pd(), most of the code can be leveraged. There's something not needed for follow page, for example, gup_hugepte() tries to detect pgtable entry change which will never happen with slow gup (which has the pgtable lock held), but that's not a problem to check. Since follow_page() always only fetch one page, set the end to "address + PAGE_SIZE" should suffice. We will still do the pgtable walk once for each hugetlb page by setting ctx->page_mask properly. One thing worth mentioning is that some level of pgtable's _bad() helper will report is_hugepd() entries as TRUE on Power8 hash MMUs. I think it at least applies to PUD on Power8 with 4K pgsize. It means feeding a hugepd entry to pud_bad() will report a false positive. Let's leave that for now because it can be arch-specific where I am a bit declined to touch. In this patch it's not a problem as long as hugepd is detected before any bad pgtable entries. To allow slow gup like follow_*_page() to access hugepd helpers, hugepd codes are moved to the top. Besides that, the helper record_subpages() will be used by either hugepd or fast-gup now. To avoid "unused function" warnings we must provide a "#ifdef" for it, unfortunately. Link: https://lkml.kernel.org/r/20240327152332.950956-13-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4418c522f6 |
mm/gup: handle huge pmd for follow_pmd_mask()
Replace pmd_trans_huge() with pmd_leaf() to also cover pmd_huge() as long as enabled. FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet huge. Since now follow_trans_huge_pmd() can process hugetlb pages, renaming it into follow_huge_pmd() to match what it does. Move it into gup.c so not depend on CONFIG_THP. When at it, move the ctx->page_mask setup into follow_huge_pmd(), only set it when the page is valid. It was not a bug to set it before even if GUP failed (page==NULL), because follow_page_mask() callers always ignores page_mask if so. But doing so makes the code cleaner. [peterx@redhat.com: allow follow_pmd_mask() to take hugetlb tail pages] Link: https://lkml.kernel.org/r/20240403013249.1418299-3-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1b16761802 |
mm/gup: handle huge pud for follow_pud_mask()
Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb. Rename follow_devmap_pud() to follow_huge_pud() so that it can process either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and and huge_memory.c (which relies on CONFIG_THP). Switch to pud_leaf() to detect both cases in the slow gup. In the new follow_huge_pud(), taking care of possible CoR for hugetlb if necessary. touch_pud() needs to be moved out of huge_memory.c to be accessable from gup.c even if !THP. Since at it, optimize the non-present check by adding a pud_present() early check before taking the pgtable lock, failing the follow_page() early if PUD is not present: that is required by both devmap or hugetlb. Use pud_huge() to also cover the pud_devmap() case. One more trivial thing to mention is, introduce "pud_t pud" in the code paths along the way, so the code doesn't dereference *pudp multiple time. Not only because that looks less straightforward, but also because if the dereference really happened, it's not clear whether there can be race to see different *pudp values when it's being modified at the same time. Setting ctx->page_mask properly for a PUD entry. As a side effect, this patch should also be able to optimize devmap GUP on PUD to be able to jump over the whole PUD range, but not yet verified. Hugetlb already can do so prior to this patch. Link: https://lkml.kernel.org/r/20240327152332.950956-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
caf8cab798 |
mm/gup: cache *pudp in follow_pud_mask()
Introduce "pud_t pud" in the function, so the code won't dereference *pudp multiple time. Not only because that looks less straightforward, but also because if the dereference really happened, it's not clear whether there can be race to see different *pudp values if it's being modified at the same time. Link: https://lkml.kernel.org/r/20240327152332.950956-10-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: James Houghton <jthoughton@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
878b0c4516 |
mm/gup: handle hugetlb for no_page_table()
no_page_table() is not yet used for hugetlb code paths. Make it prepared. The major difference here is hugetlb will return -EFAULT as long as page cache does not exist, even if VM_SHARED. See hugetlb_follow_page_mask(). Pass "address" into no_page_table() too, as hugetlb will need it. Link: https://lkml.kernel.org/r/20240327152332.950956-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f3c94c625f |
mm/gup: refactor record_subpages() to find 1st small page
All the fast-gup functions take a tail page to operate, always need to do page mask calculations before feeding that into record_subpages(). Merge that logic into record_subpages(), so that it will do the nth_page() calculation. Link: https://lkml.kernel.org/r/20240327152332.950956-8-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
607c63195d |
mm/gup: drop gup_fast_folio_allowed() in hugepd processing
Hugepd format for GUP is only used in PowerPC with hugetlbfs. There are
some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
PPC_8XX), however those pages are not candidates for GUP.
Commit
|
||
|
|
239e9a90c8 |
mm: introduce vma_pgtable_walk_{begin|end}()
Introduce per-vma begin()/end() helpers for pgtable walks. This is a preparation work to merge hugetlb pgtable walkers with generic mm. The helpers need to be called before and after a pgtable walk, will start to be needed if the pgtable walker code supports hugetlb pages. It's a hook point for any type of VMA, but for now only hugetlb uses it to stablize the pgtable pages from getting away (due to possible pmd unsharing). Link: https://lkml.kernel.org/r/20240327152332.950956-5-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Muchun Song <muchun.song@linux.dev> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
24334e78e8 |
mm/hugetlb: declare hugetlbfs_pagecache_present() non-static
It will be used outside hugetlb.c soon. Link: https://lkml.kernel.org/r/20240327152332.950956-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ac3830c3b2 |
mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
Patch series "mm/gup: Unify hugetlb, part 2", v4. The series removes the hugetlb slow gup path after a previous refactor work [1], so that slow gup now uses the exact same path to process all kinds of memory including hugetlb. For the long term, we may want to remove most, if not all, call sites of huge_pte_offset(). It'll be ideal if that API can be completely dropped from arch hugetlb API. This series is one small step towards merging hugetlb specific codes into generic mm paths. From that POV, this series removes one reference to huge_pte_offset() out of many others. One goal of such a route is that we can reconsider merging hugetlb features like High Granularity Mapping (HGM). It was not accepted in the past because it may add lots of hugetlb specific codes and make the mm code even harder to maintain. With a merged codeset, features like HGM can hopefully share some code with THP, legacy (PMD+) or modern (continuous PTEs). To make it work, the generic slow gup code will need to at least understand hugepd, which is already done like so in fast-gup. Due to the specialty of hugepd to be software-only solution (no hardware recognizes the hugepd format, so it's purely artificial structures), there's chance we can merge some or all hugepd formats with cont_pte in the future. That question is yet unsettled from Power side to have an acknowledgement. As of now for this series, I kept the hugepd handling because we may still need to do so before getting a clearer picture of the future of hugepd. The other reason is simply that we did it already for fast-gup and most codes are still around to be reused. It'll make more sense to keep slow/fast gup behave the same before a decision is made to remove hugepd. There's one major difference for slow-gup on cont_pte / cont_pmd handling, currently supported on three architectures (aarch64, riscv, ppc). Before the series, slow gup will be able to recognize e.g. cont_pte entries with the help of huge_pte_offset() when hstate is around. Now it's gone but still working, by looking up pgtable entries one by one. It's not ideal, but hopefully this change should not affect yet on major workloads. There's some more information in the commit message of the last patch. If this would be a concern, we can consider teaching slow gup to recognize cont pte/pmd entries, and that should recover the lost performance. But I doubt its necessity for now, so I kept it as simple as it can be. Patch layout ============= Patch 1-8: Preparation works, or cleanups in relevant code paths Patch 9-11: Teach slow gup with all kinds of huge entries (pXd, hugepd) Patch 12: Drop hugetlb_follow_page_mask() More information can be found in the commit messages of each patch. [1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com [2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com Introduce a config option that will be selected as long as huge leaves are involved in pgtable (thp or hugetlbfs). It would be useful to mark any code with this new config that can process either hugetlb or thp pages in any level that is higher than pte level. Link: https://lkml.kernel.org/r/20240327152332.950956-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240327152332.950956-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Jones <andrew.jones@linux.dev> Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: James Houghton <jthoughton@google.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
632230ff19 |
mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio
Also remove mm_get_huge_zero_page() now it has no users. Link: https://lkml.kernel.org/r/20240326202833.523759-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e28833bc4a |
mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
Use folios more widely. Link: https://lkml.kernel.org/r/20240326202833.523759-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5691753d73 |
mm: convert huge_zero_page to huge_zero_folio
With all callers of is_huge_zero_page() converted, we can now switch the huge_zero_page itself from being a compound page to a folio. Link: https://lkml.kernel.org/r/20240326202833.523759-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b002a7b0a5 |
mm: convert migrate_vma_collect_pmd to use a folio
Convert the pmd directly to a folio and use it. Turns four calls to compound_head() into one. Link: https://lkml.kernel.org/r/20240326202833.523759-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e06d03d559 |
mm: add pmd_folio()
Convert directly from a pmd to a folio without going through another representation first. For now this is just a slightly shorter way to write it, but it might end up being more efficient later. Link: https://lkml.kernel.org/r/20240326202833.523759-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5beaee54a3 |
mm: add is_huge_zero_folio()
This is the folio equivalent of is_huge_zero_page(). It doesn't add any efficiency, but it does prevent the caller from passing a tail page and getting confused when the predicate returns false. Link: https://lkml.kernel.org/r/20240326202833.523759-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
796c2c23e1 |
zswap: replace RB tree with xarray
Very deep RB tree requires rebalance at times. That contributes to the
zswap fault latencies. Xarray does not need to perform tree rebalance.
Replacing RB tree to xarray can have some small performance gain.
One small difference is that xarray insert might fail with ENOMEM, while
RB tree insert does not allocate additional memory.
The zswap_entry size will reduce a bit due to removing the RB node, which
has two pointers and a color field. Xarray store the pointer in the
xarray tree rather than the zswap_entry. Every entry has one pointer from
the xarray tree. Overall, switching to xarray should save some memory, if
the swap entries are densely packed.
Notice the zswap_rb_search and zswap_rb_insert often followed by
zswap_rb_erase. Use xa_erase and xa_store directly. That saves one tree
lookup as well.
Remove zswap_invalidate_entry due to no need to call zswap_rb_erase any
more. Use zswap_free_entry instead.
The "struct zswap_tree" has been replaced by "struct xarray". The tree
spin lock has transferred to the xarray lock.
Run the kernel build testing 5 times for each version, averages:
(memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile,
24 HT core, 32 jobs)
mm-unstable-4aaccadb5c04 xarray v9
user 3548.902 3534.375
sys 522.232 520.976
real 202.796 200.864
[chrisl@kernel.org: restore original comment "erase" to "invalidate"]
Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v10-1-bf698417c968@kernel.org
Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v9-1-d2891a65dfc7@kernel.org
Signed-off-by: Chris Li <chrisl@kernel.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0aac45663a |
mm/page_alloc.c: change the array-length to MIGRATE_PCPTYPES
Earlier, in commit
|
||
|
|
96a5c186ef |
mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone
On one node, for lower zone's ->lowmem_reserve[], it will show how much
memory is reserved in this lower zone to avoid excessive page allocation
from the relevant higher zone's fallback allocation.
However, currently lower zone's lowmem_reserve[] element will be filled
even though the relevant higher zone is empty. That doesnt' make sense
and can cause confusion.
E.g on node 0 of one system as below, it has zone
DMA/DMA32/NORMAL/MOVABLE/DEVICE, among them zone MOVABLE/DEVICE are the
highest and both are empty. In zone DMA/DMA32's protection array, we can
see that it has value for zone MOVABLE and DEVICE.
Node 0, zone DMA
......
pages free 2816
boost 0
min 7
low 10
high 13
spanned 4095
present 3998
managed 3840
cma 0
protection: (0, 1582, 23716, 23716, 23716)
......
Node 0, zone DMA32
pages free 403269
boost 0
min 753
low 1158
high 1563
spanned 1044480
present 487039
managed 405070
cma 0
protection: (0, 0, 22134, 22134, 22134)
......
Node 0, zone Normal
pages free 5423879
boost 0
min 10539
low 16205
high 21871
spanned 5767168
present 5767168
managed 5666438
cma 0
protection: (0, 0, 0, 0, 0)
......
Node 0, zone Movable
pages free 0
boost 0
min 32
low 32
high 32
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)
Node 0, zone Device
pages free 0
boost 0
min 0
low 0
high 0
spanned 0
present 0
managed 0
cma 0
protection: (0, 0, 0, 0, 0)
Here, clear out the element value in lower zone's ->lowmem_reserve[] if the
relevant higher zone is empty.
And also replace space with tab in _deferred_grow_zone()
Link: https://lkml.kernel.org/r/20240326061134.1055295-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f55d3471b7 |
mm/mm_init.c: remove the outdated code comment above deferred_grow_zone()
The noinline attribute has been taken off in commit
|
||
|
|
bb8ea62daa |
mm/page_alloc.c: remove unneeded codes in !NUMA version of build_zonelists()
When CONFIG_NUMA=n, MAX_NUMNODES is always 1 because Kconfig item NODES_SHIFT depends on NUMA. So in !NUMA version of build_zonelists(), no need to bother with the two for loop because code execution won't enter them ever. Here, remove those unneeded codes in !NUMA version of build_zonelists(). [bhe@redhat.com: remove unused locals] Link: https://lkml.kernel.org/r/ZgQL1WOf9K88nLpQ@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240326061134.1055295-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b6dd94596f |
mm: make __absent_pages_in_range() as static
It's only called in mm/mm_init.c now. Link: https://lkml.kernel.org/r/20240326061134.1055295-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c091dd963f |
mm/init: remove the unnecessary special treatment for memory-less node
Because memory-less node's ->node_present_pages and its zone's ->present_pages are all 0, the judgement before calling node_set_state() to set N_MEMORY, N_HIGH_MEMORY, N_NORMAL_MEMORY for node is enough to skip memory-less node. The 'continue;' statement inside for_each_node() loop of free_area_init() is gilding the lily. Here, remove the special handling to make memory-less node share the same code flow as normal node. And also rephrase the code comments above the 'continue' statement and move them above above line 'if (pgdat->node_present_pages)'. [bhe@redhat.com: redo code comments, per Mike] Link: https://lkml.kernel.org/r/ZhYJAVQRYJSTKZng@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240326061134.1055295-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
850ed20539 |
mm: move array mem_section init code out of memory_present()
Patch series "mm/init: minor clean up and improvement". These are all observed when going through code flow during mm init. This patch (of 7): When CONFIG_SPARSEMEM_EXTREME is enabled, mem_section need be initialized to point at a two-dimensional array, and its 1st dimension of length NR_SECTION_ROOTS will be dynamically allocated. Once the allocation is done, it's available for all nodes. So take the 1st dimension of mem_section initialization out of memory_present()(), and put it into memblocks_present() which is a more appripriate place. Link: https://lkml.kernel.org/r/20240326061134.1055295-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240326061134.1055295-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e6100a4590 |
mm, slab: move slab_memcg hooks to mm/memcontrol.c
The hooks make multiple calls to functions in mm/memcontrol.c, including to th current_obj_cgroup() marked __always_inline. It might be faster to make a single call to the hook in mm/memcontrol.c instead. The hooks also don't use almost anything from mm/slub.c. obj_full_size() can move with the hooks and cache_vmstat_idx() to the internal mm/slab.h Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-2-d85d2563287a@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9f9796b413 |
mm, slab: move memcg charging to post-alloc hook
Patch series "memcg_kmem hooks refactoring", v3. This patch (of 2): The MEMCG_KMEM integration with slab currently relies on two hooks during allocation. memcg_slab_pre_alloc_hook() determines the objcg and charges it, and memcg_slab_post_alloc_hook() assigns the objcg pointer to the allocated object(s). As Linus pointed out, this is unnecessarily complex. Failing to charge due to memcg limits should be rare, so we can optimistically allocate the object(s) and do the charging together with assigning the objcg pointer in a single post_alloc hook. In the rare case the charging fails, we can free the object(s) back. This simplifies the code (no need to pass around the objcg pointer) and potentially allows to separate charging from allocation in cases where it's common that the allocation would be immediately freed, and the memcg handling overhead could be saved. [vbabka@suse.cz: fix call to memcg_alloc_abort_single()] Link: https://lkml.kernel.org/r/4af50be2-4109-45e5-8a36-2136252a635e@suse.cz [roman.gushchin@linux.dev: comment fixup] Link: https://lkml.kernel.org/r/Zg2LsNm6twOmG69l@P9FQF9L96D.corp.robot.car Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-0-d85d2563287a@suse.cz Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-1-d85d2563287a@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/ Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Aishwarya TCV <aishwarya.tcv@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4dc7d37370 |
remove references to page->flags in documentation
Mostly rewording, but remove entirely the copy of page_fixed_fake_head() in the documentation; we can refer people to the actual source if necessary. Link: https://lkml.kernel.org/r/20240326171045.410737-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5e0debe012 |
slub: remove use of page->flags
Use slub->__page_flags instead. We can also remove the assertion that it's not a tail page as struct slab never points to a tail page. Link: https://lkml.kernel.org/r/20240326171045.410737-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
51718e25c5 |
mm: convert arch_clear_hugepage_flags to take a folio
All implementations that aren't no-ops just set a bit in the flags, and we want to use the folio flags rather than the page flags for that. Rename it to arch_clear_hugetlb_flags() while we're touching it so nobody thinks it's used for THP. [willy@infradead.org: fix arm64 build] Link: https://lkml.kernel.org/r/ZgQvNKGdlDkwhQEX@casper.infradead.org Link: https://lkml.kernel.org/r/20240326171045.410737-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b84fd2835c |
mm: make page_mapped() take a const argument
None of the functions called by page_mapped() modify the page or folio, so mark them all as const. Link: https://lkml.kernel.org/r/20240326171045.410737-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2ace5a670e |
mm: make is_free_buddy_page() take a const argument
This function does not modify its argument; let the callers know that so they can make better optimisation decisions. Link: https://lkml.kernel.org/r/20240326171045.410737-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6e65aa55cd |
mm: make page_ext_get() take a const argument
In order to constify other functions, we need page_ext_get() to be const. This is no problem as lookup_page_ext() already takes a const argument. Link: https://lkml.kernel.org/r/20240326171045.410737-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f002882ca3 |
mm: merge folio_is_secretmem() and folio_fast_pin_allowed() into gup_fast_folio_allowed()
folio_is_secretmem() is currently only used during GUP-fast. Nowadays, folio_fast_pin_allowed() performs similar checks during GUP-fast and contains a lot of careful handling -- READ_ONCE() -- , sanity checks -- lockdep_assert_irqs_disabled() -- and helpful comments on how this handling is safe and correct. So let's merge folio_is_secretmem() into folio_fast_pin_allowed(). Rename folio_fast_pin_allowed() to gup_fast_folio_allowed(), to better match the new semantics. Link: https://lkml.kernel.org/r/20240326143210.291116-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: xingwei lee <xrivendell7@gmail.com> Cc: yue sun <samsun1006219@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5b34b76cb0 |
mm: move follow_phys to arch/x86/mm/pat/memtype.c
follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. Move it there and hardcode the two arguments that get the same values passed by both callers. [david@redhat.com: conflict resolutions] Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cb10c28ac8 |
mm: remove follow_pfn
Remove follow_pfn now that the last user is gone. Link: https://lkml.kernel.org/r/20240324234542.2038726-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fei Li <fei1.li@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
85109a8a9a |
mm: backing-dev: use group allocation/free of per-cpu counters API
Use group allocation/free of per-cpu counters api to accelerate wb_init/exit() and simplify code. Link: https://lkml.kernel.org/r/20240325035635.49342-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
a8353dc98f |
huge_memory.c: document huge page splitting rules more thoroughly
1. Add information about the behavior of huge page splitting, with respect to page/folio refcounts, and gup/pup pins. 2. Update and clarify the existing documentation, to compensate for the ravages of time and code change. Link: https://lkml.kernel.org/r/20240325044452.217463-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d4e6b397be |
mm/mmap: convert all mas except mas_detach to vma iterator
There are two types of iterators mas and vmi in the current code. If the
maple tree comes from the mm structure, we can use the vma iterator.
Avoid using mas directly as possible.
Keep using mas for the mt_detach tree, since it doesn't come from the mm
structure.
Remove as many uses of mas as possible, but we will still have a few that
must be passed through in unmap_vmas() and free_pgtables().
Also introduce vma_iter_reset, vma_iter_{prev, next}_range_limit and
vma_iter_area_{lowest, highest} helper functions for using the vma
interator.
Link: https://lkml.kernel.org/r/20240325063258.1437618-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Helge Deller <deller@gmx.de> [parisc]
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0b52663f75 |
mm/mm_init.c: remove arch_reserved_kernel_pages()
Since the current calculation of calc_nr_kernel_pages() has taken into consideration of kernel reserved memory, no need to have arch_reserved_kernel_pages() any more. Link: https://lkml.kernel.org/r/20240325145646.1044760-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
90e796e22e |
mm/mm_init.c: remove unneeded calc_memmap_size()
Nobody calls calc_memmap_size() now. Link: https://lkml.kernel.org/r/20240325145646.1044760-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0ac5e785dc |
mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()
Currently, in free_area_init_core(), when initialize zone's field, a rough
value is set to zone->managed_pages. That value is calculated by
(zone->present_pages - memmap_pages).
In the meantime, add the value to nr_all_pages and nr_kernel_pages which
represent all free pages of system (only low memory or including HIGHMEM
memory separately). Both of them are gonna be used in
alloc_large_system_hash().
However, the rough calculation and setting of zone->managed_pages is
meaningless because
a) memmap pages are allocated on units of node in sparse_init() or
alloc_node_mem_map(pgdat); The simple (zone->present_pages -
memmap_pages) is too rough to make sense for zone;
b) the set zone->managed_pages will be zeroed out and reset with
acutal value in mem_init() via memblock_free_all(). Before the
resetting, no buddy allocation request is issued.
Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages), directly set zone->managed_pages as
zone->present_pages for now. It will be adjusted in mem_init().
And also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core(). Instead, call the newly added
calc_nr_kernel_pages() to count up all free but not reserved memory in
memblock and assign to nr_all_pages and nr_kernel_pages. The counting
excludes memmap_pages, and other kernel used data, which is more accurate
than old way and simpler, and can also cover the ppc required
arch_reserved_kernel_pages() case.
And also clean up the outdated code comment above free_area_init_core().
And free_area_init_core() is easy to understand now, no need to add words
to explain.
[bhe@redhat.com: initialize zone->managed_pages as zone->present_pages for now]
Link: https://lkml.kernel.org/r/ZgU0bsJ2FEjykvju@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240325145646.1044760-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
8ad4184985 |
mm/mm_init.c: add new function calc_nr_all_pages()
This is a preparation to calculate nr_kernel_pages and nr_all_pages, both of which will be used later in alloc_large_system_hash(). nr_all_pages counts up all free but not reserved memory in memblock allocator, including HIGHMEM memory. While nr_kernel_pages counts up all free but not reserved low memory in memblock allocator, excluding HIGHMEM memory. Link: https://lkml.kernel.org/r/20240325145646.1044760-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6600a6b10c |
mm/mm_init.c: remove the useless dma_reserve
Now nobody calls set_dma_reserve() to set value for dma_reserve, remove set_dma_reserve(), global variable dma_reserve and the codes using it. Link: https://lkml.kernel.org/r/20240325145646.1044760-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
6758c1128c |
mm/filemap: optimize filemap folio adding
Instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion. If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only need to walk the tree once. If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting
Before:
bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap -thp=1 --readonly \
--rw=randread --time_based --ramp_time=30s --runtime=10m \
--group_reporting
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap \
--rw=randread --time_based --runtime=5s --group_reporting
Before:
bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+12.5%):
bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
b2ebcf9d3d |
mm/filemap: clean up hugetlb exclusion code
__filemap_add_folio only has two callers, one never passes hugetlb folio and one always passes in hugetlb folio. So move the hugetlb related cgroup charging out of it to make the code cleaner. Link: https://lkml.kernel.org/r/20240415171857.19244-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
de60fd8dde |
mm/filemap: return early if failed to allocate memory for split
Patch series "mm/filemap: optimize folio adding and splitting", v4.
Currently, at least 3 tree walks are needed for filemap folio adding if
the folio is previously evicted. One for getting the order of current
slot, one for ranged conflict check, and one for another order retrieving.
If a split is needed, more walks are needed.
This series is trying to merge these walks, and speed up
filemap_add_folio, I see a 7.5% - 12.5% performance gain for fio stress
test.
So instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion. If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.
Drop the lock and do the allocation only if a split is needed.
In the best case, it only need to walk the tree once. If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).
Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap --rw=randread --time_based \
--ramp_time=30s --runtime=5m --group_reporting
Before:
bw ( MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691
After (+7.3%):
bw ( MiB/s): min= 493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651
Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):
echo 3 > /proc/sys/vm/drop_caches
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap -thp=1 --readonly \
--rw=randread --time_based --ramp_time=30s --runtime=10m \
--group_reporting
fio -name=cached --numjobs=16 --filename=/mnt/test.img \
--buffered=1 --ioengine=mmap \
--rw=randread --time_based --runtime=5s --group_reporting
Before:
bw ( KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·
READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec
After (+10.4%):
bw ( KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146
READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec
The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.
This patch (of 4):
xas_split_alloc could fail with NOMEM, and in such case, it should abort
early instead of keep going and fail the xas_split below.
Link: https://lkml.kernel.org/r/20240416071722.45997-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
ebb34f78d7 |
mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
Callers of folio_estimated_sharers() only care about "mapped shared vs. mapped exclusively", not the exact estimate of sharers. Let's consolidate and unify the condition users are checking. While at it clarify the semantics and extend the discussion on the fuzziness. Use the "likely mapped shared" terminology to better express what the (adjusted) function actually checks. Whether a partially-mappable folio is more likely to not be partially mapped than partially mapped is debatable. In the future, we might be able to improve our estimate for partially-mappable folios, though. Note that we will now consistently detect "mapped shared" only if the first subpage is actually mapped multiple times. When the first subpage is not mapped, we will consistently detect it as "mapped exclusively". This change should currently only affect the usage in madvise_free_pte_range() and queue_folios_pte_range() for large folios: if the first page was already unmapped, we would have skipped the folio. [david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup] Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7262f208ca |
mm/migrate: split source folio if it is on deferred split list
If the source folio is on deferred split list, it is likely some subpages are not used. Split it before migration to avoid migrating unused subpages. Commit |
||
|
|
73bc32875e |
mm: hold PTL from the first PTE while reclaiming a large folio
Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE modifications preceded by pte clear. While iterating over PTEs of a large folio, it only starts acquiring PTL from the first valid (present) PTE. PTE modifications can temporarily set PTEs to pte_none. Consequently, the initial PTEs of a large folio might be skipped in try_to_unmap_one(). For example, for an anon folio, if we skip PTE0, we may have PTE0 which is still present, while PTE1 ~ PTE(nr_pages - 1) are swap entries after try_to_unmap_one(). So folio will be still mapped, the folio fails to be reclaimed and is put back to LRU in this round. This also breaks up PTEs optimization such as CONT-PTE on this large folio and may lead to accident folio_split() afterwards. And since a part of PTEs are now swap entries, accessing those parts will introduce overhead - do_swap_page. Although the kernel can withstand all of the above issues, the situation still seems quite awkward and warrants making it more ideal. The same race also occurs with small folios, but they have only one PTE, thus, it won't be possible for them to be partially unmapped. This patch holds PTL from PTE0, allowing us to avoid reading PTE values that are in the process of being transformed. With stable PTE values, we can ensure that this large folio is either completely reclaimed or that all PTEs remain untouched in this round. A corner case is that if we hold PTL from PTE0 and most initial PTEs have been really unmapped before that, we may increase the duration of holding PTL. Thus we only apply this optimization to folios which are still entirely mapped (not in deferred_split list). [akpm@linux-foundation.org: rewrap comment, per Matthew] Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Huang, Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4b68a773a7 |
mm/vmalloc.c: optimize to reduce arguments of alloc_vmap_area()
If called by __get_vm_area_node(), by open coding the field assignments of 'struct vm_struct *vm', and move the vm->flags and vm->caller assignments into __get_vm_area_node(), the passed in arguments 'flags' and 'caller' can be removed. This alleviates overloaded arguments passed in for alloc_vmap_area(). Link: https://lkml.kernel.org/r/20240309044454.648888-1-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
5c46d5319b |
mm/filemap: don't decrease mmap_miss when folio has workingset flag
If there are too many folios that are recently evicted in a file, then they will probably continue to be evicted. In such situation, there is no positive effect to read-ahead this file since it is only a waste of IO. The mmap_miss is increased in do_sync_mmap_readahead() and decreased in both do_async_mmap_readahead() and filemap_map_pages(). In order to skip read-ahead in above scenario, the mmap_miss have to increased exceed MMAP_LOTSAMISS. This can be done by stop decreased mmap_miss when folio has workingset flag. The async path is not to care because in above scenario, it's hard to run into the async path. [liushixin2@huawei.com: add comments] Link: https://lkml.kernel.org/r/20240326065026.1910584-1-liushixin2@huawei.com Link: https://lkml.kernel.org/r/20240322093555.226789-3-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0fd44ab213 |
mm/readahead: break read-ahead loop if filemap_add_folio return -ENOMEM
Patch series "Fix I/O high when memory almost met memcg limit", v2.
Recently, when install package in a docker which almost reached its memory
limit, the installer has no respond severely for more than 15 minutes.
During this period, I/O stays high(~1G/s) and influence the whole machine.
I've constructed a use case as follows:
1. create a docker:
$ cat test.sh
#!/bin/bash
docker rm centos7 --force
docker create --name centos7 --memory 4G --memory-swap 6G centos:7 /usr/sbin/init
docker start centos7
sleep 1
docker cp ./alloc_page centos7:/
docker cp ./reproduce.sh centos7:/
docker exec -it centos7 /bin/bash
2. try reproduce the problem in docker:
$ cat reproduce.sh
#!/bin/bash
while true; do
flag=$(ps -ef | grep -v grep | grep alloc_page| wc -l)
if [ "$flag" -eq 0 ]; then
/alloc_page &
fi
sleep 30
start_time=$(date +%s)
yum install -y expect > /dev/null 2>&1
end_time=$(date +%s)
elapsed_time=$((end_time - start_time))
echo "$elapsed_time seconds"
yum remove -y expect > /dev/null 2>&1
done
$ cat alloc_page.c:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#define SIZE 1*1024*1024 //1M
int main()
{
void *addr = NULL;
int i;
for (i = 0; i < 1024 * 6 - 50;i++) {
addr = (void *)malloc(SIZE);
if (!addr)
return -1;
memset(addr, 0, SIZE);
}
sleep(99999);
return 0;
}
We found that this problem is caused by a lot ot meaningless read-ahead.
Since the docker is almost met memory limit, the page will be reclaimed
immediately after read-ahead and will read-ahead again immediately. The
program is executed slowly and waste a lot of I/O resource.
These two patch aim to break the read-ahead in above scenario.
[1] https://lore.kernel.org/linux-mm/c2f4a2fa-3bde-72ce-66f5-db81a373fdbc@huawei.com/T/
[2] https://lore.kernel.org/all/20240201100835.1626685-1-liushixin2@huawei.com/
[3] https://lore.kernel.org/all/20240201173130.frpaqpy7iyzias5j@quack3/
This patch (of 2):
When filemap_add_folio() return -ENOMEM, break read-ahead loop like what
filemap_alloc_folio() does.
Link: https://lkml.kernel.org/r/20240322093555.226789-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20240322093555.226789-2-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f238b8c33c |
arm64: mm: swap: support THP_SWAP on hardware with MTE
Commit
|
||
|
|
42d0c3fbb5 |
mm: hugetlb: make the hugetlb migration strategy consistent
As discussed in previous thread [1], there is an inconsistency when handing hugetlb migration. When handling the migration of freed hugetlb, it prevents fallback to other NUMA nodes in alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use hugetlb, it allows fallback to other NUMA nodes in alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool and might result in unexpected failures when node bound workloads doesn't get what is asssumed available. To make hugetlb migration strategy more clear, we should list all the scenarios of hugetlb migration and analyze whether allocation fallback is permitted: 1) Memory offline: will call dissolve_free_huge_pages() to free the freed hugetlb, and call do_migrate_range() to migrate the in-use hugetlb. Both can break the per-node hugetlb pool, but as this is an explicit offlining operation, no better choice. So should allow the hugetlb allocation fallback. 2) Memory failure: same as memory offline. Should allow fallback to a different node might be the only option to handle it, otherwise the impact of poisoned memory can be amplified. 3) Longterm pinning: will call migrate_longterm_unpinnable_pages() to migrate in-use and not-longterm-pinnable hugetlb, which can break the per-node pool. But we should fail to longterm pinning if can not allocate on current node to avoid breaking the per-node pool. 4) Syscalls (mbind, migrate_pages, move_pages): these are explicit users operation to move pages to other nodes, so fallback to other nodes should not be prohibited. 5) alloc_contig_range: used by CMA allocation and virtio-mem fake-offline to allocate given range of pages. Now the freed hugetlb migration is not allowed to fallback, to keep consistency, the in-use hugetlb migration should be also not allowed to fallback. 6) alloc_contig_pages: used by kfence, pgtable_debug etc. The strategy should be consistent with that of alloc_contig_range(). Based on the analysis of the various scenarios above, introducing a new helper to determine whether fallback is permitted according to the migration reason.. [1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/ Link: https://lkml.kernel.org/r/3519fcd41522817307a05b40fb551e2e17e68101.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e42dfe4e0a |
mm: record the migration reason for struct migration_target_control
Patch series "make the hugetlb migration strategy consistent", v2. As discussed in previous thread [1], there is an inconsistency when handling hugetlb migration. When handling the migration of freed hugetlb, it prevents fallback to other NUMA nodes in alloc_and_dissolve_hugetlb_folio(). However, when dealing with in-use hugetlb, it allows fallback to other NUMA nodes in alloc_hugetlb_folio_nodemask(), which can break the per-node hugetlb pool and might result in unexpected failures when node bound workloads doesn't get what is asssumed available. This patchset tries to make the hugetlb migration strategy more clear and consistent. Please find details in each patch. [1] https://lore.kernel.org/all/6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com/ This patch (of 2): To support different hugetlb allocation strategies during hugetlb migration based on various migration reasons, record the migration reason in the migration_target_control structure as a preparation. Link: https://lkml.kernel.org/r/cover.1709719720.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/7b95d4981e07211f57139fc5b1f7ce91b920cee4.1709719720.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
aaab830ad8 |
mm/vmalloc: eliminated the lock contention from twice to once
When allocating a new memory area where the mapping address range is known, it is observed that the vmap_node->busy.lock is acquired twice. The first acquisition occurs in the alloc_vmap_area() function when inserting the vm area into the vm mapping red-black tree. The second acquisition occurs in the setup_vmalloc_vm() function when updating the properties of the vm, such as flags and address, etc. Combine these two operations together in alloc_vmap_area(), which improves scalability when the vmap_node->busy.lock is contended. By doing so, the need to acquire the lock twice can also be eliminated to once. With the above change, tested on intel sapphire rapids platform(224 vcpu), a 4% performance improvement is gained on stress-ng/pthread(https://github.com/ColinIanKing/stress-ng), which is the stress test of thread creations. Link: https://lkml.kernel.org/r/20240307021440.64967-1-rulin.huang@intel.com Co-developed-by: "Chen, Tim C" <tim.c.chen@intel.com> Signed-off-by: "Chen, Tim C" <tim.c.chen@intel.com> Co-developed-by: "King, Colin" <colin.king@intel.com> Signed-off-by: "King, Colin" <colin.king@intel.com> Signed-off-by: rulinhuang <rulin.huang@intel.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Wangyang Guo <wangyang.guo@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c8d36bc2df |
mm/kmemleak: disable KASAN instrumentation in kmemleak
Kmemleak ia a memory leak checker. KASAN is also a memory checker but it focuses more on finding out-of-bounds and use-after-free bugs. Since kmemleak is inherently slow especially on systems with large number of CPUs, adding KASAN instrumentation will make it slower even more. As kmemleak is not for production use, the utility of enabling KASAN there is questionable. This patch disables KASAN instrumentation for configurations that enable both of them to slightly reduce performance overhead. Link: https://lkml.kernel.org/r/20240307190548.963626-3-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b04da04220 |
mm/kmemleak: compact kmemleak_object further
Patch series "mm/kmemleak: Minor cleanup & performance tuning".
This series contains 2 simple cleanup patches to slightly reduce memory
and performance overhead.
This patch (of 2):
With commit
|
||
|
|
cc9bc36ebe |
mm: zswap: remove nr_zswap_stored atomic
nr_stored was introduced by commit |
||
|
|
883dd161e9 |
mm: page_alloc: batch vmstat updates in expand()
expand() currently updates vmstat for every subpage. This is unnecessary, since they're all of the same zone and migratetype. Count added pages locally, then do a single vmstat update. Link: https://lkml.kernel.org/r/20240327190111.GC7597@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e1f42a577f |
mm: page_alloc: change move_freepages() to __move_freepages_block()
The function is now supposed to be called only on a single pageblock and checks start_pfn and end_pfn accordingly. Rename it to make this more obvious and drop the end_pfn parameter which can be determined trivially and none of the callers use it for anything else. Also make the (now internal) end_pfn exclusive, which is more common. Link: https://lkml.kernel.org/r/81b1d642-2ec0-49f5-89fc-19a3828419ff@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e0932b6c1f |
mm: page_alloc: consolidate free page accounting
Free page accounting currently happens a bit too high up the call stack, where it has to deal with guard pages, compaction capturing, block stealing and even page isolation. This is subtle and fragile, and makes it difficult to hack on the code. Now that type violations on the freelists have been fixed, push the accounting down to where pages enter and leave the freelist. [hannes@cmpxchg.org: undo unrelated drive-by line wrap] Link: https://lkml.kernel.org/r/20240327185736.GA7597@cmpxchg.org [hannes@cmpxchg.org: remove unused page parameter from account_freepages()] Link: https://lkml.kernel.org/r/20240327185831.GB7597@cmpxchg.org [baolin.wang@linux.alibaba.com: fix free page accounting] Link: https://lkml.kernel.org/r/a2a48baca69f103aa431fd201f8a06e3b95e203d.1712648441.git.baolin.wang@linux.alibaba.com [andriy.shevchenko@linux.intel.com: avoid defining unused function] Link: https://lkml.kernel.org/r/20240423161506.2637177-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20240320180429.678181-11-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fd919a85cd |
mm: page_isolation: prepare for hygienic freelists
Page isolation currently sets MIGRATE_ISOLATE on a block, then drops
zone->lock and scans the block for straddling buddies to split up.
Because this happens non-atomically wrt the page allocator, it's possible
for allocations to get a buddy whose first block is a regular pcp
migratetype but whose tail is isolated. This means that in certain cases
memory can still be allocated after isolation. It will also trigger the
freelist type hygiene warnings in subsequent patches.
start_isolate_page_range()
isolate_single_pageblock()
set_migratetype_isolate(tail)
lock zone->lock
move_freepages_block(tail) // nop
set_pageblock_migratetype(tail)
unlock zone->lock
__rmqueue_smallest()
del_page_from_freelist(head)
expand(head, head_mt)
WARN(head_mt != tail_mt)
start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
for (pfn = start_pfn, pfn < end_pfn)
if (PageBuddy())
split_free_page(head)
Introduce a variant of move_freepages_block() provided by the allocator
specifically for page isolation; it moves free pages, converts the block,
and handles the splitting of straddling buddies while holding zone->lock.
The allocator knows that pageblocks and buddies are always naturally
aligned, which means that buddies can only straddle blocks if they're
actually >pageblock_order. This means the search-and-split part can be
simplified compared to what page isolation used to do.
Also tighten up the page isolation code around the expectations of which
pages can be large, and how they are freed.
Based on extensive discussions with and invaluable input from Zi Yan.
[hannes@cmpxchg.org: work around older gcc warning]
Link: https://lkml.kernel.org/r/20240321142426.GB777580@cmpxchg.org
Link: https://lkml.kernel.org/r/20240320180429.678181-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
f37c0f6876 |
mm: page_alloc: set migratetype inside move_freepages()
This avoids changing migratetype after move_freepages() or move_freepages_block(), which is error prone. It also prepares for upcoming changes to fix move_freepages() not moving free pages partially in the range. Link: https://lkml.kernel.org/r/20240320180429.678181-9-hannes@cmpxchg.org Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
55612e80e7 |
mm: page_alloc: close migratetype race between freeing and stealing
There are three freeing paths that read the page's migratetype optimistically before grabbing the zone lock. When this races with block stealing, those pages go on the wrong freelist. The paths in question are: - when freeing >costly orders that aren't THP - when freeing pages to the buddy upon pcp lock contention - when freeing pages that are isolated - when freeing pages initially during boot - when freeing the remainder in alloc_pages_exact() - when "accepting" unaccepted VM host memory before first use - when freeing pages during unpoisoning None of these are so hot that they would need this optimization at the cost of hampering defrag efforts. Especially when contrasted with the fact that the most common buddy freeing path - free_pcppages_bulk - is checking the migratetype under the zone->lock just fine. In addition, isolated pages need to look up the migratetype under the lock anyway, which adds branches to the locked section, and results in a double lookup when the pages are in fact isolated. Move the lookups into the lock. Link: https://lkml.kernel.org/r/20240320180429.678181-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c0cd6f557b |
mm: page_alloc: fix freelist movement during block conversion
Currently, page block type conversion during fallbacks, atomic reservations and isolation can strand various amounts of free pages on incorrect freelists. For example, fallback stealing moves free pages in the block to the new type's freelists, but then may not actually claim the block for that type if there aren't enough compatible pages already allocated. In all cases, free page moving might fail if the block straddles more than one zone, in which case no free pages are moved at all, but the block type is changed anyway. This is detrimental to type hygiene on the freelists. It encourages incompatible page mixing down the line (ask for one type, get another) and thus contributes to long-term fragmentation. Split the process into a proper transaction: check first if conversion will happen, then try to move the free pages, and only if that was successful convert the block to the new type. [baolin.wang@linux.alibaba.com: fix allocation failures with CONFIG_CMA] Link: https://lkml.kernel.org/r/a97697e0-45b0-4f71-b087-fdc7a1d43c0e@linux.alibaba.com Link: https://lkml.kernel.org/r/20240320180429.678181-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2dd482ba62 |
mm: page_alloc: fix move_freepages_block() range error
When a block is partially outside the zone of the cursor page, the function cuts the range to the pivot page instead of the zone start. This can leave large parts of the block behind, which encourages incompatible page mixing down the line (ask for one type, get another), and thus long-term fragmentation. This triggers reliably on the first block in the DMA zone, whose start_pfn is 1. The block is stolen, but everything before the pivot page (which was often hundreds of pages) is left on the old list. Link: https://lkml.kernel.org/r/20240320180429.678181-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b54ccd3c6b |
mm: page_alloc: move free pages when converting block during isolation
When claiming a block during compaction isolation, move any remaining free pages to the correct freelists as well, instead of stranding them on the wrong list. Otherwise, this encourages incompatible page mixing down the line, and thus long-term fragmentation. Link: https://lkml.kernel.org/r/20240320180429.678181-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e6cf9e1c4c |
mm: page_alloc: fix up block types when merging compatible blocks
The buddy allocator coalesces compatible blocks during freeing, but it doesn't update the types of the subblocks to match. When an allocation later breaks the chunk down again, its pieces will be put on freelists of the wrong type. This encourages incompatible page mixing (ask for one type, get another), and thus long-term fragmentation. Update the subblocks when merging a larger chunk, such that a later expand() will maintain freelist type hygiene. Link: https://lkml.kernel.org/r/20240320180429.678181-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9cbe97bad5 |
mm: page_alloc: optimize free_unref_folios()
Move direct freeing of isolated pages to the lock-breaking block in the second loop. This saves an unnecessary migratetype reassessment. Minor comment and local variable scoping cleanups. Link: https://lkml.kernel.org/r/20240320180429.678181-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
17edeb5d3f |
mm: page_alloc: remove pcppage migratetype caching
Patch series "mm: page_alloc: freelist migratetype hygiene", v4.
The page allocator's mobility grouping is intended to keep unmovable pages
separate from reclaimable/compactable ones to allow on-demand
defragmentation for higher-order allocations and huge pages.
Currently, there are several places where accidental type mixing occurs:
an allocation asks for a page of a certain migratetype and receives
another. This ruins pageblocks for compaction, which in turn makes
allocating huge pages more expensive and less reliable.
The series addresses those causes. The last patch adds type checks on all
freelist movements to prevent new violations being introduced.
The benefits can be seen in a mixed workload that stresses the machine
with a memcache-type workload and a kernel build job while periodically
attempting to allocate batches of THP. The following data is aggregated
over 50 consecutive defconfig builds:
VANILLA PATCHED
Hugealloc Time mean 165843.93 ( +0.00%) 113025.88 ( -31.85%)
Hugealloc Time stddev 158957.35 ( +0.00%) 114716.07 ( -27.83%)
Kbuild Real time 310.24 ( +0.00%) 300.73 ( -3.06%)
Kbuild User time 1271.13 ( +0.00%) 1259.42 ( -0.92%)
Kbuild System time 582.02 ( +0.00%) 559.79 ( -3.81%)
THP fault alloc 30585.14 ( +0.00%) 40853.62 ( +33.57%)
THP fault fallback 36626.46 ( +0.00%) 26357.62 ( -28.04%)
THP fault fail rate % 54.49 ( +0.00%) 39.22 ( -27.53%)
Pagealloc fallback 1328.00 ( +0.00%) 1.00 ( -99.85%)
Pagealloc type mismatch 181009.50 ( +0.00%) 0.00 ( -100.00%)
Direct compact stall 434.56 ( +0.00%) 257.66 ( -40.61%)
Direct compact fail 421.70 ( +0.00%) 249.94 ( -40.63%)
Direct compact success 12.86 ( +0.00%) 7.72 ( -37.09%)
Direct compact success rate % 2.86 ( +0.00%) 2.82 ( -0.96%)
Compact daemon scanned migrate 3370059.62 ( +0.00%) 3612054.76 ( +7.18%)
Compact daemon scanned free 7718439.20 ( +0.00%) 5386385.02 ( -30.21%)
Compact direct scanned migrate 309248.62 ( +0.00%) 176721.04 ( -42.85%)
Compact direct scanned free 433582.84 ( +0.00%) 315727.66 ( -27.18%)
Compact migrate scanned daemon % 91.20 ( +0.00%) 94.48 ( +3.56%)
Compact free scanned daemon % 94.58 ( +0.00%) 94.42 ( -0.16%)
Compact total migrate scanned 3679308.24 ( +0.00%) 3788775.80 ( +2.98%)
Compact total free scanned 8152022.04 ( +0.00%) 5702112.68 ( -30.05%)
Alloc stall 872.04 ( +0.00%) 5156.12 ( +490.71%)
Pages kswapd scanned 510645.86 ( +0.00%) 3394.94 ( -99.33%)
Pages kswapd reclaimed 134811.62 ( +0.00%) 2701.26 ( -98.00%)
Pages direct scanned 99546.06 ( +0.00%) 376407.52 ( +278.12%)
Pages direct reclaimed 62123.40 ( +0.00%) 289535.70 ( +366.06%)
Pages total scanned 610191.92 ( +0.00%) 379802.46 ( -37.76%)
Pages scanned kswapd % 76.36 ( +0.00%) 0.10 ( -98.58%)
Swap out 12057.54 ( +0.00%) 15022.98 ( +24.59%)
Swap in 209.16 ( +0.00%) 256.48 ( +22.52%)
File refaults 17701.64 ( +0.00%) 11765.40 ( -33.53%)
Huge page success rate is higher, allocation latencies are shorter and
more predictable.
Stealing (fallback) rate is drastically reduced. Notably, while the
vanilla kernel keeps doing fallbacks on an ongoing basis, the patched
kernel enters a steady state once the distribution of block types is
adequate for the workload. Steals over 50 runs:
VANILLA PATCHED
1504.0 227.0
1557.0 6.0
1391.0 13.0
1080.0 26.0
1057.0 40.0
1156.0 6.0
805.0 46.0
736.0 20.0
1747.0 2.0
1699.0 34.0
1269.0 13.0
1858.0 12.0
907.0 4.0
727.0 2.0
563.0 2.0
3094.0 2.0
10211.0 3.0
2621.0 1.0
5508.0 2.0
1060.0 2.0
538.0 3.0
5773.0 2.0
2199.0 0.0
3781.0 2.0
1387.0 1.0
4977.0 0.0
2865.0 1.0
1814.0 1.0
3739.0 1.0
6857.0 0.0
382.0 0.0
407.0 1.0
3784.0 0.0
297.0 0.0
298.0 0.0
6636.0 0.0
4188.0 0.0
242.0 0.0
9960.0 0.0
5816.0 0.0
354.0 0.0
287.0 0.0
261.0 0.0
140.0 1.0
2065.0 0.0
312.0 0.0
331.0 0.0
164.0 0.0
465.0 1.0
219.0 0.0
Type mismatches are down too. Those count every time an allocation
request asks for one migratetype and gets another. This can still occur
minimally in the patched kernel due to non-stealing fallbacks, but it's
quite rare and follows the pattern of overall fallbacks - once the block
type distribution settles, mismatches cease as well:
VANILLA: PATCHED:
182602.0 268.0
135794.0 20.0
88619.0 19.0
95973.0 0.0
129590.0 0.0
129298.0 0.0
147134.0 0.0
230854.0 0.0
239709.0 0.0
137670.0 0.0
132430.0 0.0
65712.0 0.0
57901.0 0.0
67506.0 0.0
63565.0 4.0
34806.0 0.0
42962.0 0.0
32406.0 0.0
38668.0 0.0
61356.0 0.0
57800.0 0.0
41435.0 0.0
83456.0 0.0
65048.0 0.0
28955.0 0.0
47597.0 0.0
75117.0 0.0
55564.0 0.0
38280.0 0.0
52404.0 0.0
26264.0 0.0
37538.0 0.0
19671.0 0.0
30936.0 0.0
26933.0 0.0
16962.0 0.0
44554.0 0.0
46352.0 0.0
24995.0 0.0
35152.0 0.0
12823.0 0.0
21583.0 0.0
18129.0 0.0
31693.0 0.0
28745.0 0.0
33308.0 0.0
31114.0 0.0
35034.0 0.0
12111.0 0.0
24885.0 0.0
Compaction work is markedly reduced despite much better THP rates.
In the vanilla kernel, reclaim seems to have been driven primarily by
watermark boosting that happens as a result of fallbacks. With those all
but eliminated, watermarks average lower and kswapd does less work. The
uptick in direct reclaim is because THP requests have to fend for
themselves more often - which is intended policy right now. Aggregate
reclaim activity is lowered significantly, though.
This patch (of 10):
The idea behind the cache is to save get_pageblock_migratetype() lookups
during bulk freeing. A microbenchmark suggests this isn't helping,
though. The pcp migratetype can get stale, which means that bulk freeing
has an extra branch to check if the pageblock was isolated while on the
pcp.
While the variance overlaps, the cache write and the branch seem to make
this a net negative. The following test allocates and frees batches of
10,000 pages (~3x the pcp high marks to trigger flushing):
Before:
8,668.48 msec task-clock # 99.735 CPUs utilized ( +- 2.90% )
19 context-switches # 4.341 /sec ( +- 3.24% )
0 cpu-migrations # 0.000 /sec
17,440 page-faults # 3.984 K/sec ( +- 2.90% )
41,758,692,473 cycles # 9.541 GHz ( +- 2.90% )
126,201,294,231 instructions # 5.98 insn per cycle ( +- 2.90% )
25,348,098,335 branches # 5.791 G/sec ( +- 2.90% )
33,436,921 branch-misses # 0.26% of all branches ( +- 2.90% )
0.0869148 +- 0.0000302 seconds time elapsed ( +- 0.03% )
After:
8,444.81 msec task-clock # 99.726 CPUs utilized ( +- 2.90% )
22 context-switches # 5.160 /sec ( +- 3.23% )
0 cpu-migrations # 0.000 /sec
17,443 page-faults # 4.091 K/sec ( +- 2.90% )
40,616,738,355 cycles # 9.527 GHz ( +- 2.90% )
126,383,351,792 instructions # 6.16 insn per cycle ( +- 2.90% )
25,224,985,153 branches # 5.917 G/sec ( +- 2.90% )
32,236,793 branch-misses # 0.25% of all branches ( +- 2.90% )
0.0846799 +- 0.0000412 seconds time elapsed ( +- 0.05% )
A side effect is that this also ensures that pages whose pageblock gets
stolen while on the pcplist end up on the right freelist and we don't
perform potentially type-incompatible buddy merges (or skip merges when we
shouldn't), which is likely beneficial to long-term fragmentation
management, although the effects would be harder to measure. Settle for
simpler and faster code as justification here.
Link: https://lkml.kernel.org/r/20240320180429.678181-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20240320180429.678181-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: "Huang, Ying" <ying.huang@intel.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
42a346b41c |
hugetlb: remove mention of destructors
We no longer have destructors or dtors, merely a page flag (technically a page type flag, but that's an implementation detail). Remove __clear_hugetlb_destructor, fix up comments and the occasional variable name. Link: https://lkml.kernel.org/r/20240321142448.1645400-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8f790d0c7c |
mm: improve dumping of mapcount and page_type
For pages that have a page_type, set the mapcount to 0, which will reduce
the confusion in people reading page dumps ("Why does this page have a
mapcount of -128?"). Now that hugetlbfs is a page_type, read the
entire_mapcount for any large folio; this is fine for all folios as no
user reuses the entire_mapcount field.
For pages which do not have a page type, do not print it to reduce
clutter.
Link: https://lkml.kernel.org/r/20240321142448.1645400-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
46df8e73a4 |
mm: free up PG_slab
Reclaim the Slab page flag by using a spare bit in PageType. We are perennially short of page flags for various purposes, and now that the original SLAB allocator has been retired, SLUB does not use the mapcount/page_type field. This lets us remove a number of special cases for ignoring mapcount on Slab pages. [willy@infradead.org: update vmcoreinfo] Link: https://lkml.kernel.org/r/ZgGV-O8WYQ_83kxp@casper.infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
85edc15a4c |
mm: remove folio_prep_large_rmappable()
Now that prep_compound_page() initialises folio->_deferred_list, folio_prep_large_rmappable()'s only purpose is to set the large_rmappable flag, so inline it into the two callers. Take the opportunity to convert the large_rmappable definition from PAGEFLAG to FOLIO_FLAG and remove the existance of PageTestLargeRmappable and friends. Link: https://lkml.kernel.org/r/20240321142448.1645400-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b7b098cf00 |
mm: always initialise folio->_deferred_list
Patch series "Various significant MM patches". These patches all interact in annoying ways which make it tricky to send them out in any way other than a big batch, even though there's not really an overarching theme to connect them. The big effects of this patch series are: - folio_test_hugetlb() becomes reliable, even when called without a page reference - We free up PG_slab, and we could always use more page flags - We no longer need to check PageSlab before calling page_mapcount() This patch (of 9): For compound pages which are at least order-2 (and hence have a deferred_list), initialise it and then we can check at free that the page is not part of a deferred list. We recently found this useful to rule out a source of corruption. [peterx@redhat.com: always initialise folio->_deferred_list] Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3b89ec4174 |
mm/slub: avoid recursive loop with kmemleak
The system will immediate fill up stack and crash when both CONFIG_DEBUG_KMEMLEAK and CONFIG_MEM_ALLOC_PROFILING are enabled. Avoid allocation tagging of kmemleak caches, otherwise recursive allocation tracking occurs. Link: https://lkml.kernel.org/r/20240425205516.work.220-kees@kernel.org Fixes: 279bb991b4d9 ("mm/slab: add allocation accounting into slab allocation and free paths") Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
09c46563ff |
codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations
If slabobj_ext vector allocation for a slab object fails and later on it succeeds for another object in the same slab, the slabobj_ext for the original object will be NULL and will be flagged in case when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled. Mark failed slabobj_ext vector allocations using a new objext_flags flag stored in the lower bits of slab->obj_exts. When new allocation succeeds it marks all tag references in the same slabobj_ext vector as empty to avoid warnings implemented by CONFIG_MEM_ALLOC_PROFILING_DEBUG checks. Link: https://lkml.kernel.org/r/20240321163705.3067592-36-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d224eb0287 |
codetag: debug: mark codetags for reserved pages as empty
To avoid debug warnings while freeing reserved pages which were not allocated with usual allocators, mark their codetags as empty before freeing. Link: https://lkml.kernel.org/r/20240321163705.3067592-35-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
239d6c96d8 |
codetag: debug: skip objext checking when it's for objext itself
objext objects are created with __GFP_NO_OBJ_EXT flag and therefore have no corresponding objext themselves (otherwise we would get an infinite recursion). When freeing these objects their codetag will be empty and when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled this will lead to false warnings. Introduce CODETAG_EMPTY special codetag value to mark allocations which intentionally lack codetag to avoid these warnings. Set objext codetags to CODETAG_EMPTY before freeing to indicate that the codetag is expected to be empty. Link: https://lkml.kernel.org/r/20240321163705.3067592-34-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1438d349d1 |
lib: add memory allocations report in show_mem()
Include allocations in show_mem reports. Link: https://lkml.kernel.org/r/20240321163705.3067592-33-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
88ae5fb755 |
mm: vmalloc: enable memory allocation profiling
This wrapps all external vmalloc allocation functions with the alloc_hooks() wrapper, and switches internal allocations to _noprof variants where appropriate, for the new memory allocation profiling feature. [surenb@google.com: arch/um: fix forward declaration for vmalloc] Link: https://lkml.kernel.org/r/20240326073750.726636-1-surenb@google.com [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-5-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-31-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
24e44cc22a |
mm: percpu: enable per-cpu allocation tagging
Redefine __alloc_percpu, __alloc_percpu_gfp and __alloc_reserved_percpu to record allocations and deallocations done by these functions. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-6-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-30-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
60fa4a9e23 |
mm: percpu: add codetag reference into pcpuobj_ext
To store codetag for every per-cpu allocation, a codetag reference is embedded into pcpuobj_ext when CONFIG_MEM_ALLOC_PROFILING=y. Hooks to use the newly introduced codetag are added. Link: https://lkml.kernel.org/r/20240321163705.3067592-29-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
8f30d2660a |
mm: percpu: introduce pcpuobj_ext
Upcoming alloc tagging patches require a place to stash per-allocation metadata. We already do this when memcg is enabled, so this patch generalizes the obj_cgroup * vector in struct pcpu_chunk by creating a pcpu_obj_ext type, which we will be adding to in an upcoming patch - similarly to the previous slabobj_ext patch. Link: https://lkml.kernel.org/r/20240321163705.3067592-28-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: linux-mm@kvack.org Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e26d8769da |
mempool: hook up to memory allocation profiling
This adds hooks to mempools for correctly annotating mempool-backed allocations at the correct source line, so they show up correctly in /sys/kernel/debug/allocations. Various inline functions are converted to wrappers so that we can invoke alloc_hooks() in fewer places. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-4-surenb@google.com [surenb@google.com: add missing mempool_create_node documentation] Link: https://lkml.kernel.org/r/20240402180835.1661905-1-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-27-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7bd230a266 |
mm/slab: enable slab allocation tagging for kmalloc and friends
Redefine kmalloc, krealloc, kzalloc, kcalloc, etc. to record allocations and deallocations done by these functions. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-7-surenb@google.com [rdunlap@infradead.org: fix kcalloc() kernel-doc warnings] Link: https://lkml.kernel.org/r/20240327044649.9199-1-rdunlap@infradead.org Link: https://lkml.kernel.org/r/20240321163705.3067592-26-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4b87369646 |
mm/slab: add allocation accounting into slab allocation and free paths
Account slab allocations using codetag reference embedded into slabobj_ext. Link: https://lkml.kernel.org/r/20240321163705.3067592-24-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
26865a1bfa |
mm/page_ext: enable early_page_ext when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
For all page allocations to be tagged, page_ext has to be initialized before the first page allocation. Early tasks allocate their stacks using page allocator before alloc_node_page_ext() initializes page_ext area, unless early_page_ext is enabled. Therefore these allocations will generate a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled. Enable early_page_ext whenever CONFIG_MEM_ALLOC_PROFILING_DEBUG=y to ensure page_ext initialization prior to any page allocation. This will have all the negative effects associated with early_page_ext, such as possible longer boot time, therefore we enable it only when debugging with CONFIG_MEM_ALLOC_PROFILING_DEBUG enabled and not universally for CONFIG_MEM_ALLOC_PROFILING. Link: https://lkml.kernel.org/r/20240321163705.3067592-22-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
cc92eba1c8 |
mm: fix non-compound multi-order memory accounting in __free_pages
When a non-compound multi-order page is freed, it is possible that a speculative reference keeps the page pinned. In this case we free all pages except for the first page, which will be freed later by the last put_page(). However the page passed to put_page() is indistinguishable from an order-0 page, so it cannot do the accounting, just as it cannot free the subsequent pages. Do the accounting here, where we free the pages. Link: https://lkml.kernel.org/r/20240321163705.3067592-21-surenb@google.com Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
be25d1d4e8 |
mm: create new codetag references during page splitting
When a high-order page is split into smaller ones, each newly split page should get its codetag. After the split each split page will be referencing the original codetag. The codetag's "bytes" counter remains the same because the amount of allocated memory has not changed, however the "calls" counter gets increased to keep the counter correct when these individual pages get freed. Link: https://lkml.kernel.org/r/20240321163705.3067592-20-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b951aaff50 |
mm: enable page allocation tagging
Redefine page allocators to record allocation tags upon their invocation. Instrument post_alloc_hook and free_pages_prepare to modify current allocation tag. [surenb@google.com: undo _noprof additions in the documentation] Link: https://lkml.kernel.org/r/20240326231453.1206227-3-surenb@google.com Link: https://lkml.kernel.org/r/20240321163705.3067592-19-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
dcfe378c81 |
lib: introduce support for page allocation tagging
Introduce helper functions to easily instrument page allocators by storing a pointer to the allocation tag associated with the code that allocated the page in a page_ext field. Link: https://lkml.kernel.org/r/20240321163705.3067592-15-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Co-developed-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
53ce720359 |
slab: objext: introduce objext_flags as extension to page_memcg_data_flags
Introduce objext_flags to store additional objext flags unrelated to memcg. Link: https://lkml.kernel.org/r/20240321163705.3067592-10-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
45012241ec |
mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
Slab extension objects can't be allocated before slab infrastructure is initialized. Some caches, like kmem_cache and kmem_cache_node, are created before slab infrastructure is initialized. Objects from these caches can't have extension objects. Introduce SLAB_NO_OBJ_EXT slab flag to mark these caches and avoid creating extensions for objects allocated from these slabs. Link: https://lkml.kernel.org/r/20240321163705.3067592-9-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
768c33be1b |
mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext creation
Introduce __GFP_NO_OBJ_EXT flag in order to prevent recursive allocations when allocating slabobj_ext on a slab. Link: https://lkml.kernel.org/r/20240321163705.3067592-8-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
21c690a349 |
mm: introduce slabobj_ext to support slab object extensions
Currently slab pages can store only vectors of obj_cgroup pointers in page->memcg_data. Introduce slabobj_ext structure to allow more data to be stored for each slab object. Wrap obj_cgroup into slabobj_ext to support current functionality while allowing to extend slabobj_ext in the future. Link: https://lkml.kernel.org/r/20240321163705.3067592-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9ea9cd8e61 |
mm/slub: mark slab_free_freelist_hook() __always_inline
It seems we need to be more forceful with the compiler on this one. This is done for performance reasons only. Link: https://lkml.kernel.org/r/20240321163705.3067592-4-surenb@google.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kees Cook <keescook@chromium.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andreas Hindborg <a.hindborg@samsung.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
0069455bcb |
fix missing vmalloc.h includes
Patch series "Memory allocation profiling", v6.
Overview:
Low overhead [1] per-callsite memory allocation profiling. Not just for
debug kernels, overhead low enough to be deployed in production.
Example output:
root@moria-kvm:~# sort -rn /proc/allocinfo
127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
56373248 4737 mm/slub.c:2259 func:alloc_slab_page
14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
3940352 962 mm/memory.c:4214 func:alloc_anon_folio
2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
...
Usage:
kconfig options:
- CONFIG_MEM_ALLOC_PROFILING
- CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
- CONFIG_MEM_ALLOC_PROFILING_DEBUG
adds warnings for allocations that weren't accounted because of a
missing annotation
sysctl:
/proc/sys/vm/mem_profiling
Runtime info:
/proc/allocinfo
Notes:
[1]: Overhead
To measure the overhead we are comparing the following configurations:
(1) Baseline with CONFIG_MEMCG_KMEM=n
(2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
(3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
(4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
(5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
(6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
(7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize the noise. Below are results
from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
56 core Intel Xeon:
kmalloc pgalloc
(1 baseline) 6.764s 16.902s
(2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%)
(3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%)
(4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%)
(5 memcg) 13.388s (+97.94%) 48.460s (+186.71%)
(6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%)
(7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%)
Memory overhead:
Kernel size:
text data bss dec diff
(1) 26515311 18890222 17018880 62424413
(2) 26524728 19423818 16740352 62688898 264485
(3) 26524724 19423818 16740352 62688894 264481
(4) 26524728 19423818 16740352 62688898 264485
(5) 26541782 18964374 16957440 62463596 39183
Memory consumption on a 56 core Intel CPU with 125GB of memory:
Code tags: 192 kB
PageExts: 262144 kB (256MB)
SlabExts: 9876 kB (9.6MB)
PcpuExts: 512 kB (0.5MB)
Total overhead is 0.2% of total memory.
Benchmarks:
Hackbench tests run 100 times:
hackbench -s 512 -l 200 -g 15 -f 25 -P
baseline disabled profiling enabled profiling
avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023)
stdev 0.0137 0.0188 0.0077
hackbench -l 10000
baseline disabled profiling enabled profiling
avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859)
stdev 0.0933 0.0286 0.0489
stress-ng tests:
stress-ng --class memory --seq 4 -t 60
stress-ng --class cpu --seq 4 -t 60
Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
[2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/
This patch (of 37):
The next patch drops vmalloc.h from a system header in order to fix a
circular dependency; this adds it to all the files that were pulling it in
implicitly.
[kent.overstreet@linux.dev: fix arch/alpha/lib/memcpy.c]
Link: https://lkml.kernel.org/r/20240327002152.3339937-1-kent.overstreet@linux.dev
[surenb@google.com: fix arch/x86/mm/numa_32.c]
Link: https://lkml.kernel.org/r/20240402180933.1663992-1-surenb@google.com
[kent.overstreet@linux.dev: a few places were depending on sizes.h]
Link: https://lkml.kernel.org/r/20240404034744.1664840-1-kent.overstreet@linux.dev
[arnd@arndb.de: fix mm/kasan/hw_tags.c]
Link: https://lkml.kernel.org/r/20240404124435.3121534-1-arnd@kernel.org
[surenb@google.com: fix arc build]
Link: https://lkml.kernel.org/r/20240405225115.431056-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-2-surenb@google.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Andreas Hindborg <a.hindborg@samsung.com>
Cc: Benno Lossin <benno.lossin@proton.me>
Cc: "Björn Roy Baron" <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Gary Guo <gary@garyguo.net>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
2ccd48ce35 |
percpu: clean up all mappings when pcpu_map_pages() fails
In pcpu_map_pages(), if __pcpu_map_pages() fails on a CPU, we call __pcpu_unmap_pages() to clean up mappings on all CPUs where mappings were created, but not on the CPU where __pcpu_map_pages() fails. __pcpu_map_pages() and __pcpu_unmap_pages() are wrappers around vmap_pages_range_noflush() and vunmap_range_noflush(). All other callers of vmap_pages_range_noflush() call vunmap_range_noflush() when mapping fails, except pcpu_map_pages(). The reason could be that partial mappings may be left behind from a failed mapping attempt. Call __pcpu_unmap_pages() for the failed CPU as well in pcpu_map_pages(). This was found by code inspection, no failures or bugs were observed. Link: https://lkml.kernel.org/r/20240311194346.2291333-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Christoph Lameter (Ampere) <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
133d04b1ee |
mm/numa_balancing: allow migrate on protnone reference with MPOL_PREFERRED_MANY policy
commit |
||
|
|
f8fd525ba3 |
mm/mempolicy: use numa_node_id() instead of cpu_to_node()
Patch series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY
policy:, v4.
This patchset is to optimize the cross-socket memory access with
MPOL_PREFERRED_MANY policy.
To test this patch we ran the following test on a 3 node system.
Node 0 - 2GB - Tier 1
Node 1 - 11GB - Tier 1
Node 6 - 10GB - Tier 2
Below changes are made to memcached to set the memory policy,
It select Node0 and Node1 as preferred nodes.
#include <numaif.h>
#include <numa.h>
unsigned long nodemask;
int ret;
nodemask = 0x03;
ret = set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_NUMA_BALANCING,
&nodemask, 10);
/* If MPOL_F_NUMA_BALANCING isn't supported,
* fall back to MPOL_PREFERRED_MANY */
if (ret < 0 && errno == EINVAL){
printf("set mem policy normal\n");
ret = set_mempolicy(MPOL_PREFERRED_MANY, &nodemask, 10);
}
if (ret < 0) {
perror("Failed to call set_mempolicy");
exit(-1);
}
Test Procedure:
===============
1. Make sure memory tiring and demotion are enabled.
2. Start memcached.
# ./memcached -b 100000 -m 204800 -u root -c 1000000 -t 7
-d -s "/tmp/memcached.sock"
3. Run memtier_benchmark to store 3200000 keys.
#./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
--threads=1 --pipeline=1 --ratio=1:0 --key-pattern=S:S --key-minimum=1
--key-maximum=3200000 -n allkeys -c 1 -R -x 1 -d 1024
4. Start a memory eater on node 0 and 1. This will demote all memcached
pages to node 6.
5. Make sure all the memcached pages got demoted to lower tier by reading
/proc/<memcaced PID>/numa_maps.
# cat /proc/2771/numa_maps
---
default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
default anon=1009 dirty=1009 active=0 N6=1009 kernelpagesize_kB=64
---
6. Kill memory eater.
7. Read the pgpromote_success counter.
8. Start reading the keys by running memtier_benchmark.
#./memtier_benchmark -S "/tmp/memcached.sock" --protocol=memcache_binary
--pipeline=1 --distinct-client-seed --ratio=0:3 --key-pattern=R:R
--key-minimum=1 --key-maximum=3200000 -n allkeys
--threads=64 -c 1 -R -x 6
9. Read the pgpromote_success counter.
Test Results:
=============
Without Patch
------------------
1. pgpromote_success before test
Node 0: pgpromote_success 11
Node 1: pgpromote_success 140974
pgpromote_success after test
Node 0: pgpromote_success 11
Node 1: pgpromote_success 140974
2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
==================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
------------------------------------------------------------------
Sets 0.00 --- --- --- ---
Gets 305792.03 305791.93 0.10 0.18949 0.16700
Waits 0.00 --- --- --- ---
Totals 305792.03 305791.93 0.10 0.18949 0.16700
======================================
p99 Latency p99.9 Latency KB/sec
-------------------------------------
--- --- 0.00
0.44700 1.71100 11542.69
--- --- ---
0.44700 1.71100 11542.69
With Patch
---------------
1. pgpromote_success before test
Node 0: pgpromote_success 5
Node 1: pgpromote_success 89386
pgpromote_success after test
Node 0: pgpromote_success 57895
Node 1: pgpromote_success 141463
2. Memtier-benchmark result.
AGGREGATED AVERAGE RESULTS (6 runs)
====================================================================
Type Ops/sec Hits/sec Misses/sec Avg. Latency p50 Latency
--------------------------------------------------------------------
Sets 0.00 --- --- --- ---
Gets 521942.24 521942.07 0.17 0.11459 0.10300
Waits 0.00 --- --- --- ---
Totals 521942.24 521942.07 0.17 0.11459 0.10300
=======================================
p99 Latency p99.9 Latency KB/sec
---------------------------------------
--- --- 0.00
0.23100 0.31900 19701.68
--- --- ---
0.23100 0.31900 19701.68
Test Result Analysis:
=====================
1. With patch we could observe pages are getting promoted.
2. Memtier-benchmark results shows that, with the patch,
performance has increased more than 50%.
Ops/sec without fix - 305792.03
Ops/sec with fix - 521942.24
This patch (of 2):
Instead of using 'cpu_to_node()', we use 'numa_node_id()', which is
quicker. smp_processor_id is guaranteed to be stable in the
'mpol_misplaced()' function because it is called with ptl held.
lockdep_assert_held was added to ensure that.
No functional change in this patch.
[donettom@linux.ibm.com: add "* @vmf: structure describing the fault" comment]
Link: https://lkml.kernel.org/r/d8b993ea9dccfac0bc3ed61d3a81f4ac5f376e46.1711002865.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/6059f034f436734b472d066db69676fb3a459864.1711373653.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/cover.1709909210.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/744646531af02cc687cde8ae788fb1779e99d02c.1709909210.git.donettom@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
fea68a7565 |
mm: zswap: remove unnecessary check in zswap_find_zpool()
zswap_find_zpool() checks if ZSWAP_NR_ZPOOLS > 1, which is always true. This is a remnant from a patch version that had ZSWAP_NR_ZPOOLS as a config option and never made it upstream. Remove the unnecessary check. Link: https://lkml.kernel.org/r/20240311235210.2937484-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
4196b48ddd |
mm: zpool: return pool size in pages
All zswap backends track their pool sizes in pages. Currently they multiply by PAGE_SIZE for zswap, only for zswap to divide again in order to do limit math. Report pages directly. Link: https://lkml.kernel.org/r/20240312153901.3441-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
91cdcd8d62 |
mm: zswap: optimize zswap pool size tracking
Profiling the munmap() of a zswapped memory region shows 60% of the total
cycles currently going into updating the zswap_pool_total_size.
There are three consumers of this counter:
- store, to enforce the globally configured pool limit
- meminfo & debugfs, to report the size to the user
- shrink, to determine the batch size for each cycle
Instead of aggregating everytime an entry enters or exits the zswap
pool, aggregate the value from the zpools on-demand:
- Stores aggregate the counter anyway upon success. Aggregating to
check the limit instead is the same amount of work.
- Meminfo & debugfs might benefit somewhat from a pre-aggregated
counter, but aren't exactly hotpaths.
- Shrinking can aggregate once for every cycle instead of doing it for
every freed entry. As the shrinker might work on tens or hundreds of
objects per scan cycle, this is a large reduction in aggregations.
The paths that benefit dramatically are swapin, swapoff, and unmaps.
There could be millions of pages being processed until somebody asks for
the pool size again. This eliminates the pool size updates from those
paths entirely.
Top profile entries for a 24G range munmap(), before:
38.54% zswap-unmap [kernel.kallsyms] [k] zs_zpool_total_size
12.51% zswap-unmap [kernel.kallsyms] [k] zpool_get_total_size
9.10% zswap-unmap [kernel.kallsyms] [k] zswap_update_total_size
2.95% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
2.88% zswap-unmap [kernel.kallsyms] [k] __slab_free
2.86% zswap-unmap [kernel.kallsyms] [k] xas_store
and after:
7.70% zswap-unmap [kernel.kallsyms] [k] __slab_free
7.16% zswap-unmap [kernel.kallsyms] [k] obj_cgroup_uncharge_zswap
6.74% zswap-unmap [kernel.kallsyms] [k] xas_store
It was also briefly considered to move to a single atomic in zswap
that is updated by the backends, since zswap only cares about the sum
of all pools anyway. However, zram directly needs per-pool information
out of zsmalloc. To keep the backend from having to update two atomics
every time, I opted for the lazy aggregation instead for now.
Link: https://lkml.kernel.org/r/20240312153901.3441-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
1965e933dd |
mm/treewide: replace pXd_huge() with pXd_leaf()
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/20240318200404.448346-12-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
7db86dc389 |
mm/gup: merge pXd huge mapping checks
Huge mapping checks in GUP are slightly redundant and can be simplified. pXd_huge() now is the same as pXd_leaf(). pmd_trans_huge() and pXd_devmap() should both imply pXd_leaf(). Time to merge them into one. Link: https://lkml.kernel.org/r/20240318200404.448346-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
089f92141e |
mm/gup: check p4d presence before going on
Currently there should have no p4d swap entries so it may not matter much, however this may help us to rule out swap entries in pXd_huge() API, which will include p4d_huge(). The p4d_present() checks make it 100% clear that we won't rely on p4d_huge() for swap entries. Link: https://lkml.kernel.org/r/20240318200404.448346-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
e6fd5564c0 |
mm/gup: cache p4d in follow_p4d_mask()
Add a variable to cache p4d in follow_p4d_mask(). It's a good practise to make sure all the following checks will have a consistent view of the entry. Link: https://lkml.kernel.org/r/20240318200404.448346-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
9abc71b47b |
mm/hmm: process pud swap entry without pud_huge()
Swap pud entries do not always return true for pud_huge() for all archs. x86 and sparc (so far) allow it, but all the rest do not accept a swap entry to be reported as pud_huge(). So it's not safe to check swap entries within pud_huge(). Check swap entries before pud_huge(), so it should be always safe. This is the only place in the kernel that (IMHO, wrongly) relies on pud_huge() to return true on pud swap entries. The plan is to cleanup pXd_huge() to only report non-swap mappings for all archs. Link: https://lkml.kernel.org/r/20240318200404.448346-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Mark Salter <msalter@redhat.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
55f77df7d7 |
mm: page_alloc: control latency caused by zone PCP draining
Patch series "mm/treewide: Remove pXd_huge() API", v2. In previous work [1], we removed the pXd_large() API, which is arch specific. This patchset further removes the hugetlb pXd_huge() API. Hugetlb was never special on creating huge mappings when compared with other huge mappings. Having a standalone API just to detect such pgtable entries is more or less redundant, especially after the pXd_leaf() API set is introduced with/without CONFIG_HUGETLB_PAGE. When looking at this problem, a few issues are also exposed that we don't have a clear definition of the *_huge() variance API. This patchset started by cleaning these issues first, then replace all *_huge() users to use *_leaf(), then drop all *_huge() code. On x86/sparc, swap entries will be reported "true" in pXd_huge(), while for all the rest archs they're reported "false" instead. This part is done in patch 1-5, in which I suspect patch 1 can be seen as a bug fix, but I'll leave that to hmm experts to decide. Besides, there are three archs (arm, arm64, powerpc) that have slightly different definitions between the *_huge() v.s. *_leaf() variances. I tackled them separately so that it'll be easier for arch experts to chim in when necessary. This part is done in patch 6-9. The final patches 10-14 do the rest on the final removal, since *_leaf() will be the ultimate API in the future, and we seem to have quite some confusions on how *_huge() APIs can be defined, provide a rich comment for *_leaf() API set to define them properly to avoid future misuse, and hopefully that'll also help new archs to start support huge mappings and avoid traps (like either swap entries, or PROT_NONE entry checks). [1] https://lore.kernel.org/r/20240305043750.93762-1-peterx@redhat.com This patch (of 14): When the complete PCP is drained a much larger number of pages than the usual batch size might be freed at once, causing large IRQ and preemption latency spikes, as they are all freed while holding the pcp and zone spinlocks. To avoid those latency spikes, limit the number of pages freed in a single bulk operation to common batch limits. Link: https://lkml.kernel.org/r/20240318200404.448346-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240318200736.2835502-1-l.stach@pengutronix.de Signed-off-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bjorn Andersson <andersson@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fabio Estevam <festevam@denx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Dybcio <konrad.dybcio@linaro.org> Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Cc: Mark Salter <msalter@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fa9fcd8bb6 |
mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE)
We changed faultin_page_range() to no longer consume a VMA, because faultin_page_range() might internally release the mm lock to lookup the VMA again -- required to cleanly handle VM_FAULT_RETRY. But independent of that, __get_user_pages() will always lookup the VMA itself. Now that we let __get_user_pages() just handle VMA checks in a way that is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise() is just overhead. So let's just call madvise_populate() on the full range instead. There is one change in behavior: madvise_walk_vmas() would skip any VMA holes, and if everything succeeded, it would return -ENOMEM after processing all VMAs. However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to notice any difference: -ENOMEM might either indicate that there were VMA holes or that populating page tables failed because there was not enough memory. So it's unlikely that user space will notice the difference, and that special handling likely only makes sense for some other madvise() actions. Further, we'd already fail with -ENOMEM early in the past if looking up the VMA after dropping the MM lock failed because of concurrent VMA modifications. So let's just keep it simple and avoid the madvise VMA walk, and consistently fail early if we find a VMA hole. Link: https://lkml.kernel.org/r/20240314161300.382526-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
91b71e78b8 |
mm: memcg: add NULL check to obj_cgroup_put()
9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). Move the NULL check in the function, similar to mem_cgroup_put(). The unlikely() NULL check in current_objcg_update() was left alone to avoid dropping the unlikey() annotation as this a fast path. Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
52ccdde16b |
mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio()
When I did memory failure tests recently, below warning occurs:
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 8 PID: 1011 at kernel/locking/lockdep.c:232 __lock_acquire+0xccb/0x1ca0
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
FS: 00007ff9f32aa740(0000) GS:ffffa1ce5fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff9f3134ba0 CR3: 00000008484e4000 CR4: 00000000000006f0
Call Trace:
<TASK>
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>
Kernel panic - not syncing: kernel: panic_on_warn set ...
CPU: 8 PID: 1011 Comm: bash Kdump: loaded Not tainted 6.9.0-rc3-next-20240410-00012-gdb69f219f4be #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
panic+0x326/0x350
check_panic_on_warn+0x4f/0x50
__warn+0x98/0x190
report_bug+0x18e/0x1a0
handle_bug+0x3d/0x70
exc_invalid_op+0x18/0x70
asm_exc_invalid_op+0x1a/0x20
RIP: 0010:__lock_acquire+0xccb/0x1ca0
RSP: 0018:ffffa7a1c7fe3bd0 EFLAGS: 00000082
RAX: 0000000000000000 RBX: eb851eb853975fcf RCX: ffffa1ce5fc1c9c8
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffffa1ce5fc1c9c0
RBP: ffffa1c6865d3280 R08: ffffffffb0f570a8 R09: 0000000000009ffb
R10: 0000000000000286 R11: ffffffffb0f2ad50 R12: ffffa1c6865d3d10
R13: ffffa1c6865d3c70 R14: 0000000000000000 R15: 0000000000000004
lock_acquire+0xbe/0x2d0
_raw_spin_lock_irqsave+0x3a/0x60
hugepage_subpool_put_pages.part.0+0xe/0xc0
free_huge_folio+0x253/0x3f0
dissolve_free_huge_page+0x147/0x210
__page_handle_poison+0x9/0x70
memory_failure+0x4e6/0x8c0
hard_offline_page_store+0x55/0xa0
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xbc/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff9f3114887
RSP: 002b:00007ffecbacb458 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007ff9f3114887
RDX: 000000000000000c RSI: 0000564494164e10 RDI: 0000000000000001
RBP: 0000564494164e10 R08: 00007ff9f31d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007ff9f321b780 R14: 00007ff9f3217600 R15: 00007ff9f3216a00
</TASK>
After git bisecting and digging into the code, I believe the root cause is
that _deferred_list field of folio is unioned with _hugetlb_subpool field.
In __update_and_free_hugetlb_folio(), folio->_deferred_list is
initialized leading to corrupted folio->_hugetlb_subpool when folio is
hugetlb. Later free_huge_folio() will use _hugetlb_subpool and above
warning happens.
But it is assumed hugetlb flag must have been cleared when calling
folio_put() in update_and_free_hugetlb_folio(). This assumption is broken
due to below race:
CPU1 CPU2
dissolve_free_huge_page update_and_free_pages_bulk
update_and_free_hugetlb_folio hugetlb_vmemmap_restore_folios
folio_clear_hugetlb_vmemmap_optimized
clear_flag = folio_test_hugetlb_vmemmap_optimized
if (clear_flag) <-- False, it's already cleared.
__folio_clear_hugetlb(folio) <-- Hugetlb is not cleared.
folio_put
free_huge_folio <-- free_the_page is expected.
list_for_each_entry()
__folio_clear_hugetlb <-- Too late.
Fix this issue by checking whether folio is hugetlb directly instead of
checking clear_flag to close the race window.
Link: https://lkml.kernel.org/r/20240419085819.1901645-1-linmiaohe@huawei.com
Fixes:
|
||
|
|
37641efaa3 |
hugetlb: check for anon_vma prior to folio allocation
Commit |
||
|
|
682886ec69 |
mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
Christian reports a NULL deref in zswap that he bisected down to the zswap
shrinker. The issue also cropped up in the bug trackers of libguestfs [1]
and the Red Hat bugzilla [2].
The problem is that when memcg is disabled with the boot time flag, the
zswap shrinker might get called with sc->memcg == NULL. This is okay in
many places, like the lruvec operations. But it crashes in
memcg_page_state() - which is only used due to the non-node accounting of
cgroup's the zswap memory to begin with.
Nhat spotted that the memcg can be NULL in the memcg-disabled case, and I
was then able to reproduce the crash locally as well.
[1] https://github.com/libguestfs/libguestfs/issues/139
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
Link: https://lkml.kernel.org/r/20240418124043.GC1055428@cmpxchg.org
Link: https://lkml.kernel.org/r/20240417143324.GA1055428@cmpxchg.org
Fixes:
|
||
|
|
d99e3140a4 |
mm: turn folio_test_hugetlb into a PageType
The current folio_test_hugetlb() can be fooled by a concurrent folio split into returning true for a folio which has never belonged to hugetlbfs. This can't happen if the caller holds a refcount on it, but we have a few places (memory-failure, compaction, procfs) which do not and should not take a speculative reference. Since hugetlb pages do not use individual page mapcounts (they are always fully mapped and use the entire_mapcount field to record the number of mappings), the PageType field is available now that page_mapcount() ignores the value in this field. In compaction and with CONFIG_DEBUG_VM enabled, the current implementation can result in an oops, as reported by Luis. This happens since |
||
|
|
b76b46902c |
mm/hugetlb: fix missing hugetlb_lock for resv uncharge
There is a recent report on UFFDIO_COPY over hugetlb:
https://lore.kernel.org/all/000000000000ee06de0616177560@google.com/
350: lockdep_assert_held(&hugetlb_lock);
Should be an issue in hugetlb but triggered in an userfault context, where
it goes into the unlikely path where two threads modifying the resv map
together. Mike has a fix in that path for resv uncharge but it looks like
the locking criteria was overlooked: hugetlb_cgroup_uncharge_folio_rsvd()
will update the cgroup pointer, so it requires to be called with the lock
held.
Link: https://lkml.kernel.org/r/20240417211836.2742593-3-peterx@redhat.com
Fixes:
|
||
|
|
b413f9cd4c |
mm: Update shuffle documentation to match its current state
Commit
|
||
|
|
b3d8a8e870 |
slub: use count_partial_free_approx() in slab_out_of_memory()
slab_out_of_memory() uses count_partial() to get the exact count of free objects for each node. As it may get called in the slab allocation path, count_partial_free_approx() can be used to avoid the risk and overhead of traversing a long partial slab list. At the same time, show_slab_objects() still uses count_partial(). Thus, slub users can still have the option to access the exact count of objects via sysfs if the overhead is acceptable to them. Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
046f4c6909 |
slub: introduce count_partial_free_approx()
When reading "/proc/slabinfo", the kernel needs to report the number of free objects for each kmem_cache. The current implementation uses count_partial() to get it by scanning each kmem_cache_node's partial slab list and summing free objects from every partial slab. This process must hold per-kmem_cache_node spinlock and disable IRQ, and may take a long time. Consequently, it can block slab allocations on other CPUs and cause timeouts for network devices, when the partial list is long. In production, even NMI watchdog can be triggered due to this matter: e.g., for "buffer_head", the number of partial slabs was observed to be ~1M in one kmem_cache_node. This problem was also confirmed by others [1-3]. Iterating a partial list to get the exact count of objects can cause soft lockups for a long list with or without the lock (e.g., if preemption is disabled), and may not be very useful: the object count can change after the lock is released. The approach of maintaining free-object counters requires atomic operations on the fast path [3]. So, the fix is to introduce count_partial_free_approx(). This function can be used for getting the free object count in a kmem_cache_node's partial list. It limits the number of slabs to scan and avoids scanning the whole list by giving an approximation for a long list. Suppose the limit is N. If the list's length is not greater than N, output the exact count by traversing the list; if its length is greater than N, output an approximated count by traversing a subset of the list. The proposed method is to scan N/2 slabs from the list's head and N/2 slabs from the tail. For a partial list with ~280K slabs, benchmarks show that it performs better than just counting from the list's head, after slabs get sorted by kmem_cache_shrink(). Default the limit to 10000, as it produces an approximation within 1% of the exact count for both scenarios. Then, use count_partial_free_approx() in get_slabinfo(). Benchmarks: Diff = (exact - approximated) / exact * Normal case (w/o kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 0.43 % | 1.09 % | | 5000 | 0.06 % | 0.37 % | | 10000 | 0.02 % | 0.16 % | | 20000 | 0.009 % | -0.003 % | * Skewed case (w/ kmem_cache_shrink()): | MAX_TO_SCAN | Diff (count from head)| Diff (count head+tail)| | 1000 | 12.46 % | 6.75 % | | 5000 | 5.38 % | 1.27 % | | 10000 | 4.99 % | 0.22 % | | 20000 | 4.86 % | -0.06 % | [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.21.2003031602460.1537@www.lameter.com/T/ [2] https://lore.kernel.org/lkml/alpine.DEB.2.22.394.2008071258020.55871@www.lameter.com/T/ [3] https://lore.kernel.org/lkml/1e01092b-140d-2bab-aeba-321a74a194ee@linux.com/T/ Signed-off-by: Jianfeng Wang <jianfeng.w.wang@oracle.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
a96cb3bf39 |
Merge x86 bugfixes from Linux 6.9-rc3
Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
90a7592da1 |
mm/userfaultfd: Do not place zeropages when zeropages are disallowed
s390x must disable shared zeropages for processes running VMs, because the VMs could end up making use of "storage keys" or protected virtualization, which are incompatible with shared zeropages. Yet, with userfaultfd it is possible to insert shared zeropages into such processes. Let's fallback to simply allocating a fresh zeroed anonymous folio and insert that instead. mm_forbids_zeropage() was introduced in commit |
||
|
|
193feb69af
|
Merge patch series 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org
Pull shmem_rename2() offset fixes from Chuck Lever: The existing code in shmem_rename2() allocates a fresh directory offset value when renaming over an existing destination entry. User space does not expect this behavior. In particular, applications that rename while walking a directory can loop indefinitely because they never reach the end of the directory. * 'Fix shmem_rename2 directory offset calculation' of https://lore.kernel.org/r/20240415152057.4605-1-cel@kernel.org: (3 commits) shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() fs/libfs.c | 55 +++++++++++++++++++++++++++++++++++++++++----- include/linux/fs.h | 2 ++ mm/shmem.c | 3 +-- 3 files changed, 52 insertions(+), 8 deletions(-) Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
5a1a25be99
|
libfs: Add simple_offset_rename() API
I'm about to fix a tmpfs rename bug that requires the use of internal simple_offset helpers that are not available in mm/shmem.c Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20240415152057.4605-3-cel@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
1f737846aa |
mm/shmem: inline shmem_is_huge() for disabled transparent hugepages
In order to minimize code size (CONFIG_CC_OPTIMIZE_FOR_SIZE=y),
compiler might choose to make a regular function call (out-of-line) for
shmem_is_huge() instead of inlining it. When transparent hugepages are
disabled (CONFIG_TRANSPARENT_HUGEPAGE=n), it can cause compilation
error.
mm/shmem.c: In function `shmem_getattr':
./include/linux/huge_mm.h:383:27: note: in expansion of macro `BUILD_BUG'
383 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
| ^~~~~~~~~
mm/shmem.c:1148:33: note: in expansion of macro `HPAGE_PMD_SIZE'
1148 | stat->blksize = HPAGE_PMD_SIZE;
To prevent the possible error, always inline shmem_is_huge() when
transparent hugepages are disabled.
Link: https://lkml.kernel.org/r/20240409155407.2322714-1-sumanthk@linux.ibm.com
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0b2cf0a45e |
mm,page_owner: defer enablement of static branch
Kefeng Wang reported that he was seeing some memory leaks with kmemleak
with page_owner enabled.
The reason is that we enable the page_owner_inited static branch and then
proceed with the linking of stack_list struct to dummy_stack, which means
that exists a race window between these two steps where we can have pages
already being allocated calling add_stack_record_to_list(), allocating
objects and linking them to stack_list, but then we set stack_list
pointing to dummy_stack in init_page_owner. Which means that the objects
that have been allocated during that time window are unreferenced and
lost.
Fix this by deferring the enablement of the branch until we have properly
set up the list.
Link: https://lkml.kernel.org/r/20240409131715.13632-1-osalvador@suse.de
Fixes:
|
||
|
|
1983184c22 |
mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
When I did hard offline test with hugetlb pages, below deadlock occurs: ====================================================== WARNING: possible circular locking dependency detected 6.8.0-11409-gf6cef5f8c37f #1 Not tainted ------------------------------------------------------ bash/46904 is trying to acquire lock: ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60 but task is already holding lock: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (pcp_batch_high_lock){+.+.}-{3:3}: __mutex_lock+0x6c/0x770 page_alloc_cpu_online+0x3c/0x70 cpuhp_invoke_callback+0x397/0x5f0 __cpuhp_invoke_callback_range+0x71/0xe0 _cpu_up+0xeb/0x210 cpu_up+0x91/0xe0 cpuhp_bringup_mask+0x49/0xb0 bringup_nonboot_cpus+0xb7/0xe0 smp_init+0x25/0xa0 kernel_init_freeable+0x15f/0x3e0 kernel_init+0x15/0x1b0 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1298/0x1cd0 lock_acquire+0xc0/0x2b0 cpus_read_lock+0x2a/0xc0 static_key_slow_dec+0x16/0x60 __hugetlb_vmemmap_restore_folio+0x1b9/0x200 dissolve_free_huge_page+0x211/0x260 __page_handle_poison+0x45/0xc0 memory_failure+0x65e/0xc70 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xca/0x1e0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(pcp_batch_high_lock); lock(cpu_hotplug_lock); lock(pcp_batch_high_lock); rlock(cpu_hotplug_lock); *** DEADLOCK *** 5 locks held by bash/46904: #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0 #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0 #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0 #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70 #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40 stack backtrace: CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 check_noncircular+0x129/0x140 __lock_acquire+0x1298/0x1cd0 lock_acquire+0xc0/0x2b0 cpus_read_lock+0x2a/0xc0 static_key_slow_dec+0x16/0x60 __hugetlb_vmemmap_restore_folio+0x1b9/0x200 dissolve_free_huge_page+0x211/0x260 __page_handle_poison+0x45/0xc0 memory_failure+0x65e/0xc70 hard_offline_page_store+0x55/0xa0 kernfs_fop_write_iter+0x12c/0x1d0 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xca/0x1e0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 RIP: 0033:0x7fc862314887 Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887 RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001 RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00 In short, below scene breaks the lock dependency chain: memory_failure __page_handle_poison zone_pcp_disable -- lock(pcp_batch_high_lock) dissolve_free_huge_page __hugetlb_vmemmap_restore_folio static_key_slow_dec cpus_read_lock -- rlock(cpu_hotplug_lock) Fix this by calling drain_all_pages() instead. This issue won't occur until commit |
||
|
|
c5977c95df |
mm/userfaultfd: allow hugetlb change protection upon poison entry
After UFFDIO_POISON, there can be two kinds of hugetlb pte markers, either
the POISON one or UFFD_WP one.
Allow change protection to run on a poisoned marker just like !hugetlb
cases, ignoring the marker irrelevant of the permission.
Here the two bits are mutual exclusive. For example, when install a
poisoned entry it must not be UFFD_WP already (by checking pte_none()
before such install). And it also means if UFFD_WP is set there must have
no POISON bit set. It makes sense because UFFD_WP is a bit to reflect
permission, and permissions do not apply if the pte is poisoned and
destined to sigbus.
So here we simply check uffd_wp bit set first, do nothing otherwise.
Attach the Fixes to UFFDIO_POISON work, as before that it should not be
possible to have poison entry for hugetlb (e.g., hugetlb doesn't do swap,
so no chance of swapin errors).
Link: https://lkml.kernel.org/r/20240405231920.1772199-1-peterx@redhat.com
Link: https://lore.kernel.org/r/000000000000920d5e0615602dd1@google.com
Fixes:
|
||
|
|
7401745801 |
mm,page_owner: fix printing of stack records
When seq_* code sees that its buffer overflowed, it re-allocates a bigger
onecand calls seq_operations->start() callback again. stack_start()
naively though that if it got called again, it meant that the old record
got already printed so it returned the next object, but that is not true.
The consequence of that is that every time stack_stop() -> stack_start()
get called because we needed a bigger buffer, stack_start() will skip
entries, and those will not be printed.
Fix it by not advancing to the next object in stack_start().
Link: https://lkml.kernel.org/r/20240404070702.2744-5-osalvador@suse.de
Fixes:
|
||
|
|
718b1f3373 |
mm,page_owner: fix accounting of pages when migrating
Upon migration, new allocated pages are being given the handle of the old
pages. This is problematic because it means that for the stack which
allocated the old page, we will be substracting the old page + the new one
when that page is freed, creating an accounting imbalance.
There is an interest in keeping it that way, as otherwise the output will
biased towards migration stacks should those operations occur often, but
that is not really helpful.
The link from the new page to the old stack is being performed by calling
__update_page_owner_handle() in __folio_copy_owner(). The only thing that
is left is to link the migrate stack to the old page, so the old page will
be subtracted from the migrate stack, avoiding by doing so any possible
imbalance.
Link: https://lkml.kernel.org/r/20240404070702.2744-4-osalvador@suse.de
Fixes:
|
||
|
|
f5c12105c1 |
mm,page_owner: fix refcount imbalance
Current code does not contemplate scenarios were an allocation and free
operation on the same pages do not handle it in the same amount at once.
To give an example, page_alloc_exact(), where we will allocate a page of
enough order to stafisfy the size request, but we will free the remainings
right away.
In the above example, we will increment the stack_record refcount only
once, but we will decrease it the same number of times as number of unused
pages we have to free. This will lead to a warning because of refcount
imbalance.
Fix this by recording the number of base pages in the refcount field.
Link: https://lkml.kernel.org/r/20240404070702.2744-3-osalvador@suse.de
Reported-by: syzbot+41bbfdb8d41003d12c0f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/00000000000090e8ff0613eda0e5@google.com
Fixes:
|
||
|
|
ea4b5b33bf |
mm,page_owner: update metadata for tail pages
Patch series "page_owner: Fix refcount imbalance and print fixup", v4. This series consists of a refactoring/correctness of updating the metadata of tail pages, a couple of fixups for the refcounting part and a fixup for the stack_start() function. From this series on, instead of counting the stacks, we count the outstanding nr_base_pages each stack has, which gives us a much better memory overview. The other fixup is for the migration part. A more detailed explanation can be found in the changelog of the respective patches. This patch (of 4): __set_page_owner_handle() and __reset_page_owner() update the metadata of all pages when the page is of a higher-order, but we miss to do the same when the pages are migrated. __folio_copy_owner() only updates the metadata of the head page, meaning that the information stored in the first page and the tail pages will not match. Strictly speaking that is not a big problem because 1) we do not print tail pages and 2) upon splitting all tail pages will inherit the metadata of the head page, but it is better to have all metadata in check should there be any problem, so it can ease debugging. For that purpose, a couple of helpers are created __update_page_owner_handle() which updates the metadata on allocation, and __update_page_owner_free_handle() which does the same when the page is freed. __folio_copy_owner() will make use of both as it needs to entirely replace the page_owner metadata for the new page. Link: https://lkml.kernel.org/r/20240404070702.2744-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240404070702.2744-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
c0205eaf3a |
userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE
Commit |
||
|
|
631426ba1d |
mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
Darrick reports that in some cases where pread() would fail with -EIO and
mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ /
MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT.
While the madvise() call can be interrupted by a signal, this is not the
desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave
like page faults in that case: fail and not retry forever.
A reproducer can be found at [1].
The reason is that __get_user_pages(), as called by
faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way:
it will simply return 0 when VM_FAULT_RETRY happened, making
madvise_populate()->faultin_vma_page_range() retry again and again, never
setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages().
__get_user_pages_locked() does what we want, but duplicating that logic in
faultin_vma_page_range() feels wrong.
So let's use __get_user_pages_locked() instead, that will detect
VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler
return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT
from faultin_page() to __get_user_pages(), all the way to
madvise_populate().
But, there is an issue: __get_user_pages_locked() will end up re-taking
the MM lock and then __get_user_pages() will do another VMA lookup. In
the meantime, the VMA layout could have changed and we'd fail with
different error codes than we'd want to.
As __get_user_pages() will currently do a new VMA lookup either way, let
it do the VMA handling in a different way, controlled by a new
FOLL_MADV_POPULATE flag, effectively moving these checks from
madvise_populate() + faultin_page_range() in there.
With this change, Darricks reproducer properly fails with -EFAULT, as
documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE.
[1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/
Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com
Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com
Fixes:
|
||
|
|
5b15f3fb89 |
slub: Set __GFP_COMP in kmem_cache by default
Now the __GFP_COMP is set only if the higher-order is not 0. However, __GFP_COMP flag can be set unconditionally because compound page can not be created in the order-0 case. And this can also simplify the code a bit (no need to check the order is 0 or not). Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
62346c6cb2 |
mm: add nommu variant of vm_insert_pages()
An identical one exists for vm_insert_page(), add one for vm_insert_pages() to avoid needing to check for CONFIG_MMU in code using it. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
bd3520a93a |
iommu: observability of the IOMMU allocations
Add NR_IOMMU_PAGES into node_stat_item that counts number of pages that are allocated by the IOMMU subsystem. The allocations can be view per-node via: /sys/devices/system/node/nodeN/vmstat. For example: $ grep iommu /sys/devices/system/node/node*/vmstat /sys/devices/system/node/node0/vmstat:nr_iommu_pages 106025 /sys/devices/system/node/node1/vmstat:nr_iommu_pages 3464 The value is in page-count, therefore, in the above example the iommu allocations amount to ~428M. Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20240413002522.1101315-11-pasha.tatashin@soleen.com Signed-off-by: Joerg Roedel <jroedel@suse.de> |
||
|
|
f7842747d1 |
mm: replace set_pte_at_notify() with just set_pte_at()
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
997308f9ae |
mmu_notifier: remove the .change_pte() callback
The scope of set_pte_at_notify() has reduced more and more through the
years. Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}(). However,
that has not been so for over ten years. During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.
Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed. For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20240405115815.3226315-4-pbonzini@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
||
|
|
5aa5c7b9a0 |
mm/slub: remove duplicate initialization for early_kmem_cache_node_alloc()
The struct track for every object in a new slab is already set up by new_slab(), so remove the duplicate initialization in early_kmem_cache_node_alloc(). Co-developed-by: Hyunmin Lee <hyunminlr@gmail.com> Signed-off-by: Hyunmin Lee <hyunminlr@gmail.com> Co-developed-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Jeungwoo Yoo <casionwoo@gmail.com> Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
210a03c9d5
|
fs: claw back a few FMODE_* bits
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
04c35ab3bd |
x86/mm/pat: fix VM_PAT handling in COW mappings
PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.
Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().
In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.
To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.
We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.
For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.
Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():
<--- C reproducer --->
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>
int main(void)
{
struct io_uring_params p = {};
int ring_fd;
size_t size;
char *map;
ring_fd = io_uring_setup(1, &p);
if (ring_fd < 0) {
perror("io_uring_setup");
return 1;
}
size = p.sq_off.array + p.sq_entries * sizeof(unsigned);
/* Map the submission queue ring MAP_PRIVATE */
map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
ring_fd, IORING_OFF_SQ_RING);
if (map == MAP_FAILED) {
perror("mmap");
return 1;
}
/* We have at least one page. Let's COW it. */
*map = 0;
pause();
return 0;
}
<--- C reproducer --->
On a system with 16 GiB RAM and swap configured:
# ./iouring &
# memhog 16G
# killall iouring
[ 301.552930] ------------[ cut here ]------------
[ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[ 301.565725] PKRU: 55555554
[ 301.565944] Call Trace:
[ 301.566148] <TASK>
[ 301.566325] ? untrack_pfn+0xf4/0x100
[ 301.566618] ? __warn+0x81/0x130
[ 301.566876] ? untrack_pfn+0xf4/0x100
[ 301.567163] ? report_bug+0x171/0x1a0
[ 301.567466] ? handle_bug+0x3c/0x80
[ 301.567743] ? exc_invalid_op+0x17/0x70
[ 301.568038] ? asm_exc_invalid_op+0x1a/0x20
[ 301.568363] ? untrack_pfn+0xf4/0x100
[ 301.568660] ? untrack_pfn+0x65/0x100
[ 301.568947] unmap_single_vma+0xa6/0xe0
[ 301.569247] unmap_vmas+0xb5/0x190
[ 301.569532] exit_mmap+0xec/0x340
[ 301.569801] __mmput+0x3e/0x130
[ 301.570051] do_exit+0x305/0xaf0
...
Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes:
|
||
|
|
fc2c22693c |
mm: vmalloc: fix lockdep warning
A lockdep reports a possible deadlock in the find_vmap_area_exceed_addr_lock()
function:
============================================
WARNING: possible recursive locking detected
6.9.0-rc1-00060-ged3ccc57b108-dirty #6140 Not tainted
--------------------------------------------
drgn/455 is trying to acquire lock:
ffff0000c00131d0 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124
but task is already holding lock:
ffff0000c0011878 (&vn->busy.lock/1){+.+.}-{2:2}, at: find_vmap_area_exceed_addr_lock+0x64/0x124
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&vn->busy.lock/1);
lock(&vn->busy.lock/1);
*** DEADLOCK ***
indeed it can happen if the find_vmap_area_exceed_addr_lock() gets called
concurrently because it tries to acquire two nodes locks. It was done to
prevent removing a lowest VA found on a previous step.
To address this a lowest VA is found first without holding a node lock
where it resides. As a last step we check if a VA still there because it
can go away, if removed, proceed with next lowest.
[akpm@linux-foundation.org: fix comment typos, per Baoquan]
Link: https://lkml.kernel.org/r/20240328140330.4747-1-urezki@gmail.com
Fixes:
|
||
|
|
4ed91fa917 |
mm: vmalloc: bail out early in find_vmap_area() if vmap is not init
During the boot the s390 system triggers "spinlock bad magic" messages
if the spinlock debugging is enabled:
[ 0.465445] BUG: spinlock bad magic on CPU#0, swapper/0
[ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[ 0.466270] Call Trace:
[ 0.466470] [<00000000011f26c8>] dump_stack_lvl+0x98/0xd8
[ 0.466516] [<00000000001dcc6a>] do_raw_spin_lock+0x8a/0x108
[ 0.466545] [<000000000042146c>] find_vmap_area+0x6c/0x108
[ 0.466572] [<000000000042175a>] find_vm_area+0x22/0x40
[ 0.466597] [<000000000012f152>] __set_memory+0x132/0x150
[ 0.466624] [<0000000001cc0398>] vmem_map_init+0x40/0x118
[ 0.466651] [<0000000001cc0092>] paging_init+0x22/0x68
[ 0.466677] [<0000000001cbbed2>] setup_arch+0x52a/0x708
[ 0.466702] [<0000000001cb6140>] start_kernel+0x80/0x5c8
[ 0.466727] [<0000000000100036>] startup_continue+0x36/0x40
it happens because such system tries to access some vmap areas
whereas the vmalloc initialization is not even yet done:
[ 0.465490] lock: single+0x1860/0x1958, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 0.466067] CPU: 0 PID: 0 Comm: swapper Not tainted 6.8.0-12955-g8e938e398669 #1
[ 0.466188] Hardware name: QEMU 8561 QEMU (KVM/Linux)
[ 0.466270] Call Trace:
[ 0.466470] dump_stack_lvl (lib/dump_stack.c:117)
[ 0.466516] do_raw_spin_lock (kernel/locking/spinlock_debug.c:87 kernel/locking/spinlock_debug.c:115)
[ 0.466545] find_vmap_area (mm/vmalloc.c:1059 mm/vmalloc.c:2364)
[ 0.466572] find_vm_area (mm/vmalloc.c:3150)
[ 0.466597] __set_memory (arch/s390/mm/pageattr.c:360 arch/s390/mm/pageattr.c:393)
[ 0.466624] vmem_map_init (./arch/s390/include/asm/set_memory.h:55 arch/s390/mm/vmem.c:660)
[ 0.466651] paging_init (arch/s390/mm/init.c:97)
[ 0.466677] setup_arch (arch/s390/kernel/setup.c:972)
[ 0.466702] start_kernel (init/main.c:899)
[ 0.466727] startup_continue (arch/s390/kernel/head64.S:35)
[ 0.466811] INFO: lockdep is turned off.
...
[ 0.718250] vmalloc init - busy lock init 0000000002871860
[ 0.718328] vmalloc init - busy lock init 00000000028731b8
Some background. It worked before because the lock that is in question
was statically defined and initialized. As of now, the locks and data
structures are initialized in the vmalloc_init() function.
To address that issue add the check whether the "vmap_initialized"
variable is set, if not find_vmap_area() bails out on entry returning NULL.
Link: https://lkml.kernel.org/r/20240323141544.4150-1-urezki@gmail.com
Fixes:
|
||
|
|
b062539c4e |
mm/slub: correct comment in do_slab_free()
slab_alloc_node() should be __slab_alloc_node(). Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
ff99b18fee |
mm/slub: simplify get_partial_node()
The break conditions for filling cpu partial can be more readable and simple. If slub_get_cpu_partial() returns 0, we can confirm that we don't need to fill cpu partial, then we should break from the loop. On the other hand, we also should break from the loop if we have added enough cpu partial slabs. Meanwhile, the logic above gets rid of the #ifdef and also fixes a weird corner case that if we set cpu_partial_slabs to 0 from sysfs, we still allocate at least one here. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
721a2f8be1 |
mm/slub: add slub_get_cpu_partial() helper
Add slub_get_cpu_partial() and dummy function to help improve get_partial_node(). It can help remove #ifdef of CONFIG_SLUB_CPU_PARTIAL and improve filling cpu partial logic. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
acc8f4dbf1 |
mm/slub: remove the check of !kmem_cache_has_cpu_partial()
The check of !kmem_cache_has_cpu_partial(s) with CONFIG_SLUB_CPU_PARTIAL enabled here is always false. We have already checked kmem_cache_debug() earlier and if it was true, then we either continued or broke from the loop so we can't reach this code in that case and don't need to check kmem_cache_debug() as part of kmem_cache_has_cpu_partial() again. Here we can remove it. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
9198ffbd2b |
mm/slub: Reduce memory consumption in extreme scenarios
When kmalloc_node() is called without __GFP_THISNODE and the target node lacks sufficient memory, SLUB allocates a folio from a different node other than the requested node, instead of taking a partial slab from it. However, since the allocated folio does not belong to the requested node, on the following allocation it is deactivated and added to the partial slab list of the node it belongs to. This behavior can result in excessive memory usage when the requested node has insufficient memory, as SLUB will repeatedly allocate folios from other nodes without reusing the previously allocated ones. To prevent memory wastage, when a preferred node is indicated (not NUMA_NO_NODE) but without a prior __GFP_THISNODE constraint: 1) try to get a partial slab from target node only by having __GFP_THISNODE in pc.flags for get_partial() 2) if 1) failed, try to allocate a new slab from target node with GFP_NOWAIT | __GFP_THISNODE opportunistically. 3) if 2) failed, retry with original gfpflags which will allow get_partial() try partial lists of other nodes before potentially allocating new page from other nodes Without a preferred node, or with __GFP_THISNODE constraint, the behavior remains unchanged. On qemu with 4 numa nodes and each numa has 1G memory. Write a test ko to call kmalloc_node(196, GFP_KERNEL, 3) for (4 * 1024 + 4) * 1024 times. cat /proc/slabinfo shows: kmalloc-256 4200530 13519712 256 32 2 : tunables.. after this patch, cat /proc/slabinfo shows: kmalloc-256 4200558 4200768 256 32 2 : tunables.. Signed-off-by: Chen Jun <chenjun102@huawei.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
7e40c2100c |
Kbuild fixes for v6.9
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
- Fix unselectable choice entry in MIPS Kconfig, and forbid this
structure
- Remove unused include/asm-generic/export.h
- Fix a NULL pointer dereference bug in modpost
- Enable -Woverride-init warning consistently with W=1
- Drop KCSAN flags from *.mod.c files
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmYJVK0VHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGf/sP/3GOk//cQGwPyWCgtCEUo6T4yyD7
1m2TTR0JQk/lcohSFtYk0I20rhKRqU6yAMAERmyehI66D2QY7lhiYVc16ram5y04
x0nWxd9IqerIlGJtaWePOvNqKdCw2EP9fS9NKz58rEDMGlsSf0Rd3NEdSsWoH8td
dECtt8yCawENAMStb/rAfsnL6kn2JIhVMyqwo0RdQfiaVT5Zk6Qgpko0Oq0ncRP2
qdNgHbvnJdKMy81FHSBAi0QEZOYvhFNX+E+6lFfWEsX6xT+wvXddCNQzJf/YV3Cw
Klw1tGveV7UGzlZ4fsnFrv4V6g1KO2AD3342efdDo++ypBEBpImVODc+Rp0jE9Nk
OgdOQRe2k9a5keH0LWY0ehvDbQlSbfNxk0wNtAfo5Kk5e41nHmHJBWCwGG+cXrjJ
mPJjSrTpuNVSaGV0kt3EskHbDBeBmIIg+5QPbldmW2qcC88kWoavkyLD3WPFsg/a
CAuR/HqH7MDfxzvsqTCjonlVcyDKX6aW66LrQ1NCtmphI4F8mdKp746CzGlziuIm
gjYJL/UWVlx0VebMo8dwDpaHvez4/4s6xAJcyqtA+TS5HbrQWKQuwFkiv4iWQxNd
MvyVdzgKhcMdoXhfFpUZ0LlFvHGefJ+Z6N1FQLoQJkTirt5aqRbEAjP0VXwQB4eH
zYygkhvvtiH9/STu
=tx+2
-----END PGP SIGNATURE-----
Merge tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Deduplicate Kconfig entries for CONFIG_CXL_PMU
- Fix unselectable choice entry in MIPS Kconfig, and forbid this
structure
- Remove unused include/asm-generic/export.h
- Fix a NULL pointer dereference bug in modpost
- Enable -Woverride-init warning consistently with W=1
- Drop KCSAN flags from *.mod.c files
* tag 'kbuild-fixes-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: Fix typo HEIGTH to HEIGHT
Documentation/llvm: Note s390 LLVM=1 support with LLVM 18.1.0 and newer
kbuild: Disable KCSAN for autogenerated *.mod.c intermediaries
kbuild: make -Woverride-init warnings more consistent
modpost: do not make find_tosym() return NULL
export.h: remove include/asm-generic/export.h
kconfig: do not reparent the menu inside a choice block
MIPS: move unselectable FIT_IMAGE_FDT_EPM5 out of the "System type" choice
cxl: remove CONFIG_CXL_PMU entry in drivers/cxl/Kconfig
|
||
|
|
c40845e319 |
kbuild: make -Woverride-init warnings more consistent
The -Woverride-init warn about code that may be intentional or not,
but the inintentional ones tend to be real bugs, so there is a bit of
disagreement on whether this warning option should be enabled by default
and we have multiple settings in scripts/Makefile.extrawarn as well as
individual subsystems.
Older versions of clang only supported -Wno-initializer-overrides with
the same meaning as gcc's -Woverride-init, though all supported versions
now work with both. Because of this difference, an earlier cleanup of
mine accidentally turned the clang warning off for W=1 builds and only
left it on for W=2, while it's still enabled for gcc with W=1.
There is also one driver that only turns the warning off for newer
versions of gcc but not other compilers, and some but not all the
Makefiles still use a cc-disable-warning conditional that is no
longer needed with supported compilers here.
Address all of the above by removing the special cases for clang
and always turning the warning off unconditionally where it got
in the way, using the syntax that is supported by both compilers.
Fixes:
|
||
|
|
1096bc93df |
mm: clean up populate_vma_page_range() FOLL_* flag handling
The code wasn't exactly wrong, but it was very odd, and it used FOLL_FORCE together with FOLL_WRITE when it really didn't need to (it only set FOLL_WRITE for writable mappings, so then the FOLL_FORCE was pointless). It also pointlessly called __get_user_pages() even when it knew it wouldn't populate anything because the vma wasn't accessible and it explicitly tested for and did *not* set FOLL_FORCE for inaccessible vma's. This code does need to use FOLL_FORCE, because we want to do fault in writable shared mappings, but then the mapping may not actually be readable. And we don't want to use FOLL_WRITE (which would match the permission of the vma), because that would also dirty the pages, which we don't want to do. For very similar reasons, FOLL_FORCE populates a executable-only mapping with no read permissions. We don't have a FOLL_EXEC flag. Yes, it would probably be cleaner to split FOLL_WRITE into two bits (for separate permission and dirty bit handling), and add a FOLL_EXEC flag for the "GUP executable page" case. That would allow us to avoid FOLL_FORCE entirely here. But that's not how our FOLL_xyz bits have traditionally worked, and that would be a much bigger patch. So this at least avoids the FOLL_FORCE | FOLL_WRITE combination that made one of my experimental validation patches trigger a warning. That warning was a false positive (and my experimental patch was incomplete anyway), but it all made me look at this and decide to clean at least this small case up. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
25cd241408 |
mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
Zhongkun He reports data corruption when combining zswap with zram.
The issue is the exclusive loads we're doing in zswap. They assume
that all reads are going into the swapcache, which can assume
authoritative ownership of the data and so the zswap copy can go.
However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try to
bypass the swapcache. This results in an optimistic read of the swap data
into a page that will be dismissed if the fault fails due to races. In
this case, zswap mustn't drop its authoritative copy.
Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
Fixes:
|
||
|
|
30af24facf |
userfaultfd: fix deadlock warning when locking src and dst VMAs
Use down_read_nested() to avoid the warning.
Link: https://lkml.kernel.org/r/20240321235818.125118-1-lokeshgidra@google.com
Fixes:
|
||
|
|
0a69b6b3a0 |
tmpfs: fix race on handling dquot rbtree
A syzkaller reproducer found a race while attempting to remove dquot
information from the rb tree.
Fetching the rb_tree root node must also be protected by the
dqopt->dqio_sem, otherwise, giving the right timing, shmem_release_dquot()
will trigger a warning because it couldn't find a node in the tree, when
the real reason was the root node changing before the search starts:
Thread 1 Thread 2
- shmem_release_dquot() - shmem_{acquire,release}_dquot()
- fetch ROOT - Fetch ROOT
- acquire dqio_sem
- wait dqio_sem
- do something, triger a tree rebalance
- release dqio_sem
- acquire dqio_sem
- start searching for the node, but
from the wrong location, missing
the node, and triggering a warning.
Link: https://lkml.kernel.org/r/20240320124011.398847-1-cem@kernel.org
Fixes:
|
||
|
|
30fb6a8d9e |
mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion
Kent forwards this bug report of zswap re-entering the block layer
from an IO request allocation and locking up:
[10264.128242] sysrq: Show Blocked State
[10264.128268] task:kworker/20:0H state:D stack:0 pid:143 tgid:143 ppid:2 flags:0x00004000
[10264.128271] Workqueue: bcachefs_io btree_write_submit [bcachefs]
[10264.128295] Call Trace:
[10264.128295] <TASK>
[10264.128297] __schedule+0x3e6/0x1520
[10264.128303] schedule+0x32/0xd0
[10264.128304] schedule_timeout+0x98/0x160
[10264.128308] io_schedule_timeout+0x50/0x80
[10264.128309] wait_for_completion_io_timeout+0x7f/0x180
[10264.128310] submit_bio_wait+0x78/0xb0
[10264.128313] swap_writepage_bdev_sync+0xf6/0x150
[10264.128317] zswap_writeback_entry+0xf2/0x180
[10264.128319] shrink_memcg_cb+0xe7/0x2f0
[10264.128322] __list_lru_walk_one+0xb9/0x1d0
[10264.128325] list_lru_walk_one+0x5d/0x90
[10264.128326] zswap_shrinker_scan+0xc4/0x130
[10264.128327] do_shrink_slab+0x13f/0x360
[10264.128328] shrink_slab+0x28e/0x3c0
[10264.128329] shrink_one+0x123/0x1b0
[10264.128331] shrink_node+0x97e/0xbc0
[10264.128332] do_try_to_free_pages+0xe7/0x5b0
[10264.128333] try_to_free_pages+0xe1/0x200
[10264.128334] __alloc_pages_slowpath.constprop.0+0x343/0xde0
[10264.128337] __alloc_pages+0x32d/0x350
[10264.128338] allocate_slab+0x400/0x460
[10264.128339] ___slab_alloc+0x40d/0xa40
[10264.128345] kmem_cache_alloc+0x2e7/0x330
[10264.128348] mempool_alloc+0x86/0x1b0
[10264.128349] bio_alloc_bioset+0x200/0x4f0
[10264.128352] bio_alloc_clone+0x23/0x60
[10264.128354] alloc_io+0x26/0xf0 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128361] dm_submit_bio+0xb8/0x580 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128366] __submit_bio+0xb0/0x170
[10264.128367] submit_bio_noacct_nocheck+0x159/0x370
[10264.128368] bch2_submit_wbio_replicas+0x21c/0x3a0 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128391] btree_write_submit+0x1cf/0x220 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128406] process_one_work+0x178/0x350
[10264.128408] worker_thread+0x30f/0x450
[10264.128409] kthread+0xe5/0x120
The zswap shrinker resumes the swap_writepage()s that were intercepted
by the zswap store. This will enter the block layer, and may even
enter the filesystem depending on the swap backing file.
Make it respect GFP_NOIO and GFP_NOFS.
Link: https://lore.kernel.org/linux-mm/rc4pk2r42oyvjo4dc62z6sovquyllq56i5cdgcaqbd7wy3hfzr@n4nbxido3fme/
Link: https://lkml.kernel.org/r/20240321182532.60000-1-hannes@cmpxchg.org
Fixes:
|
||
|
|
9c500835f2 |
mm: zswap: fix kernel BUG in sg_init_one
sg_init_one() relies on linearly mapped low memory for the safe utilization of virt_to_page(). Otherwise, we trigger a kernel BUG, kernel BUG at include/linux/scatterlist.h:187! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM Modules linked in: CPU: 0 PID: 2997 Comm: syz-executor198 Not tainted 6.8.0-syzkaller #0 Hardware name: ARM-Versatile Express PC is at sg_set_buf include/linux/scatterlist.h:187 [inline] PC is at sg_init_one+0x9c/0xa8 lib/scatterlist.c:143 LR is at sg_init_table+0x2c/0x40 lib/scatterlist.c:128 Backtrace: [<807e16ac>] (sg_init_one) from [<804c1824>] (zswap_decompress+0xbc/0x208 mm/zswap.c:1089) r7:83471c80 r6:def6d08c r5:844847d0 r4:ff7e7ef4 [<804c1768>] (zswap_decompress) from [<804c4468>] (zswap_load+0x15c/0x198 mm/zswap.c:1637) r9:8446eb80 r8:8446eb80 r7:8446eb84 r6:def6d08c r5:00000001 r4:844847d0 [<804c430c>] (zswap_load) from [<804b9644>] (swap_read_folio+0xa8/0x498 mm/page_io.c:518) r9:844ac800 r8:835e6c00 r7:00000000 r6:df955d4c r5:00000001 r4:def6d08c [<804b959c>] (swap_read_folio) from [<804bb064>] (swap_cluster_readahead+0x1c4/0x34c mm/swap_state.c:684) r10:00000000 r9:00000007 r8:df955d4b r7:00000000 r6:00000000 r5:00100cca r4:00000001 [<804baea0>] (swap_cluster_readahead) from [<804bb3b8>] (swapin_readahead+0x68/0x4a8 mm/swap_state.c:904) r10:df955eb8 r9:00000000 r8:00100cca r7:84476480 r6:00000001 r5:00000000 r4:00000001 [<804bb350>] (swapin_readahead) from [<8047cde0>] (do_swap_page+0x200/0xcc4 mm/memory.c:4046) r10:00000040 r9:00000000 r8:844ac800 r7:84476480 r6:00000001 r5:00000000 r4:df955eb8 [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_pte_fault mm/memory.c:5301 [inline]) [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (__handle_mm_fault mm/memory.c:5439 [inline]) [<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_mm_fault+0x3d8/0x12b8 mm/memory.c:5604) r10:00000040 r9:842b3900 r8:7eb0d000 r7:84476480 r6:7eb0d000 r5:835e6c00 r4:00000254 [<8047e2ec>] (handle_mm_fault) from [<80215d28>] (do_page_fault+0x148/0x3a8 arch/arm/mm/fault.c:326) r10:00000007 r9:842b3900 r8:7eb0d000 r7:00000207 r6:00000254 r5:7eb0d9b4 r4:df955fb0 [<80215be0>] (do_page_fault) from [<80216170>] (do_DataAbort+0x38/0xa8 arch/arm/mm/fault.c:558) r10:7eb0da7c r9:00000000 r8:80215be0 r7:df955fb0 r6:7eb0d9b4 r5:00000207 r4:8261d0e0 [<80216138>] (do_DataAbort) from [<80200e3c>] (__dabt_usr+0x5c/0x60 arch/arm/kernel/entry-armv.S:427) Exception stack(0xdf955fb0 to 0xdf955ff8) 5fa0: 00000000 00000000 22d5f800 0008d158 5fc0: 00000000 7eb0d9a4 00000000 00000109 00000000 00000000 7eb0da7c 7eb0da3c 5fe0: 00000000 7eb0d9a0 00000001 00066bd4 00000010 ffffffff r8:824a9044 r7:835e6c00 r6:ffffffff r5:00000010 r4:00066bd4 Code: 1a000004 e1822003 e8860094 e89da8f0 (e7f001f2) ---[ end trace 0000000000000000 ]--- ---------------- Code disassembly (best guess): 0: 1a000004 bne 0x18 4: e1822003 orr r2, r2, r3 8: e8860094 stm r6, {r2, r4, r7} c: e89da8f0 ldm sp, {r4, r5, r6, r7, fp, sp, pc} * 10: e7f001f2 udf #18 <-- trapping instruction Consequently, we have two choices: either employ kmap_to_page() alongside sg_set_page(), or resort to copying high memory contents to a temporary buffer residing in low memory. However, considering the introduction of the WARN_ON_ONCE in commit |
||
|
|
d5d39c707a |
mm: cachestat: fix two shmem bugs
When cachestat on shmem races with swapping and invalidation, there
are two possible bugs:
1) A swapin error can have resulted in a poisoned swap entry in the
shmem inode's xarray. Calling get_shadow_from_swap_cache() on it
will result in an out-of-bounds access to swapper_spaces[].
Validate the entry with non_swap_entry() before going further.
2) When we find a valid swap entry in the shmem's inode, the shadow
entry in the swapcache might not exist yet: swap IO is still in
progress and we're before __remove_mapping; swapin, invalidation,
or swapoff have removed the shadow from swapcache after we saw the
shmem swap entry.
This will send a NULL to workingset_test_recent(). The latter
purely operates on pointer bits, so it won't crash - node 0, memcg
ID 0, eviction timestamp 0, etc. are all valid inputs - but it's a
bogus test. In theory that could result in a false "recently
evicted" count.
Such a false positive wouldn't be the end of the world. But for
code clarity and (future) robustness, be explicit about this case.
Bail on get_shadow_from_swap_cache() returning NULL.
Link: https://lkml.kernel.org/r/20240315095556.GC581298@cmpxchg.org
Fixes:
|
||
|
|
7844c01472 |
mm,page_owner: fix recursion
Prior to |
||
|
|
f8572367ea |
mm/memory: fix missing pte marker for !page on pte zaps
Commit |
||
|
|
87654cf7a9 |
mm/slub: mark racy accesses on slab->slabs
The reads of slab->slabs are racy because it may be changed by put_cpu_partial concurrently. In slabs_cpu_partial_show() and show_slab_objects(), slab->slabs is only used for showing information. Data-racy reads from shared variables that are used only for diagnostic purposes should typically use data_race(), since it is normally not a problem if the values are off by a little. This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Signed-off-by: linke li <lilinke99@qq.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
ad7c5ebead |
mm/slub: remove dummy slabinfo functions
The SLAB implementation has been removed since 6.8, so there is no other version of slabinfo_show_stats() and slabinfo_write(), then we can remove these two dummy functions. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
c150b809f7 |
RISC-V Patches for the 6.9 Merge Window
* Support for various vector-accelerated crypto routines.
* Hibernation is now enabled for portable kernel builds.
* mmap_rnd_bits_max is larger on systems with larger VAs.
* Support for fast GUP.
* Support for membarrier-based instruction cache synchronization.
* Support for the Andes hart-level interrupt controller and PMU.
* Some cleanups around unaligned access speed probing and Kconfig
settings.
* Support for ACPI LPI and CPPC.
* Various cleanus related to barriers.
* A handful of fixes.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmX9icgTHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYib+UD/4xyL6UMixx6A06BVBL9UT4vOrxRvNr
JIihG5y5QNMjes9DHWL35mZTMqFtQ0tq94ViWFLmJWloV/8KRVM2C9R9KX7vplf3
M/OwvP106spxgvNHoeQbycgs42RU1t2mpqT7N1iK2hCjqieP3vLn6hsSLXWTAG0L
3gQbQw6XCLC3hPyLq+nbFY2i4faeCmpXWmixoy/IvQ5calZQrRU0LNlP6lcMBhVo
uocjG0uGAhrahw2s81jxcMZcxa3AvUCiplapdD5H5v9rBM85SkYJj2Q9SqdSorkb
xzuimRnKPI5s47yM3pTfZY0qnQUYHV7PXXuw4WujpCQVQdhaG+Ggq63UUZA61J9t
IzZK2zdcfHqICrGTtXImUzRT3dcc3oq+IFq4tTY+rEJm29hrXkAtx+qBm5xtMvax
fJz5feJ/iT0u7MDj4Oq24n+Kpl+Olm+MJaZX3m5Ovi/9V6a9iK9HXqxg9/Fs0fMO
+J/0kTgd8Vu9CYH7KNWz3uztcO9eMAH3VyzuXuab4BGj1i1Y/9EjpALQi7rDN73S
OsYQX6NnzMkBV4dvElJVLXiPlvNlMHZZwdak5CqPb48jaJu6iiIZAuvOrG6/naGP
wnQSLVA2WWWoOkl3AJhxfpa11CLhbMl9E2gYm1VtNvASXoSFIxlAq1Yv3sG8yjty
4ZT0rYFJOstYiQ==
=3dL5
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V updates from Palmer Dabbelt:
- Support for various vector-accelerated crypto routines
- Hibernation is now enabled for portable kernel builds
- mmap_rnd_bits_max is larger on systems with larger VAs
- Support for fast GUP
- Support for membarrier-based instruction cache synchronization
- Support for the Andes hart-level interrupt controller and PMU
- Some cleanups around unaligned access speed probing and Kconfig
settings
- Support for ACPI LPI and CPPC
- Various cleanus related to barriers
- A handful of fixes
* tag 'riscv-for-linus-6.9-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (66 commits)
riscv: Fix syscall wrapper for >word-size arguments
crypto: riscv - add vector crypto accelerated AES-CBC-CTS
crypto: riscv - parallelize AES-CBC decryption
riscv: Only flush the mm icache when setting an exec pte
riscv: Use kcalloc() instead of kzalloc()
riscv/barrier: Add missing space after ','
riscv/barrier: Consolidate fence definitions
riscv/barrier: Define RISCV_FULL_BARRIER
riscv/barrier: Define __{mb,rmb,wmb}
RISC-V: defconfig: Enable CONFIG_ACPI_CPPC_CPUFREQ
cpufreq: Move CPPC configs to common Kconfig and add RISC-V
ACPI: RISC-V: Add CPPC driver
ACPI: Enable ACPI_PROCESSOR for RISC-V
ACPI: RISC-V: Add LPI driver
cpuidle: RISC-V: Move few functions to arch/riscv
riscv: Introduce set_compat_task() in asm/compat.h
riscv: Introduce is_compat_thread() into compat.h
riscv: add compile-time test into is_compat_task()
riscv: Replace direct thread flag check with is_compat_task()
riscv: Improve arch_get_mmap_end() macro
...
|
||
|
|
1d35aae78f |
Kbuild updates for v6.9
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
- Use more threads when building Debian packages in parallel
- Fix warnings shown during the RPM kernel package uninstallation
- Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
Makefile
- Support GCC's -fmin-function-alignment flag
- Fix a null pointer dereference bug in modpost
- Add the DTB support to the RPM package
- Various fixes and cleanups in Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmX8HGIVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGYfIQAIl/zEFoNVSHGR4TIvO7SIwkT4MM
VAm0W6XRFaXfIGw8HL/MXe+U9jAyeQ9yL9uUVv8PqFTO+LzBbW1X1X97tlmrlQsC
7mdxbA1KJXwkwt4wH/8/EZQMwHr327vtVH4AilSm+gAaWMXaSKAye3ulKQQ2gevz
vP6aOcfbHIWOPdxA53cLdSl9LOGrYNczKySHXKV9O39T81F+ko7wPpdkiMWw5LWG
ISRCV8bdXli8j10Pmg8jlbevSKl4Z5FG2BVw/Cl8rQ5tBBoCzFsUPnnp9A29G8QP
OqRhbwxtkSm67BMJAYdHnhjp/l0AOEbmetTGpna+R06hirOuXhR3vc6YXZxhQjff
LmKaqfG5YchRALS1fNDsRUNIkQxVJade+tOUG+V4WbxHQKWX7Ghu5EDlt2/x7P0p
+XLPE48HoNQLQOJ+pgIOkaEDl7WLfGhoEtEgprZBuEP2h39xcdbYJyF10ZAAR4UZ
FF6J9lDHbf7v1uqD2YnAQJQ6jJ06CvN6/s6SdiJnCWSs5cYRW0fnYigSIuwAgGHZ
c/QFECoGEflXGGuqZDl5iXiIjhWKzH2nADSVEs7maP47vapcMWb9gA7VBNoOr5M0
IXuFo1khChF4V2pxqlDj3H5TkDlFENYT/Wjh+vvjx8XplKCRKaSh+LaZ39hja61V
dWH7BPecS44h4KXx
=tFdl
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Generate a list of built DTB files (arch/*/boot/dts/dtbs-list)
- Use more threads when building Debian packages in parallel
- Fix warnings shown during the RPM kernel package uninstallation
- Change OBJECT_FILES_NON_STANDARD_*.o etc. to take a relative path to
Makefile
- Support GCC's -fmin-function-alignment flag
- Fix a null pointer dereference bug in modpost
- Add the DTB support to the RPM package
- Various fixes and cleanups in Kconfig
* tag 'kbuild-v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (67 commits)
kconfig: tests: test dependency after shuffling choices
kconfig: tests: add a test for randconfig with dependent choices
kconfig: tests: support KCONFIG_SEED for the randconfig runner
kbuild: rpm-pkg: add dtb files in kernel rpm
kconfig: remove unneeded menu_is_visible() call in conf_write_defconfig()
kconfig: check prompt for choice while parsing
kconfig: lxdialog: remove unused dialog colors
kconfig: lxdialog: fix button color for blackbg theme
modpost: fix null pointer dereference
kbuild: remove GCC's default -Wpacked-bitfield-compat flag
kbuild: unexport abs_srctree and abs_objtree
kbuild: Move -Wenum-{compare-conditional,enum-conversion} into W=1
kconfig: remove named choice support
kconfig: use linked list in get_symbol_str() to iterate over menus
kconfig: link menus to a symbol
kbuild: fix inconsistent indentation in top Makefile
kbuild: Use -fmin-function-alignment when available
alpha: merge two entries for CONFIG_ALPHA_GAMMA
alpha: merge two entries for CONFIG_ALPHA_EV4
kbuild: change DTC_FLAGS_<basetarget>.o to take the path relative to $(obj)
...
|
||
|
|
32a50540c3 |
bcachefs updates for 6.9
- Subvolume children btree; this is needed for providing a userspace
interface for walking subvolumes, which will come later
- Lots of improvements to directory structure checking
- Improved journal pipelining, significantly improving performance on
high iodepth write workloads
- Discard path improvements: the discard path is more efficient, and no
longer flushes the journal unnecessarily
- Buffered write path can now avoid taking the inode lock
- new mm helper: memalloc_flags_{save|restore}
- mempool now does kvmalloc mempools
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXycEcACgkQE6szbY3K
bnYUTg/+K4Nv2EdAqOCyHRTKaF2OgJDUb25ZDmbGpfT1XyPrNB7/+CxHqSdEP7/e
FVuhtP61vnQAImDv82u9iZiab/TnuCZPUrjSobFEvrWYoGRtP9Bm9MyYB28NzmMa
AXGmS4yJGVwtxxrFNxZP98IbiHYiHSoYbkqxX2E5VgLag8Ru8peb7oD0Ro3zw0rb
z+6UM/seJ7on5i/9IJEMKKXFVEoZC2J5DAVoe1TghG2kgOw3cKu5OUdltLPOY5jL
jkm5J5wa6Ep46nufHat92yiMxXIQrf4U9LkXxzTi5ThoSmt+Af2qXcBjqTTVqd2D
1dGxj+UG8iu4DCCbQC6EA7J5EMvxfJM0+9lk1ULUgxUs3X69co6nlI6XH1fwEMqk
KpIqd35+Y/IYgogt9ioXI0dtXyL7dbaTVt6NZhc9SaPGPX+C2V0+l4bqToFdNaPH
0KATjjyQaJRE4ZFIjr6GliYOtKWDLi/HPEyoBivniUn7cF5vjSvti+cSQwNDSPpa
6jOd5Y923Iq9ZqDAPM3+mvTH8nNaaf2T2fmbPNrc5pdWbha9bGwOU71zvKHNFGm/
66ZsnwhKSk+uwglTMZHPKSkJJXUYAHESw3slQtEWHZVlliArc55+pBHwE00bvRt7
KHUUqkqXBUPzbp/kdZGylMAdH9+8j9TE5QJ2RaoryFm/eCfexmI=
=6xnj
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs updates from Kent Overstreet:
- Subvolume children btree; this is needed for providing a userspace
interface for walking subvolumes, which will come later
- Lots of improvements to directory structure checking
- Improved journal pipelining, significantly improving performance on
high iodepth write workloads
- Discard path improvements: the discard path is more efficient, and no
longer flushes the journal unnecessarily
- Buffered write path can now avoid taking the inode lock
- new mm helper: memalloc_flags_{save|restore}
- mempool now does kvmalloc mempools
* tag 'bcachefs-2024-03-13' of https://evilpiepirate.org/git/bcachefs: (128 commits)
bcachefs: time_stats: shrink time_stat_buffer for better alignment
bcachefs: time_stats: split stats-with-quantiles into a separate structure
bcachefs: mean_and_variance: put struct mean_and_variance_weighted on a diet
bcachefs: time_stats: add larger units
bcachefs: pull out time_stats.[ch]
bcachefs: reconstruct_alloc cleanup
bcachefs: fix bch_folio_sector padding
bcachefs: Fix btree key cache coherency during replay
bcachefs: Always flush write buffer in delete_dead_inodes()
bcachefs: Fix order of gc_done passes
bcachefs: fix deletion of indirect extents in btree_gc
bcachefs: Prefer struct_size over open coded arithmetic
bcachefs: Kill unused flags argument to btree_split()
bcachefs: Check for writing superblocks with nonsense member seq fields
bcachefs: fix bch2_journal_buf_to_text()
lib/generic-radix-tree.c: Make nodes more reasonably sized
bcachefs: copy_(to|from)_user_errcode()
bcachefs: Split out bkey_types.h
bcachefs: fix lost journal buf wakeup due to improved pipelining
bcachefs: intercept mountoption value for bool type
...
|
||
|
|
e5eb28f6d1 |
- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min
heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfMnvgAKCRDdBJ7gKXxA jjKMAP4/Upq07D4wjkMVPb+QrkipbbLpdcgJ++q3z6rba4zhPQD+M3SFriIJk/Xh tKVmvihFxfAhdDthseXcIf1nBjMALwY= =8rVc -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. * tag 'mm-nonmm-stable-2024-03-14-09-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) nilfs2: prevent kernel bug at submit_bh_wbc() nilfs2: fix failure to detect DAT corruption in btree and direct mappings ocfs2: enable ocfs2_listxattr for special files ocfs2: remove SLAB_MEM_SPREAD flag usage assoc_array: fix the return value in assoc_array_insert_mid_shortcut() buildid: use kmap_local_page() watchdog/core: remove sysctl handlers from public header nilfs2: use div64_ul() instead of do_div() mul_u64_u64_div_u64: increase precision by conditionally swapping a and b kexec: copy only happens before uchunk goes to zero get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't abuse ksig->info.si_signo and ksig->sig const_structs.checkpatch: add device_type Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() list: leverage list_is_head() for list_entry_is_head() nilfs2: MAINTAINERS: drop unreachable project mirror site smp: make __smp_processor_id() 0-argument macro fat: fix uninitialized field in nostale filehandles ... |
||
|
|
902861e34c |
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is hotplugged
as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving policy
wherein we allocate memory across nodes in a weighted fashion rather
than uniformly. This is beneficial in heterogeneous memory environments
appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the process
has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown situations.
The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's series
"Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page faults.
He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test",
Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
our handling of DAX on archiecctures which have virtually aliasing data
caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
improvements in worst-case mmap_lock hold times during certain
userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements
in his series "Mitigate a vmap lock contention". It realizes a 12x
improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging of
large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages() to
an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which are
configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
=TG55
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
|
||
|
|
0225bdfafd |
mempool: kvmalloc pool
Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(), which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc() fallback. This is part of a bcachefs cleanup - dropping an internal kvpmalloc() helper (which predates kvmalloc()) along with mempool helpers; this replaces the bcachefs-private kvpmalloc_pool. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: linux-mm@kvack.org |
||
|
|
e5e038b7ae |
\n
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmXx5kwACgkQnJ2qBz9k
QNmZowf/UlGJ1rmQFFhoodn3SyK48tQjOZ23Ygx6v9FZiLMuQ3b1k0kWKmwM4lZb
mtRriCm+lPO9Yp/Sflz+jn8S51b/2bcTXiPV4w2Y4ZIun41wwggV7rWPnTCHhu94
rGEPu/SNSBdpxWGv43BKHSDl4XolsGbyusQKBbKZtftnrpIf0y2OnyEXSV91Vnlh
KM/XxzacBD4/3r4KCljyEkORWlIIn2+gdZf58sKtxLKvnfCIxjB+BF1e0gOWgmNQ
e/pVnzbAHO3wuavRlwnrtA+ekBYQiJq7T61yyYI8zpeSoLHmwvPoKSsZP+q4BTvV
yrcVCbGp3uZlXHD93U3BOfdqS0xBmg==
=84Q4
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2, isofs, udf, and quota updates from Jan Kara:
"A lot of material this time:
- removal of a lot of GFP_NOFS usage from ext2, udf, quota (either it
was legacy or replaced with scoped memalloc_nofs_*() API)
- removal of BUG_ONs in quota code
- conversion of UDF to the new mount API
- tightening quota on disk format verification
- fix some potentially unsafe use of RCU pointers in quota code and
annotate everything properly to make sparse happy
- a few other small quota, ext2, udf, and isofs fixes"
* tag 'fs_for_v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (26 commits)
udf: remove SLAB_MEM_SPREAD flag usage
quota: remove SLAB_MEM_SPREAD flag usage
isofs: remove SLAB_MEM_SPREAD flag usage
ext2: remove SLAB_MEM_SPREAD flag usage
ext2: mark as deprecated
udf: convert to new mount API
udf: convert novrs to an option flag
MAINTAINERS: add missing git address for ext2 entry
quota: Detect loops in quota tree
quota: Properly annotate i_dquot arrays with __rcu
quota: Fix rcu annotations of inode dquot pointers
isofs: handle CDs with bad root inode but good Joliet root directory
udf: Avoid invalid LVID used on mount
quota: Fix potential NULL pointer dereference
quota: Drop GFP_NOFS instances under dquot->dq_lock and dqio_sem
quota: Set nofs allocation context when acquiring dqio_sem
ext2: Remove GFP_NOFS use in ext2_xattr_cache_insert()
ext2: Drop GFP_NOFS use in ext2_get_blocks()
ext2: Drop GFP_NOFS allocation from ext2_init_block_alloc_info()
udf: Remove GFP_NOFS allocation in udf_expand_file_adinicb()
...
|
||
|
|
babbcc0232 |
New code for 6.9:
* Online Repair;
** New ondisk structures being repaired.
- Inode's mode field by trying to obtain file type value from the a
directory entry.
- Quota counters.
- Link counts of inodes.
- FS summary counters.
- rmap btrees.
Support for in-memory btrees has been added to support repair of rmap
btrees.
** Misc changes
- Report corruption of metadata to the health tracking subsystem.
- Enable indirect health reporting when resources are scarce.
- Reduce memory usage while reparing refcount btree.
- Extend "Bmap update" intent item to support atomic extent swapping on
the realtime device.
- Extend "Bmap update" intent item to support extended attribute fork and
unwritten extents.
** Code cleanups
- Bmap log intent.
- Btree block pointer checking.
- Btree readahead.
- Buffer target.
- Symbolic link code.
* Remove mrlock wrapper around the rwsem.
* Convert all the GFP_NOFS flag usages to use the scoped
memalloc_nofs_save() API instead of direct calls with the GFP_NOFS.
* Refactor and simplify xfile abstraction. Lower level APIs in
shmem.c are required to be exported in order to achieve this.
* Skip checking alignment constraints for inode chunk allocations when block
size is larger than inode chunk size.
* Do not submit delwri buffers collected during log recovery when an error
has been encountered.
* Fix SEEK_HOLE/DATA for file regions which have active COW extents.
* Fix lock order inversion when executing error handling path during
shrinking a filesystem.
* Remove duplicate ifdefs.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZemMkgAKCRAH7y4RirJu
9ON5AP0Vda6sMn/ZUYoLo9ZUrUvlUb8L0dhEN5JL0XfyWW5ogAD/bH4G6pKSNyTw
cSEjryuDakirdHLt5g0c+QHd2a/fzw0=
=ymKk
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Chandan Babu:
- Online repair updates:
- More ondisk structures being repaired:
- Inode's mode field by trying to obtain file type value from
the a directory entry
- Quota counters
- Link counts of inodes
- FS summary counters
- Support for in-memory btrees has been added to support repair
of rmap btrees
- Misc changes:
- Report corruption of metadata to the health tracking subsystem
- Enable indirect health reporting when resources are scarce
- Reduce memory usage while repairing refcount btree
- Extend "Bmap update" intent item to support atomic extent
swapping on the realtime device
- Extend "Bmap update" intent item to support extended attribute
fork and unwritten extents
- Code cleanups:
- Bmap log intent
- Btree block pointer checking
- Btree readahead
- Buffer target
- Symbolic link code
- Remove mrlock wrapper around the rwsem
- Convert all the GFP_NOFS flag usages to use the scoped
memalloc_nofs_save() API instead of direct calls with the GFP_NOFS
- Refactor and simplify xfile abstraction. Lower level APIs in shmem.c
are required to be exported in order to achieve this
- Skip checking alignment constraints for inode chunk allocations when
block size is larger than inode chunk size
- Do not submit delwri buffers collected during log recovery when an
error has been encountered
- Fix SEEK_HOLE/DATA for file regions which have active COW extents
- Fix lock order inversion when executing error handling path during
shrinking a filesystem
- Remove duplicate ifdefs
* tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (183 commits)
xfs: shrink failure needs to hold AGI buffer
mm/shmem.c: Use new form of *@param in kernel-doc
kernel-doc: Add unary operator * to $type_param_ref
xfs: use kvfree() in xlog_cil_free_logvec()
xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL
xfs: fix scrub stats file permissions
xfs: fix log recovery erroring out on refcount recovery failure
xfs: move symlink target write function to libxfs
xfs: move remote symlink target read function to libxfs
xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h
xfs: xfs_bmap_finish_one should map unwritten extents properly
xfs: support deferred bmap updates on the attr fork
xfs: support recovering bmap intent items targetting realtime extents
xfs: add a realtime flag to the bmap update log redo items
xfs: add a xattr_entry helper
xfs: fix xfs_bunmapi to allow unmapping of partial rt extents
xfs: move xfs_bmap_defer_add to xfs_bmap_item.c
xfs: reuse xfs_bmap_update_cancel_item
xfs: add a bi_entry helper
xfs: remove xfs_trans_set_bmap_flags
...
|
||
|
|
270700dd06 |
mm/zswap: remove the memcpy if acomp is not sleepable
Most compressors are actually CPU-based and won't sleep during compression and decompression. We should remove the redundant memcpy for them. This patch checks if the algorithm is sleepable by testing the CRYPTO_ALG_ASYNC algorithm flag. Generally speaking, async and sleepable are semantically similar but not equal. But for compress drivers, they are basically equal at least due to the below facts. Firstly, scompress drivers - crypto/deflate.c, lz4.c, zstd.c, lzo.c etc have no sleep. Secondly, zRAM has been using these scompress drivers for years in atomic contexts, and never worried those drivers going to sleep. One exception is that an async driver can sometimes still return synchronously per Herbert's clarification. In this case, we are still having a redundant memcpy. But we can't know if one particular acomp request will sleep or not unless crypto can expose more details for each specific request from offload drivers. Link: https://lkml.kernel.org/r/20240222081135.173040-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Tested-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
82634d7e24 |
memtest: use {READ,WRITE}_ONCE in memory scanning
memtest failed to find bad memory when compiled with clang. So use
{WRITE,READ}_ONCE to access memory to avoid compiler over optimization.
Link: https://lkml.kernel.org/r/20240312080422.691222-1-qiang4.zhang@intel.com
Signed-off-by: Qiang Zhang <qiang4.zhang@intel.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
cd197c3a20 |
mm: prohibit the last subpage from reusing the entire large folio
In a Copy-on-Write (CoW) scenario, the last subpage will reuse the entire
large folio, resulting in the waste of (nr_pages - 1) pages. This wasted
memory remains allocated until it is either unmapped or memory reclamation
occurs.
The following small program can serve as evidence of this behavior
main()
{
#define SIZE 1024 * 1024 * 1024UL
void *p = malloc(SIZE);
memset(p, 0x11, SIZE);
if (fork() == 0)
_exit(0);
memset(p, 0x12, SIZE);
printf("done\n");
while(1);
}
For example, using a 1024KiB mTHP by:
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-1024kB/enabled
(1) w/o the patch, it takes 2GiB,
Before running the test program,
/ # free -m
total used free shared buff/cache available
Mem: 5754 84 5692 0 17 5669
Swap: 0 0 0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
total used free shared buff/cache available
Mem: 5754 2149 3627 0 19 3605
Swap: 0 0 0
(2) w/ the patch, it takes 1GiB only,
Before running the test program,
/ # free -m
total used free shared buff/cache available
Mem: 5754 89 5687 0 17 5664
Swap: 0 0 0
/ # /a.out &
/ # done
After running the test program,
/ # free -m
total used free shared buff/cache available
Mem: 5754 1122 4655 0 17 4632
Swap: 0 0 0
This patch migrates the last subpage to a small folio and immediately
returns the large folio to the system. It benefits both memory availability
and anti-fragmentation.
Link: https://lkml.kernel.org/r/20240308092721.144735-1-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Lance Yang <ioworker0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
0ea680eda6 |
slab changes for 6.9
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmXwH0wACgkQu+CwddJF iJq3HAf6A/0m0pSr0QDcwjM8D7TVYQJ+Z/jPC6Mj+HfTcF8Otrgk8c0M6EsHGIGF GQNnYJRKmBla3mpVFvDtsVZuiakEtRLCpoP5n23s8p8gY9ibJcl6bpn9NaMVMKrq kBnhQ9VdLAgKVcTH8wz6jJqdWiZ7W4jGH5NWO+nr+r0H7vay7jfB0+tur1NO8J09 HE5I76XE6ArRvaKYxvsZmOx1pihSmsJ7CerXN6Y8U5qcuxNXdUO/9rf+uv5llDIV gl54UAU79koZ9k88t5AiSKO2IZVhBgC/j66ds9MRRAFCf/ldxUtJIlsHTOnumfmy FApqwtR0MYNPeMPZpzogQbv58oOcNw== =XDxn -----END PGP SIGNATURE----- Merge tag 'slab-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Freelist loading optimization (Chengming Zhou) When the per-cpu slab is depleted and a new one loaded from the cpu partial list, optimize the loading to avoid an irq enable/disable cycle. This results in a 3.5% performance improvement on the "perf bench sched messaging" test. - Kernel boot parameters cleanup after SLAB removal (Xiongwei Song) Due to two different main slab implementations we've had boot parameters prefixed either slab_ and slub_ with some later becoming an alias as both implementations gained the same functionality (i.e. slab_nomerge vs slub_nomerge). In order to eventually get rid of the implementation-specific names, the canonical and documented parameters are now all prefixed slab_ and the slub_ variants become deprecated but still working aliases. - SLAB_ kmem_cache creation flags cleanup (Vlastimil Babka) The flags had hardcoded #define values which became tedious and error-prone when adding new ones. Assign the values via an enum that takes care of providing unique bit numbers. Also deprecate SLAB_MEM_SPREAD which was only used by SLAB, so it's a no-op since SLAB removal. Assign it an explicit zero value. The removals of the flag usage are handled independently in the respective subsystems, with a final removal of any leftover usage planned for the next release. - Misc cleanups and fixes (Chengming Zhou, Xiaolei Wang, Zheng Yejian) Includes removal of unused code or function parameters and a fix of a memleak. * tag 'slab-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: remove PARTIAL_NODE slab_state mm, slab: remove memcg_from_slab_obj() mm, slab: remove the corner case of inc_slabs_node() mm/slab: Fix a kmemleak in kmem_cache_destroy() mm, slab, kasan: replace kasan_never_merge() with SLAB_NO_MERGE mm, slab: use an enum to define SLAB_ cache creation flags mm, slab: deprecate SLAB_MEM_SPREAD flag mm, slab: fix the comment of cpu partial list mm, slab: remove unused object_size parameter in kmem_cache_flags() mm/slub: remove parameter 'flags' in create_kmalloc_caches() mm/slub: remove unused parameter in next_freelist_entry() mm/slub: remove full list manipulation for non-debug slab mm/slub: directly load freelist from cpu partial slab in the likely case mm/slub: make the description of slab_min_objects helpful in doc mm/slub: replace slub_$params with slab_$params in slub.rst mm/slub: unify all sl[au]b parameters with "slab_$param" Documentation: kernel-parameters: remove noaliencache |
||
|
|
9187210eee |
Networking changes for 6.9.
Core & protocols
----------------
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps etc.)
lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core
instead of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length
and budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global config
variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug
of ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for
use on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code
--------------------------------------------
- Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena).
- Rework selftest harness to enable the use of the full range of
ksft exit code (pass, fail, skip, xfail, xpass).
Netfilter
---------
- Allow userspace to define a table that is exclusively owned by a daemon
(via netlink socket aliveness) without auto-removing this table when
the userspace program exits. Such table gets marked as orphaned and
a restarting management daemon can re-attach/regain ownership.
- Speed up element insertions to nftables' concatenated-ranges set type.
Compact a few related data structures.
BPF
---
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between BPF
program and user space where structures inside the arena can have
pointers to other areas of the arena, and pointers work seamlessly
for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the verifier
and the program. The verifier allows the program to loop assuming it's
behaving well, but reserves the right to terminate it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF objects.
Wireless
--------
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API
----------
- Major overhaul of the Energy Efficient Ethernet internals to support
new link modes (2.5GE, 5GE), share more code between drivers
(especially those using phylib), and encourage more uniform behavior.
Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc
----
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions,
and packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message encapsulation
or "Netlink polymorphism", where interpretation of nested attributes
depends on link type, classifier type or some other "class type".
Drivers
-------
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic
on CAN BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO) support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs to
have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmXv0mgACgkQMUZtbf5S
IrtgMxAAuRd+WJW++SENr4KxIWhYO1q6Xcxnai43wrNkan9swD24icG8TYALt4f3
yoT6idQvWReAb5JNlh9rUQz8R7E0nJXlvEFn5MtJwcthx2C6wFo/XkJlddlRrT+j
c2xGILwLjRhW65LaC0MZ2ECbEERkFz8xcGfK2SWzUgh6KYvPjcRfKFxugpM7xOQK
P/Wnqhs4fVRS/Mj/bCcXcO+yhwC121Q3qVeQVjGS0AzEC65hAW87a/kc2BfgcegD
EyI9R7mf6criQwX+0awubjfoIdr4oW/8oDVNvUDczkJkbaEVaLMQk9P5x/0XnnVS
UHUchWXyI80Q8Rj12uN1/I0h3WtwNQnCRBuLSmtm6GLfCAwbLvp2nGWDnaXiqryW
DVKUIHGvqPKjkOOMOVfSvfB3LvkS3xsFVVYiQBQCn0YSs/gtu4CoF2Nty9CiLPbK
tTuxUnLdPDZDxU//l0VArZmP8p2JM7XQGJ+JH8GFH4SBTyBR23e0iyPSoyaxjnYn
RReDnHMVsrS1i7GPhbqDJWn+uqMSs7N149i0XmmyeqwQHUVSJN3J2BApP2nCaDfy
H2lTuYly5FfEezt61NvCE4qr/VsWeEjm1fYlFQ9dFn4pGn+HghyCpw+xD1ZN56DN
lujemau5B3kk1UTtAT4ypPqvuqjkRFqpNV2LzsJSk/Js+hApw8Y=
=oY52
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
|
||
|
|
2394aef616 |
mm/huge_memory: skip invalid debugfs new_order input for folio split
User can put arbitrary new_order via debugfs for folio split test. Although new_order check is added to split_huge_page_to_list_order() in the prior commit, these two additional checks can avoid unnecessary folio locking and split_folio_to_order() calls. Link: https://lkml.kernel.org/r/20240307181854.138928-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/linux-mm/7dda9283-b437-4cf8-ab0d-83c330deb9c0@moroto.mountain/ Cc: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
1412ecb3d2 |
mm/huge_memory: check new folio order when split a folio
A folio can only be split into lower orders.
Since there are no new_order checks in debugfs, any new_order can be
passed via debugfs into split_huge_page_to_list_to_order().
Check new_order to make sure it is smaller than input folio order.
Link: https://lkml.kernel.org/r/20240307181854.138928-1-zi.yan@sent.com
Fixes:
|
||
|
|
d221dd5fea |
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
With cache_trim_mode on, reclaim logic doesn't bother reclaiming anon
pages. However, it should be more careful to use the mode because it's
going to prevent anon pages from being reclaimed even if there are a huge
number of anon pages that are cold and should be reclaimed. Even worse,
that leads kswapd_failures to reach MAX_RECLAIM_RETRIES and stopping
kswapd from functioning until direct reclaim eventually works to resume
kswapd.
So kswapd needs to retry its scan priority loop with cache_trim_mode off
again if the mode doesn't work for reclaim.
The problematic behavior can be reproduced by:
CONFIG_NUMA_BALANCING enabled
sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
numa node0 (8GB local memory, 16 CPUs)
numa node1 (8GB slow tier memory, no CPUs)
Sequence:
1) echo 3 > /proc/sys/vm/drop_caches
2) To emulate the system with full of cold memory in local DRAM, run
the following dummy program and never touch the region:
mmap(0, 8 * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
3) Run any memory intensive work e.g. XSBench.
4) Check if numa balancing is working e.i. promotion/demotion.
5) Iterate 1) ~ 4) until numa balancing stops.
With this, you could see that promotion/demotion are not working because
kswapd has stopped due to ->kswapd_failures >= MAX_RECLAIM_RETRIES.
Interesting vmstat delta's differences between before and after are like:
+-----------------------+-------------------------------+
| interesting vmstat | before | after |
+-----------------------+-------------------------------+
| nr_inactive_anon | 321935 | 1664772 |
| nr_active_anon | 1780700 | 437834 |
| nr_inactive_file | 30425 | 40882 |
| nr_active_file | 14961 | 3012 |
| pgpromote_success | 356 | 1293122 |
| pgpromote_candidate | 21953245 | 1824148 |
| pgactivate | 1844523 | 3311907 |
| pgdeactivate | 50634 |
|
||
|
|
b14d1671dd |
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier is included as part of UFFDIO_CONTINUE. That is, a user may believe that all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone subsequently reading the page through the newly mapped virtual memory region. Today, such a user happens to be correct. mmget_not_zero(), for example, is called as part of UFFDIO_CONTINUE (and comes before any PTE updates), and it implicitly gives us a write barrier. To be resilient against future changes, include an explicit smp_wmb(). While we're at it, optimize the smp_wmb() that is already incidentally present for the HugeTLB case. Merely making a syscall does not generally imply the memory ordering constraints that we need (including on x86). Link: https://lkml.kernel.org/r/20240307010250.3847179-1-jthoughton@google.com Signed-off-by: James Houghton <jthoughton@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b555895c31 |
mm: fix list corruption in put_pages_list
My recent change to put_pages_list() dereferences folio->lru.next after
returning the folio to the page allocator. Usually this is now on the pcp
list with other free folios, so we try to free an already-free folio.
This only happens with lists that have more than 15 entries, so it wasn't
immediately discovered. Revert to using list_for_each_safe() so we
dereference lru.next before disposing of the folio.
Link: https://lkml.kernel.org/r/20240306212749.1823380-1-willy@infradead.org
Fixes:
|
||
|
|
47932e7048 |
mm: remove folio from deferred split list before uncharging it
When freeing a large folio, we must remove it from the deferred split list before we uncharge it as each memcg has its own deferred split list (with associated lock) and removing a folio from the deferred split list while holding the wrong lock will corrupt that list and cause various related problems. Link: https://lore.kernel.org/linux-mm/367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com/ Link: https://lkml.kernel.org/r/20240311191835.312162-1-willy@infradead.org Fixes: |
||
|
|
2184dbcde4 |
ARM: SoC drivers for 6.9
This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code.
- Feature updates for Tegra and Qualcomm platforms for added
hardware support.
- A number of platforms get soc_device additions for identifying newly
added chips from Renesas, Qualcomm, Mediatek and Google.
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmXvgbsACgkQYKtH/8kJ
UieH8Q/+LRzESrScIwFq0/V7lE1AadmhwMwcEf1Fsq8aMrelvPm/SWvHgIWIHTvV
IZ/g3XS/CnBxr1JG3nbyMMe/2otEY7JxsUOOqixIuZ2gdzJvzZOBHMi54xDwbFRx
4NbP0CRTy8K35XNnOkJO3TnwBFP+q2Fu6qHY90as8M2GIxQpWb8OONJHh8N2qPq+
Hi3H0jjKXMInnOKpNIEQI60N4F2djGMHWkDySwFtHu40RaJjCIfmVd3PWQGz7RHl
WQHjZ6CB+/BDgqfG0ccQ7Cikc4BLorZsjKCn8bsaLtdp4HvRCTp2ZpuFFTRq6vay
IxqJCXrgpKjM1k9plehObEhMv4lNMbD1djG8Y6hqC+PPKbDfOLvlcat3xUK2AGgb
ROJtKDQMXfAeSnLpw9n4Ox+BZRmwMIOcTU/20N72hlcZKY1jq/KuSqQn+LPVKIrW
pJIhWd1B8R+2O1TewuIe3fjvfQwgATMBHBUVNRkSrzqkpcZNGQ3M5koMpClVvY6T
Z/+hdAg58EQw0K6ukJLyrevxs1pHHhYXLCECIoU/xPs4NX4hDk7rKTFv6fdLS4Y2
24qzjhIGYdhRXmhRQdVq+06cr3cvtm1z7Fqna3tW1+J6wtBnHO/xZ63M9n5saPcm
NgKMAN7YLLMYuUNrd39W7U2wLGQCgknjhrbH8ZmxPypk467v08k=
=bV/K
-----END PGP SIGNATURE-----
Merge tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code
- Feature updates for Tegra and Qualcomm platforms for added hardware
support
- A number of platforms get soc_device additions for identifying
newly added chips from Renesas, Qualcomm, Mediatek and Google
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems"
* tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits)
tee: make tee_bus_type const
soc: qcom: aoss: add missing kerneldoc for qmp members
soc: qcom: geni-se: drop unused kerneldoc struct geni_wrapper param
soc: qcom: spm: fix building with CONFIG_REGULATOR=n
bus: ti-sysc: constify the struct device_type usage
memory: stm32-fmc2-ebi: keep power domain on
memory: stm32-fmc2-ebi: add MP25 RIF support
memory: stm32-fmc2-ebi: add MP25 support
memory: stm32-fmc2-ebi: check regmap_read return value
dt-bindings: memory-controller: st,stm32: add MP25 support
dt-bindings: bus: imx-weim: convert to YAML
watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs
soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regs
MAINTAINERS: Update SCMI entry with HWMON driver
MAINTAINERS: samsung: gs101: match patches touching Google Tensor SoC
memory: tegra: Fix indentation
memory: tegra: Add BPMP and ICC info for DLA clients
memory: tegra: Correct DLA client names
dt-bindings: memory: renesas,rpc-if: Document R-Car V4M support
firmware: arm_scmi: Update the supported clock protocol version
...
|
||
|
|
1a1c4e4576 |
Merge branch 'slab/for-6.9/slab-flag-cleanups' into slab/for-linus
Merge a series from myself that replaces hardcoded SLAB_ cache flag values with an enum, and explicitly deprecates the SLAB_MEM_SPREAD flag that is a no-op sine SLAB removal. |
||
|
|
466ed9eed6 |
Merge branch 'slab/for-6.9/optimize-get-freelist' into slab/for-linus
Merge a series from Chengming Zhou that optimizes cpu freelist loading when grabbing a cpu partial slab, and removes some unnecessary code. |
||
|
|
5f20e6ab1f |
for-netdev
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmXvm7IACgkQ6rmadz2v bTqdMA//VMHNHVLb4oROoXyQD9fw2mCmIUEKzP88RXfqcxsfEX7HF+k8B5ZTk0ro CHXTAnc79+Qqg0j24bkQKxup/fKBQVw9D+Ia4b3ytlm1I2MtyU/16xNEzVhAPU2D iKk6mVBsEdCbt/GjpWORy/VVnZlZpC7BOpZLxsbbxgXOndnCegyjXzSnLGJGxdvi zkrQTn2SrFzLi6aNpVLqrv6Nks6HJusfCKsIrtlbkQ85dulasHOtwK9s6GF60nte aaho+MPx3L+lWEgapsm8rR779pHaYIB/GbZUgEPxE/xUJ/V8BzDgFNLMzEiIBRMN a0zZam11BkBzCfcO9gkvDRByaei/dZz2jdqfU4GlHklFj1WFfz8Q7fRLEPINksvj WXLgJADGY5mtGbjG21FScThxzj+Ruqwx0a13ddlyI/W+P3y5yzSWsLwJG5F9p0oU 6nlkJ4U8yg+9E1ie5ae0TibqvRJzXPjfOERZGwYDSVvfQGzv1z+DGSOPMmgNcWYM dIaO+A/+NS3zdbk8+1PP2SBbhHPk6kWyCUByWc7wMzCPTiwriFGY/DD2sN+Fsufo zorzfikUQOlTfzzD5jbmT49U8hUQUf6QIWsu7BijSiHaaC7am4S8QB2O6ibJMqdv yNiwvuX+ThgVIY3QKrLLqL0KPGeKMR5mtfq6rrwSpfp/b4g27FE= =eFgA -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2024-03-11 We've added 59 non-merge commits during the last 9 day(s) which contain a total of 88 files changed, 4181 insertions(+), 590 deletions(-). The main changes are: 1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena, from Alexei. 2) Introduce bpf_arena which is sparse shared memory region between bpf program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and bpf programs, from Alexei and Andrii. 3) Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it, from Alexei. 4) Use IETF format for field definitions in the BPF standard document, from Dave. 5) Extend struct_ops libbpf APIs to allow specify version suffixes for stuct_ops map types, share the same BPF program between several map definitions, and other improvements, from Eduard. 6) Enable struct_ops support for more than one page in trampolines, from Kui-Feng. 7) Support kCFI + BPF on riscv64, from Puranjay. 8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay. 9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke. ==================== Link: https://lore.kernel.org/r/20240312003646.8692-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
0f1a876682 |
vfs-6.9.uuid
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem5LwAKCRCRxhvAZXjc onZsAQCjMNabNWAty2VBAQrNIpGkZ+AMA2DxEajPldaPiJH5zQEA9ea7feB3T47i NUrXXfMQ5DSop+k5Y65pPkEpbX4rhQo= =NZgd -----END PGP SIGNATURE----- Merge tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid() |
||
|
|
910202f00a |
vfs-6.9.super
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem4DwAKCRCRxhvAZXjc
ooTRAQDRI6Qz6wJym5Yblta8BScMGbt/SgrdgkoCvT6y83MtqwD+Nv/AZQzi3A3l
9NdULtniW1reuCYkc8R7dYM8S+yAwAc=
=Y1qX
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull block handle updates from Christian Brauner:
"Last cycle we changed opening of block devices, and opening a block
device would return a bdev_handle. This allowed us to implement
support for restricting and forbidding writes to mounted block
devices. It was accompanied by converting and adding helpers to
operate on bdev_handles instead of plain block devices.
That was already a good step forward but ultimately it isn't necessary
to have special purpose helpers for opening block devices internally
that return a bdev_handle.
Fundamentally, opening a block device internally should just be
equivalent to opening files. So now all internal opens of block
devices return files just as a userspace open would. Instead of
introducing a separate indirection into bdev_open_by_*() via struct
bdev_handle bdev_file_open_by_*() is made to just return a struct
file. Opening and closing a block device just becomes equivalent to
opening and closing a file.
This all works well because internally we already have a pseudo fs for
block devices and so opening block devices is simple. There's a few
places where we needed to be careful such as during boot when the
kernel is supposed to mount the rootfs directly without init doing it.
Here we need to take care to ensure that we flush out any asynchronous
file close. That's what we already do for opening, unpacking, and
closing the initramfs. So nothing new here.
The equivalence of opening and closing block devices to regular files
is a win in and of itself. But it also has various other advantages.
We can remove struct bdev_handle completely. Various low-level helpers
are now private to the block layer. Other helpers were simply
removable completely.
A follow-up series that is already reviewed build on this and makes it
possible to remove bdev->bd_inode and allows various clean ups of the
buffer head code as well. All places where we stashed a bdev_handle
now just stash a file and use simple accessors to get to the actual
block device which was already the case for bdev_handle"
* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
block: remove bdev_handle completely
block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
bdev: remove bdev pointer from struct bdev_handle
bdev: make struct bdev_handle private to the block layer
bdev: make bdev_{release, open_by_dev}() private to block layer
bdev: remove bdev_open_by_path()
reiserfs: port block device access to file
ocfs2: port block device access to file
nfs: port block device access to files
jfs: port block device access to file
f2fs: port block device access to files
ext4: port block device access to file
erofs: port device access to file
btrfs: port device access to file
bcachefs: port block device access to file
target: port block device access to file
s390: port block device access to file
nvme: port block device access to file
block2mtd: port device access to files
bcache: port block device access to files
...
|
||
|
|
7ea65c89d8 |
vfs-6.9.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZem3wQAKCRCRxhvAZXjc
otRMAQDeo8qsuuIAcS2KUicKqZR5yMVvrY9r4sQzf7YRcJo5HQD+NQXkKwQuv1VO
OUeScsic/+I+136AgdjWnlEYO5dp0go=
=4WKU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Misc features, cleanups, and fixes for vfs and individual filesystems.
Features:
- Support idmapped mounts for hugetlbfs.
- Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
where the passed offset is ignored if the file is O_APPEND. The new
flag allows a caller to enforce that the offset is honored to
conform to posix even if the file was opened in append mode.
- Move i_mmap_rwsem in struct address_space to avoid false sharing
between i_mmap and i_mmap_rwsem.
- Convert efs, qnx4, and coda to use the new mount api.
- Add a generic is_dot_dotdot() helper that's used by various
filesystems and the VFS code instead of open-coding it multiple
times.
- Recently we've added stable offsets which allows stable ordering
when iterating directories exported through NFS on e.g., tmpfs
filesystems. Originally an xarray was used for the offset map but
that caused slab fragmentation issues over time. This switches the
offset map to the maple tree which has a dense mode that handles
this scenario a lot better. Includes tests.
- Finally merge the case-insensitive improvement series Gabriel has
been working on for a long time. This cleanly propagates case
insensitive operations through ->s_d_op which in turn allows us to
remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
It also improves performance by trying a case-sensitive comparison
first and then fallback to case-insensitive lookup if that fails.
This also fixes a bug where overlayfs would be able to be mounted
over a case insensitive directory which would lead to all sort of
odd behaviors.
Cleanups:
- Make file_dentry() a simple accessor now that ->d_real() is
simplified because of the backing file work we did the last two
cycles.
- Use the dedicated file_mnt_idmap helper in ntfs3.
- Use smp_load_acquire/store_release() in the i_size_read/write
helpers and thus remove the hack to handle i_size reads in the
filemap code.
- The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
fs/
- It's no longer necessary to perform a second built-in initramfs
unpack call because we retain the contents of the previous
extraction. Remove it.
- Now that we have removed various allocators kfree_rcu() always
works with kmem caches and kmalloc(). So simplify various places
that only use an rcu callback in order to handle the kmem cache
case.
- Convert the pipe code to use a lockdep comparison function instead
of open-coding the nesting making lockdep validation easier.
- Move code into fs-writeback.c that was located in a header but can
be made static as it's only used in that one file.
- Rewrite the alignment checking iterators for iovec and bvec to be
easier to read, and also significantly more compact in terms of
generated code. This saves 270 bytes of text on x86-64 (with
clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
saves a bit of time for the same workload.
- Switch various places to use KMEM_CACHE instead of
kmem_cache_create().
- Use inode_set_ctime_to_ts() in inode_set_ctime_current()
- Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.
- Various smaller cleanups for eventfds.
Fixes:
- Fix various comments and typos, and unneeded initializations.
- Fix stack allocation hack for clang in the select code.
- Improve dump_mapping() debug code on a best-effort basis.
- Fix build errors in various selftests.
- Avoid wrap-around instrumentation in various places.
- Don't allow user namespaces without an idmapping to be used for
idmapped mounts.
- Fix sysv sb_read() call.
- Fix fallback implementation of the get_name() export operation"
* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
hugetlbfs: support idmapped mounts
qnx4: convert qnx4 to use the new mount api
fs: use inode_set_ctime_to_ts to set inode ctime to current time
libfs: Drop generic_set_encrypted_ci_d_ops
ubifs: Configure dentry operations at dentry-creation time
f2fs: Configure dentry operations at dentry-creation time
ext4: Configure dentry operations at dentry-creation time
libfs: Add helper to choose dentry operations at mount-time
libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
fscrypt: Drop d_revalidate once the key is added
fscrypt: Drop d_revalidate for valid dentries during lookup
fscrypt: Factor out a helper to configure the lookup dentry
ovl: Always reject mounting over case-insensitive directories
libfs: Attempt exact-match comparison first during casefolded lookup
efs: remove SLAB_MEM_SPREAD flag usage
jfs: remove SLAB_MEM_SPREAD flag usage
minix: remove SLAB_MEM_SPREAD flag usage
openpromfs: remove SLAB_MEM_SPREAD flag usage
proc: remove SLAB_MEM_SPREAD flag usage
qnx6: remove SLAB_MEM_SPREAD flag usage
...
|
||
|
|
d7bca9199a |
mm: Introduce vmap_page_range() to map pages in PCI address space
ioremap_page_range() should be used for ranges within vmalloc range only.
The vmalloc ranges are allocated by get_vm_area(). PCI has "resource"
allocator that manages PCI_IOBASE, IO_SPACE_LIMIT address range, hence
introduce vmap_page_range() to be used exclusively to map pages
in PCI address space.
Fixes:
|
||
|
|
3aaa8ce7a3 |
6 hotfixes. 4 are cc:stable and the remainder pertain to post-6.7
issues or aren't considered to be needed in earlier kernel versions. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZepZNgAKCRDdBJ7gKXxA jpEWAQC8ThQlyArXO8uXHwa8MDYgUKj02CIQE+jZ3pXIdL8w8gD/UGQQod+DBr3l zK3AljRd4hfrKVJB7H1+Zx/6PlH7Bgg= =DG4B -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-03-07-16-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "6 hotfixes. 4 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-03-07-16-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: scripts/gdb/symbols: fix invalid escape sequence warning mailmap: fix Kishon's email init/Kconfig: lower GCC version check for -Warray-bounds mm, mmap: fix vma_merge() case 7 with vma_ops->close mm: userfaultfd: fix unexpected change to src_folio when UFFDIO_MOVE fails mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations |
||
|
|
e6f798225a |
mm: Introduce VM_SPARSE kind and vm_area_[un]map_pages().
vmap/vmalloc APIs are used to map a set of pages into contiguous kernel virtual space. get_vm_area() with appropriate flag is used to request an area of kernel address range. It's used for vmalloc, vmap, ioremap, xen use cases. - vmalloc use case dominates the usage. Such vm areas have VM_ALLOC flag. - the areas created by vmap() function should be tagged with VM_MAP. - ioremap areas are tagged with VM_IOREMAP. BPF would like to extend the vmap API to implement a lazily-populated sparse, yet contiguous kernel virtual space. Introduce VM_SPARSE flag and vm_area_map_pages(area, start_addr, count, pages) API to map a set of pages within a given area. It has the same sanity checks as vmap() does. It also checks that get_vm_area() was created with VM_SPARSE flag which identifies such areas in /proc/vmallocinfo and returns zero pages on read through /proc/kcore. The next commits will introduce bpf_arena which is a sparsely populated shared memory region between bpf program and user space process. It will map privately-managed pages into a sparse vm area with the following steps: // request virtual memory region during bpf prog verification area = get_vm_area(area_size, VM_SPARSE); // on demand vm_area_map_pages(area, kaddr, kend, pages); vm_area_unmap_pages(area, kaddr, kend); // after bpf program is detached and unloaded free_vm_area(area); Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/bpf/20240305030516.41519-3-alexei.starovoitov@gmail.com |
||
|
|
58f327f2ce |
filemap: avoid unnecessary major faults in filemap_fault()
A major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) in
application, which leading to an unexpected issue[1].
This is caused by temporarily cleared PTE during a read+clear/modify/write
update of the PTE, eg, do_numa_page()/change_pte_range().
For the data segment of the user-mode program, the global variable area is
a private mapping. After the pagecache is loaded, the private anonymous
page is generated after the COW is triggered. Mlockall can lock COW pages
(anonymous pages), but the original file pages cannot be locked and may be
reclaimed. If the global variable (private anon page) is accessed when
vmf->pte is zeroed in numa fault, a file page fault will be triggered. At
this time, the original private file page may have been reclaimed. If the
page cache is not available at this time, a major fault will be triggered
and the file will be read, causing additional overhead.
This issue affects our traffic analysis service. The inbound traffic is
heavy. If a major fault occurs, the I/O schedule is triggered and the
original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If
other applications are operating disks, the system needs to wait for more
than 10 ms. However, the inbound traffic is heavy and the NIC buffer is
small. As a result, packet loss occurs. But the traffic analysis service
can't tolerate packet loss.
Fix this by holding PTL and rechecking the PTE in filemap_fault() before
triggering a major fault. We do this check only if vma is VM_LOCKED to
reduce the performance impact in common scenarios.
In our product environment, there were 7 major faults every 12 hours.
After the patch is applied, no major fault have been triggered.
Testing file page read and write page fault performance in ext4 and
ramdisk using will-it-scale[2] on a x86 physical machine. The data is the
average change compared with the mainline after the patch is applied. The
test results are within the range of fluctuation. We do this check only
if vma is VM_LOCKED, therefore, no performance regressions is caused for
most common cases.
The test results are as follows:
processes processes_idle threads threads_idle
ext4 private file write: 0.22% 0.26% 1.21% -0.15%
ext4 private file read: 0.03% 1.00% 1.39% 0.34%
ext4 shared file write: -0.50% -0.02% -0.14% -0.02%
ramdisk private file write: 0.07% 0.02% 0.53% 0.04%
ramdisk private file read: 0.01% 1.60% -0.32% -0.02%
[1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
[2] https://github.com/antonblanchard/will-it-scale/
Link: https://lkml.kernel.org/r/20240306083809.1236634-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
4839e79c7e |
mm,page_owner: drop unnecessary check
stackdepot only saves stack_records which size is greather than 0, so we cannot possibly have empty stack_records. Drop the check. Link: https://lkml.kernel.org/r/20240306123217.29774-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
84d6ac31c3 |
mm,page_owner: check for null stack_record before bumping its refcount
Patch series "page_owner: Fixup and cleanup".
This patchset consists of a fixup by an error that was reported by intel
robot, where it seems to be that by the time page_owner gets initialized,
stackdepot has already depleted its allocation space and returns
0-handles, turning that into null stack_records when trying to retrieve
the stack_record. I was not able to reproduce that from the config
because it booted fine for me, but when setting e.g: dummy_handle to 0
artificially, I could see the same error that was reported.
The second patch is a cleanup that can also lead to a compilation warning.
This patch (of 2):
Although the retrieval of the stack_records for {dummy,failure}_handle
happen when page_owner gets initialized, there seems to be some situations
where stackdepot space has been already depleted by then, so we get
0-handles which make stack_records being NULL for those cases.
Be careful to 1) only bump stack_records refcount and 2) only access
stack_record fields if we actually have a non-null stack_record between
hands.
Link: https://lkml.kernel.org/r/20240306123217.29774-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20240306123217.29774-2-osalvador@suse.de
Fixes:
|
||
|
|
82b1c07a0a |
mm: swap: fix race between free_swap_and_cache() and swapoff()
There was previously a theoretical window where swapoff() could run and
teardown a swap_info_struct while a call to free_swap_and_cache() was
running in another thread. This could cause, amongst other bad
possibilities, swap_page_trans_huge_swapped() (called by
free_swap_and_cache()) to access the freed memory for swap_map.
This is a theoretical problem and I haven't been able to provoke it from a
test case. But there has been agreement based on code review that this is
possible (see link below).
Fix it by using get_swap_device()/put_swap_device(), which will stall
swapoff(). There was an extra check in _swap_info_get() to confirm that
the swap entry was not free. This isn't present in get_swap_device()
because it doesn't make sense in general due to the race between getting
the reference and swapoff. So I've added an equivalent check directly in
free_swap_and_cache().
Details of how to provoke one possible issue (thanks to David Hildenbrand
for deriving this):
--8<-----
__swap_entry_free() might be the last user and result in
"count == SWAP_HAS_CACHE".
swapoff->try_to_unuse() will stop as soon as soon as si->inuse_pages==0.
So the question is: could someone reclaim the folio and turn
si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
still references by swap entries.
Process 1 still references subpage 0 via swap entry.
Process 2 still references subpage 1 via swap entry.
Process 1 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
[then, preempted in the hypervisor etc.]
Process 2 quits. Calls free_swap_and_cache().
-> count == SWAP_HAS_CACHE
Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
__try_to_reclaim_swap().
__try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
swap_entry_free()->swap_range_free()->
...
WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
What stops swapoff to succeed after process 2 reclaimed the swap cache
but before process1 finished its call to swap_page_trans_huge_swapped()?
--8<-----
Link: https://lkml.kernel.org/r/20240306140356.3974886-1-ryan.roberts@arm.com
Fixes:
|
||
|
|
b6c9d5a93b |
mm/kasan: use pXd_leaf() in shadow_mapped()
There is an old trick in shadow_mapped() to use pXd_bad() to detect huge
pages. After commit
|
||
|
|
e35606e416 |
mm/zswap: global lru and shrinker shared by all zswap_pools fix
Commit |
||
|
|
5aa598a72e |
mm: memory: fix shift-out-of-bounds in fault_around_bytes_set
The rounddown_pow_of_two(0) is undefined, so val = 0 is not allowed in the
fault_around_bytes_set(), and leads to shift-out-of-bounds,
UBSAN: shift-out-of-bounds in include/linux/log2.h:67:13
shift exponent 4294967295 is too large for 64-bit type 'long unsigned int'
CPU: 7 PID: 107 Comm: sh Not tainted 6.8.0-rc6-next-20240301 #294
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
dump_backtrace+0x94/0xec
show_stack+0x18/0x24
dump_stack_lvl+0x78/0x90
dump_stack+0x18/0x24
ubsan_epilogue+0x10/0x44
__ubsan_handle_shift_out_of_bounds+0x98/0x134
fault_around_bytes_set+0xa4/0xb0
simple_attr_write_xsigned.isra.0+0xe4/0x1ac
simple_attr_write+0x18/0x24
debugfs_attr_write+0x4c/0x98
vfs_write+0xd0/0x4b0
ksys_write+0x6c/0xfc
__arm64_sys_write+0x1c/0x28
invoke_syscall+0x44/0x104
el0_svc_common.constprop.0+0x40/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x34/0xdc
el0t_64_sync_handler+0xc0/0xc4
el0t_64_sync+0x190/0x194
---[ end trace ]---
Fix it by setting the minimum val to PAGE_SIZE.
Link: https://lkml.kernel.org/r/20240302064312.2358924-1-wangkefeng.wang@huawei.com
Fixes:
|
||
|
|
72741db683 |
mm: page_alloc: use div64_ul() instead of do_div()
Fixes Coccinelle/coccicheck warning reported by do_div.cocci. Compared to do_div(), div64_ul() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lkml.kernel.org/r/20240228224911.1164-2-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
f1cce6f7fa |
mm/mempolicy: use a folio in do_mbind()
We actually add folios to the pagelist already, but then work with them as pages. Removes a call to compound_head() in PageKsm() and removes a reference to page->index. Link: https://lkml.kernel.org/r/20240229153015.1996829-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Gregory Price <gregory.price@memverge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
ac96cc4d1c |
mm: make folio_pte_batch available outside of mm/memory.c
madvise, mprotect and some others might need folio_pte_batch to check if a range of PTEs are completely mapped to a large folio with contiguous physical addresses. Let's make it available in mm/internal.h. While at it, add proper kernel doc and sanity-check more input parameters using two additional VM_WARN_ON_FOLIO(). [21cnbao@gmail.com: build fix] Link: https://lkml.kernel.org/r/CAGsJ_4wWzG-37D82vqP_zt+Fcbz+URVe5oXLBc4M5wbN8A_gpQ@mail.gmail.com [david@redhat.com: improve the doc for the exported func] Link: https://lkml.kernel.org/r/20240227104201.337988-1-21cnbao@gmail.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
29cfe7556b |
mm: constify more page/folio tests
Constify the flag tests that aren't automatically generated and the tests that look like flag tests but are more complicated. Link: https://lkml.kernel.org/r/20240227192337.757313-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b3a3203309 |
mm: make dump_page() take a const argument
Now that __dump_page() takes a const argument, we can make dump_page() take a const struct page too. Link: https://lkml.kernel.org/r/20240227192337.757313-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fae7d834c4 |
mm: add __dump_folio()
Turn __dump_page() into a wrapper around __dump_folio(). Snapshot the page & folio into a stack variable so we don't hit BUG_ON() if an allocation is freed under us and what was a folio pointer becomes a pointer to a tail page. [willy@infradead.org: fix build issue] Link: https://lkml.kernel.org/r/ZeAKCyTn_xS3O9cE@casper.infradead.org [willy@infradead.org: fix __dump_folio] Link: https://lkml.kernel.org/r/ZeJJegP8zM7S9GTy@casper.infradead.org [willy@infradead.org: fix pointer confusion] Link: https://lkml.kernel.org/r/ZeYa00ixxC4k1ot-@casper.infradead.org [akpm@linux-foundation.org: s/printk/pr_warn/] Link: https://lkml.kernel.org/r/20240227192337.757313-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
b78b27d029 |
hugetlb: parallelize 1G hugetlb initialization
Optimizing the initialization speed of 1G huge pages through
parallelization.
1G hugetlbs are allocated from bootmem, a process that is already very
fast and does not currently require optimization. Therefore, we focus on
parallelizing only the initialization phase in `gather_bootmem_prealloc`.
Here are some test results:
test case no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2T(4 node) 1G 4745 2024 57.34%
128c1T(2 node) 1G 3358 1712 49.02%
12T 1G 77000 18300 76.23%
[akpm@linux-foundation.org: s/initialied/initialized/, per Alexey]
Link: https://lkml.kernel.org/r/20240222140422.393911-9-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
c6c21c31d0 |
hugetlb: parallelize 2M hugetlb allocation and initialization
By distributing both the allocation and the initialization tasks across
multiple threads, the initialization of 2M hugetlb will be faster, thereby
improving the boot speed.
Here are some test results:
test case no patch(ms) patched(ms) saved
------------------- -------------- ------------- --------
256c2T(4 node) 2M 3336 1051 68.52%
128c1T(2 node) 2M 1943 716 63.15%
Link: https://lkml.kernel.org/r/20240222140422.393911-8-gang.li@linux.dev
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
eb52286634 |
Author: Gang Li padata: dispatch works on
different nodes Date: Thu, 22 Feb 2024 22:04:17 +0800 When a group of tasks that access different nodes are scheduled on the same node, they may encounter bandwidth bottlenecks and access latency. Thus, numa_aware flag is introduced here, allowing tasks to be distributed across different nodes to fully utilize the advantage of multi-node systems. Link: https://lkml.kernel.org/r/20240222140422.393911-5-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
2e73ff236e |
hugetlb: pass *next_nid_to_alloc directly to for_each_node_mask_to_alloc
With parallelization of hugetlb allocation across different threads, each thread works on a differnet node to allocate pages from, instead of all allocating from a common node h->next_nid_to_alloc. To address this, it's necessary to assign a separate next_nid_to_alloc for each thread. Consequently, the hstate_next_node_to_alloc and for_each_node_mask_to_alloc have been modified to directly accept a *next_nid_to_alloc parameter, ensuring thread-specific allocation and avoiding concurrent access issues. Link: https://lkml.kernel.org/r/20240222140422.393911-4-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
d5c3eb3f50 |
hugetlb: split hugetlb_hstate_alloc_pages
1G and 2M huge pages have different allocation and initialization logic, which leads to subtle differences in parallelization. Therefore, it is appropriate to split hugetlb_hstate_alloc_pages into gigantic and non-gigantic. This patch has no functional changes. Link: https://lkml.kernel.org/r/20240222140422.393911-3-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
fc37bbb328 |
hugetlb: code clean for hugetlb_hstate_alloc_pages
Patch series "hugetlb: parallelize hugetlb page init on boot", v6. Introduction ------------ Hugetlb initialization during boot takes up a considerable amount of time. For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2 seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel host takes more than 1 minute[1]. This is a noteworthy figure. Inspired by [2] and [3], hugetlb initialization can also be accelerated through parallelization. Kernel already has infrastructure like padata_do_multithreaded, this patch uses it to achieve effective results by minimal modifications. [1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/ [2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/ [3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/ [4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@google.com/ max_threads ----------- This patch use `padata_do_multithreaded` like this: ``` job.max_threads = num_node_state(N_MEMORY) * multiplier; padata_do_multithreaded(&job); ``` To fully utilize the CPU, the number of parallel threads needs to be carefully considered. `max_threads = num_node_state(N_MEMORY)` does not fully utilize the CPU, so we need to multiply it by a multiplier. Tests below indicate that a multiplier of 2 significantly improves performance, and although larger values also provide improvements, the gains are marginal. multiplier 1 2 3 4 5 ------------ ------- ------- ------- ------- ------- 256G 2node 358ms 215ms 157ms 134ms 126ms 2T 4node 979ms 679ms 543ms 489ms 481ms 50G 2node 71ms 44ms 37ms 30ms 31ms Therefore, choosing 2 as the multiplier strikes a good balance between enhancing parallel processing capabilities and maintaining efficient resource management. Test result ----------- test case no patch(ms) patched(ms) saved ------------------- -------------- ------------- -------- 256c2T(4 node) 1G 4745 2024 57.34% 128c1T(2 node) 1G 3358 1712 49.02% 12T 1G 77000 18300 76.23% 256c2T(4 node) 2M 3336 1051 68.52% 128c1T(2 node) 2M 1943 716 63.15% This patch (of 8): The readability of `hugetlb_hstate_alloc_pages` is poor. By cleaning the code, its readability can be improved, facilitating future modifications. This patch extracts two functions to reduce the complexity of `hugetlb_hstate_alloc_pages` and has no functional changes. - hugetlb_hstate_alloc_pages_node_specific() to handle iterates through each online node and performs allocation if necessary. - hugetlb_hstate_alloc_pages_report() report error during allocation. And the value of h->max_huge_pages is updated accordingly. Link: https://lkml.kernel.org/r/20240222140422.393911-1-gang.li@linux.dev Link: https://lkml.kernel.org/r/20240222140422.393911-2-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
|
|
3e49a866c9 |
mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.
There are various users of get_vm_area() + ioremap_page_range() APIs. Enforce that get_vm_area() was requested as VM_IOREMAP type and range passed to ioremap_page_range() matches created vm_area to avoid accidentally ioremap-ing into wrong address range. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20240305030516.41519-2-alexei.starovoitov@gmail.com |
||
|
|
a0727489ac |
net: introduce page_frag_cache_drain()
When draining a page_frag_cache, most user are doing the similar steps, so introduce an API to avoid code duplication. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
4bc0d63a23 |
page_frag: unify gfp bits for order 3 page allocation
Currently there seems to be three page frag implementations which all try to allocate order 3 page, if that fails, it then fail back to allocate order 0 page, and each of them all allow order 3 page allocation to fail under certain condition by using specific gfp bits. The gfp bits for order 3 page allocation are different between different implementation, __GFP_NOMEMALLOC is or'd to forbid access to emergency reserves memory for __page_frag_cache_refill(), but it is not or'd in other implementions, __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used between different implementions by or'ing __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid possible pressure for mm. Leave the gfp unifying for page frag implementation in sock.c for now as suggested by Paolo Abeni. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
411c5f3680 |
mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an argument, and they are thin wrappers around the __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs doing the alignment checking and align mask conversion, in order to call page_frag_alloc_align() directly. The intention here is to keep the alignment checking and the alignmask conversion in in-line wrapper to avoid those kind of operations during execution time since it can usually be handled during compile time. We are going to use page_frag_alloc_align() in vhost_net.c, it need the same kind of alignment checking and alignmask conversion, so split up page_frag_alloc_align into an inline wrapper doing the above operation, and add __page_frag_alloc_align() which is passed with the align mask the original function expected as suggested by Alexander. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
|
|
fae1b01293 |
slab: remove PARTIAL_NODE slab_state
The PARTIAL_NODE slab_state has gone with SLAB removed, so just remove it. Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
||
|
|
26e93839d6 |
mm/zsmalloc: don't need to reserve LSB in handle
We will save allocated tag in the object header to indicate that it's allocated. handle |= OBJ_ALLOCATED_TAG; So the object header needs to reserve LSB for this tag bit. But the handle itself doesn't need to reserve LSB to save tag, since it's only used to find the position of object, by (pfn + obj_idx). So remove LSB reserve from handle, one more bit can be used as obj_idx. Link: https://lkml.kernel.org/r/20240228023854.3511239-1-chengming.zhou@linux.dev Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |