mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
synced 2025-08-15 09:35:19 +00:00
master
566 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
da23ea194d |
Significant patch series in this pull request:
- The 4 patch series "mseal cleanups" from Lorenzo Stoakes erforms some mseal cleaning with no intended functional change. - The 3 patch series "Optimizations for khugepaged" from David Hildenbrand improves khugepaged throughput by batching PTE operations for large folios. This gain is mainly for arm64. - The 8 patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" from Mike Rapoport provides a bugfix, additional debug code and cleanups to the execmem code. - The 7 patch series "mm/shmem, swap: bugfix and improvement of mTHP swap in" from Kairui Song provides bugfixes, cleanups and performance improvememnts to the mTHP swapin code. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaI+6HQAKCRDdBJ7gKXxA jv7lAQCAKE5dUhdZ0pOYbhBKTlDapQh2KqHrlV3QFcxXgknEoQD/c3gG01rY3fLh Cnf5l9+cdyfKxFniO48sUPx6IpriRg8= =HT5/ -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: "Significant patch series in this pull request: - "mseal cleanups" (Lorenzo Stoakes) Some mseal cleaning with no intended functional change. - "Optimizations for khugepaged" (David Hildenbrand) Improve khugepaged throughput by batching PTE operations for large folios. This gain is mainly for arm64. - "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes" (Mike Rapoport) A bugfix, additional debug code and cleanups to the execmem code. - "mm/shmem, swap: bugfix and improvement of mTHP swap in" (Kairui Song) Bugfixes, cleanups and performance improvememnts to the mTHP swapin code" * tag 'mm-stable-2025-08-03-12-35' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (38 commits) mm: mempool: fix crash in mempool_free() for zero-minimum pools mm: correct type for vmalloc vm_flags fields mm/shmem, swap: fix major fault counting mm/shmem, swap: rework swap entry and index calculation for large swapin mm/shmem, swap: simplify swapin path and result handling mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO mm/shmem, swap: tidy up swap entry splitting mm/shmem, swap: tidy up THP swapin checks mm/shmem, swap: avoid redundant Xarray lookup during swapin x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations execmem: drop writable parameter from execmem_fill_trapping_insns() execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP) execmem: move execmem_force_rw() and execmem_restore_rox() before use execmem: rework execmem_cache_free() execmem: introduce execmem_alloc_rw() execmem: drop unused execmem_update_copy() mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped mm/rmap: add anon_vma lifetime debug check mm: remove mm/io-mapping.c ... |
||
![]() |
9109bd5255 |
mm/memory-failure: hold PTL in hwpoison_hugetlb_range
Hold PTL in hwpoison_hugetlb_range() to avoid operating on stale page, as hwpoison_pte_range() have done. This change is not known to address any issues which users have experienced. Link: https://lkml.kernel.org/r/20250725033112.2690158-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joern Engel <joern@logfs.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
beace86e61 |
Summary of significant series in this pull request:
- The 4 patch series "mm: ksm: prevent KSM from breaking merging of new VMAs" from Lorenzo Stoakes addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. - The 4 patch series "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" from SeongJae Park adds a new kernel module which simplifies the setup and usage of DAMON in production environments. - The 6 patch series "stop passing a writeback_control to swap/shmem writeout" from Christoph Hellwig is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. - The 7 patch series "drivers/base/node.c: optimization and cleanups" from Donet Tom contains largely uncorrelated cleanups to the NUMA node setup and management code. - The 4 patch series "mm: userfaultfd: assorted fixes and cleanups" from Tal Zussman does some maintenance work on the userfaultfd code. - The 5 patch series "Readahead tweaks for larger folios" from Ryan Roberts implements some tuneups for pagecache readahead when it is reading into order>0 folios. - The 4 patch series "selftests/mm: Tweaks to the cow test" from Mark Brown provides some cleanups and consistency improvements to the selftests code. - The 4 patch series "Optimize mremap() for large folios" from Dev Jain does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. - The 5 patch series "Remove zero_user()" from Matthew Wilcox expunges zero_user() in favor of the more modern memzero_page(). - The 3 patch series "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" from David Hildenbrand addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. - The 3 patch series "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" from SeongJae Park provides some cleanup and consolidation work in DAMON. - The 3 patch series "use vm_flags_t consistently" from Lorenzo Stoakes uses vm_flags_t in places where we were inappropriately using other types. - The 3 patch series "mm/memfd: Reserve hugetlb folios before allocation" from Vivek Kasireddy increases the reliability of large page allocation in the memfd code. - The 14 patch series "mm: Remove pXX_devmap page table bit and pfn_t type" from Alistair Popple removes several now-unneeded PFN_* flags. - The 5 patch series "mm/damon: decouple sysfs from core" from SeongJae Park implememnts some cleanup and maintainability work in the DAMON sysfs layer. - The 5 patch series "madvise cleanup" from Lorenzo Stoakes does quite a lot of cleanup/maintenance work in the madvise() code. - The 4 patch series "madvise anon_name cleanups" from Vlastimil Babka provides additional cleanups on top or Lorenzo's effort. - The 11 patch series "Implement numa node notifier" from Oscar Salvador creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. - The 6 patch series "Make MIGRATE_ISOLATE a standalone bit" from Zi Yan cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. - The 5 patch series "selftests/damon: add python and drgn based DAMON sysfs functionality tests" from SeongJae Park adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. - The 5 patch series "Misc rework on hugetlb faulting path" from Oscar Salvador fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. - The 3 patch series "cma: factor out allocation logic from __cma_declare_contiguous_nid" from Mike Rapoport rationalizes and cleans up the highmem-specific code in the CMA allocator. - The 28 patch series "mm/migration: rework movable_ops page migration (part 1)" from David Hildenbrand provides cleanups and future-preparedness to the migration code. - The 2 patch series "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" from SeongJae Park adds some tracepoints to some DAMON auto-tuning code. - The 6 patch series "mm/damon: fix misc bugs in DAMON modules" from SeongJae Park does that. - The 6 patch series "mm/damon: misc cleanups" from SeongJae Park also does what it claims. - The 4 patch series "mm: folio_pte_batch() improvements" from David Hildenbrand cleans up the large folio PTE batching code. - The 13 patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" from SeongJae Park facilitates dynamic alteration of DAMON's inter-node allocation policy. - The 3 patch series "Remove unmap_and_put_page()" from Vishal Moola provides a couple of page->folio conversions. - The 4 patch series "mm: per-node proactive reclaim" from Davidlohr Bueso implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. - The 14 patch series "mm/damon: remove damon_callback" from SeongJae Park replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. - The 10 patch series "mm/mremap: permit mremap() move of multiple VMAs" from Lorenzo Stoakes implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. - The 3 patch series "drop hugetlb_free_pgd_range()" from Anthony Yznaga switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). - The 4 patch series "mm/damon/sysfs: support periodic and automated stats update" from SeongJae Park augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. - The 4 patch series "Some randome fixes and cleanups to swapfile" from Kemeng Shi does what is claims. - The 4 patch series "mm: introduce snapshot_page" from Luiz Capitulino and David Hildenbrand provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. - The 6 patch series "use per-vma locks for /proc/pid/maps reads" from Suren Baghdasaryan addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. - The 6 patch series "__folio_split() clean up" from Zi Yan cleans up __folio_split()! - The 7 patch series "Optimize mprotect() for large folios" from Dev Jain provides some quite large (>3x) speedups to mprotect() when dealing with large folios. - The 2 patch series "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" from wang lian does some cleanup work in the selftests code. - The 3 patch series "tools/testing: expand mremap testing" from Lorenzo Stoakes extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. - The 22 patch series "selftests/damon/sysfs.py: test all parameters" from SeongJae Park extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaIqcCgAKCRDdBJ7gKXxA jkVBAQCCn9DR1QP0CRk961ot0cKzOgioSc0aA03DPb2KXRt2kQEAzDAz0ARurFhL 8BzbvI0c+4tntHLXvIlrC33n9KWAOQM= =XsFy -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "As usual, many cleanups. The below blurbiage describes 42 patchsets. 21 of those are partially or fully cleanup work. "cleans up", "cleanup", "maintainability", "rationalizes", etc. I never knew the MM code was so dirty. "mm: ksm: prevent KSM from breaking merging of new VMAs" (Lorenzo Stoakes) addresses an issue with KSM's PR_SET_MEMORY_MERGE mode: newly mapped VMAs were not eligible for merging with existing adjacent VMAs. "mm/damon: introduce DAMON_STAT for simple and practical access monitoring" (SeongJae Park) adds a new kernel module which simplifies the setup and usage of DAMON in production environments. "stop passing a writeback_control to swap/shmem writeout" (Christoph Hellwig) is a cleanup to the writeback code which removes a couple of pointers from struct writeback_control. "drivers/base/node.c: optimization and cleanups" (Donet Tom) contains largely uncorrelated cleanups to the NUMA node setup and management code. "mm: userfaultfd: assorted fixes and cleanups" (Tal Zussman) does some maintenance work on the userfaultfd code. "Readahead tweaks for larger folios" (Ryan Roberts) implements some tuneups for pagecache readahead when it is reading into order>0 folios. "selftests/mm: Tweaks to the cow test" (Mark Brown) provides some cleanups and consistency improvements to the selftests code. "Optimize mremap() for large folios" (Dev Jain) does that. A 37% reduction in execution time was measured in a memset+mremap+munmap microbenchmark. "Remove zero_user()" (Matthew Wilcox) expunges zero_user() in favor of the more modern memzero_page(). "mm/huge_memory: vmf_insert_folio_*() and vmf_insert_pfn_pud() fixes" (David Hildenbrand) addresses some warts which David noticed in the huge page code. These were not known to be causing any issues at this time. "mm/damon: use alloc_migrate_target() for DAMOS_MIGRATE_{HOT,COLD" (SeongJae Park) provides some cleanup and consolidation work in DAMON. "use vm_flags_t consistently" (Lorenzo Stoakes) uses vm_flags_t in places where we were inappropriately using other types. "mm/memfd: Reserve hugetlb folios before allocation" (Vivek Kasireddy) increases the reliability of large page allocation in the memfd code. "mm: Remove pXX_devmap page table bit and pfn_t type" (Alistair Popple) removes several now-unneeded PFN_* flags. "mm/damon: decouple sysfs from core" (SeongJae Park) implememnts some cleanup and maintainability work in the DAMON sysfs layer. "madvise cleanup" (Lorenzo Stoakes) does quite a lot of cleanup/maintenance work in the madvise() code. "madvise anon_name cleanups" (Vlastimil Babka) provides additional cleanups on top or Lorenzo's effort. "Implement numa node notifier" (Oscar Salvador) creates a standalone notifier for NUMA node memory state changes. Previously these were lumped under the more general memory on/offline notifier. "Make MIGRATE_ISOLATE a standalone bit" (Zi Yan) cleans up the pageblock isolation code and fixes a potential issue which doesn't seem to cause any problems in practice. "selftests/damon: add python and drgn based DAMON sysfs functionality tests" (SeongJae Park) adds additional drgn- and python-based DAMON selftests which are more comprehensive than the existing selftest suite. "Misc rework on hugetlb faulting path" (Oscar Salvador) fixes a rather obscure deadlock in the hugetlb fault code and follows that fix with a series of cleanups. "cma: factor out allocation logic from __cma_declare_contiguous_nid" (Mike Rapoport) rationalizes and cleans up the highmem-specific code in the CMA allocator. "mm/migration: rework movable_ops page migration (part 1)" (David Hildenbrand) provides cleanups and future-preparedness to the migration code. "mm/damon: add trace events for auto-tuned monitoring intervals and DAMOS quota" (SeongJae Park) adds some tracepoints to some DAMON auto-tuning code. "mm/damon: fix misc bugs in DAMON modules" (SeongJae Park) does that. "mm/damon: misc cleanups" (SeongJae Park) also does what it claims. "mm: folio_pte_batch() improvements" (David Hildenbrand) cleans up the large folio PTE batching code. "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold} actions" (SeongJae Park) facilitates dynamic alteration of DAMON's inter-node allocation policy. "Remove unmap_and_put_page()" (Vishal Moola) provides a couple of page->folio conversions. "mm: per-node proactive reclaim" (Davidlohr Bueso) implements a per-node control of proactive reclaim - beyond the current memcg-based implementation. "mm/damon: remove damon_callback" (SeongJae Park) replaces the damon_callback interface with a more general and powerful damon_call()+damos_walk() interface. "mm/mremap: permit mremap() move of multiple VMAs" (Lorenzo Stoakes) implements a number of mremap cleanups (of course) in preparation for adding new mremap() functionality: newly permit the remapping of multiple VMAs when the user is specifying MREMAP_FIXED. It still excludes some specialized situations where this cannot be performed reliably. "drop hugetlb_free_pgd_range()" (Anthony Yznaga) switches some sparc hugetlb code over to the generic version and removes the thus-unneeded hugetlb_free_pgd_range(). "mm/damon/sysfs: support periodic and automated stats update" (SeongJae Park) augments the present userspace-requested update of DAMON sysfs monitoring files. Automatic update is now provided, along with a tunable to control the update interval. "Some randome fixes and cleanups to swapfile" (Kemeng Shi) does what is claims. "mm: introduce snapshot_page" (Luiz Capitulino and David Hildenbrand) provides (and uses) a means by which debug-style functions can grab a copy of a pageframe and inspect it locklessly without tripping over the races inherent in operating on the live pageframe directly. "use per-vma locks for /proc/pid/maps reads" (Suren Baghdasaryan) addresses the large contention issues which can be triggered by reads from that procfs file. Latencies are reduced by more than half in some situations. The series also introduces several new selftests for the /proc/pid/maps interface. "__folio_split() clean up" (Zi Yan) cleans up __folio_split()! "Optimize mprotect() for large folios" (Dev Jain) provides some quite large (>3x) speedups to mprotect() when dealing with large folios. "selftests/mm: reuse FORCE_READ to replace "asm volatile("" : "+r" (XXX));" and some cleanup" (wang lian) does some cleanup work in the selftests code. "tools/testing: expand mremap testing" (Lorenzo Stoakes) extends the mremap() selftest in several ways, including adding more checking of Lorenzo's recently added "permit mremap() move of multiple VMAs" feature. "selftests/damon/sysfs.py: test all parameters" (SeongJae Park) extends the DAMON sysfs interface selftest so that it tests all possible user-requested parameters. Rather than the present minimal subset" * tag 'mm-stable-2025-07-30-15-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (370 commits) MAINTAINERS: add missing headers to mempory policy & migration section MAINTAINERS: add missing file to cgroup section MAINTAINERS: add MM MISC section, add missing files to MISC and CORE MAINTAINERS: add missing zsmalloc file MAINTAINERS: add missing files to page alloc section MAINTAINERS: add missing shrinker files MAINTAINERS: move memremap.[ch] to hotplug section MAINTAINERS: add missing mm_slot.h file THP section MAINTAINERS: add missing interval_tree.c to memory mapping section MAINTAINERS: add missing percpu-internal.h file to per-cpu section mm/page_alloc: remove trace_mm_alloc_contig_migrate_range_info() selftests/damon: introduce _common.sh to host shared function selftests/damon/sysfs.py: test runtime reduction of DAMON parameters selftests/damon/sysfs.py: test non-default parameters runtime commit selftests/damon/sysfs.py: generalize DAMON context commit assertion selftests/damon/sysfs.py: generalize monitoring attributes commit assertion selftests/damon/sysfs.py: generalize DAMOS schemes commit assertion selftests/damon/sysfs.py: test DAMOS filters commitment selftests/damon/sysfs.py: generalize DAMOS scheme commit assertion selftests/damon/sysfs.py: test DAMOS destinations commitment ... |
||
![]() |
9bbf8e17d8 |
ACPI updates for 6.17-rc1
- Printing the address in acpi_ex_trace_point() is either incorrect during early kernel boot or not really useful later when pathnames resolve properly, so stop doing it (Mario Limonciello) - Address several minor issues in the legacy ACPI proc interface (Andy Shevchenko) - Fix acpi_object union initialization in the ACPI processor driver to avoid using memory that contains leftover data (Sebastian Ott) - Make the ACPI processor perflib driver take the initial _PPC limit into account as appropriate (Jiayi Li) - Fix message formatting in the ACPI processor throttling driver and in the ACPI PCI link driver (Colin Ian King) - Clean up general ACPI PM domain handling (Rafael Wysocki) - Fix iomem-related sparse warnings in the APEI EINJ driver (Zaid Alali, Tony Luck) - Add EINJv2 error injection support to the APEI EINJ driver (Zaid Alali) - Fix memory corruption in error_type_set() in the APEI EINJ driver (Dan Carpenter) - Fix less than zero comparison on a size_t variable in the APEI EINJ driver (Colin Ian King) - Fix check and iounmap of an uninitialized pointer in the APEI EINJ driver (Colin Ian King) - Add TAINT_MACHINE_CHECK to the GHES panic path in APEI to improve diagnostics and post-mortem analysis (Breno Leitao) - Update APEI reviewer records and other ACPI-related information in MAINTAINERS as well as the contact information in the ACPI ABI documentation (Rafael Wysocki) - Fix the handling of synchronous uncorrected memory errors in APEI (Shuai Xue) - Remove an AudioDSP-related ID from the ACPI LPSS driver (Andy Shevchenko) - Replace sprintf()/scnprintf() with sysfs_emit() in the ACPI fan driver and update a debug message in fan_get_state_acpi4() (Eslam Khafagy, Abdelrahman Fekry, Sumeet Pawnikar) - Add Intel Wildcat Lake support to the ACPI DPTF driver (Srinivas Pandruvada) - Add more debug information regarding failing firmware updates to the ACPI pfr_update driver (Chen Yu) - Reduce the verbosity of the ACPI PRM (platform runtime mechanism) driver to avoid user confusion (Zhu Qiyu) - Replace sprintf() with sysfs_emit() in the ACPI TAD (time and alarm device) driver (Sukrut Heroorkar) - Enable CONFIG_ACPI_DEBUG by default to make it easier to get ACPI debug messages from OEM platforms (Mario Limonciello) - Fix parent device references in ASL examples in the ACPI documentation and fix spelling and style in the gpio-properties documentation in firmware-guide (Andy Shevchenko) - Fix typos in ACPI documentation and comments (Bjorn Helgaas) -----BEGIN PGP SIGNATURE----- iQFGBAABCAAwFiEEcM8Aw/RY0dgsiRUR7l+9nS/U47UFAmh/qzQSHHJqd0Byand5 c29ja2kubmV0AAoJEO5fvZ0v1OO1wMgH/2vklBeGYjxSIIn0qfiv/SnSW5B1jEE4 5vDuCpafesm7tZtwFB9v2mRf/8SvmJey2jfYgGnBMBlTW0JUP8eCVpRASvx1SCGH QwJFN3GCs3IjIvT2KlXeDIyQdfITIl3SNTXwTFl/ezYT0vo7VBeFtofeL9szaxlP rM1KeLE7lksjY9djT8PRwxOQ+EzuJ8uUGXYcHu797u0x5XB9ZiBDuKqngDicQONI DaRU3zXwOKPkQhldFN+9HPsDPvm7+8f1yiOEWzLnToTDggQDH4x2wdXJOOpAogoN qIYBIZG90drNj9USsWZvp0q/fT3OVbuHLuruDGieOCYUByBYddY9WHM= =5h+J -----END PGP SIGNATURE----- Merge tag 'acpi-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These update APEI (new EINJv2 error injection, assorted fixes), fix the ACPI processor driver, update the legacy ACPI /proc interface (multiple assorted fixes of minor issues) and several assorted ACPI drivers (minor fixes and cleanups): - Printing the address in acpi_ex_trace_point() is either incorrect during early kernel boot or not really useful later when pathnames resolve properly, so stop doing it (Mario Limonciello) - Address several minor issues in the legacy ACPI proc interface (Andy Shevchenko) - Fix acpi_object union initialization in the ACPI processor driver to avoid using memory that contains leftover data (Sebastian Ott) - Make the ACPI processor perflib driver take the initial _PPC limit into account as appropriate (Jiayi Li) - Fix message formatting in the ACPI processor throttling driver and in the ACPI PCI link driver (Colin Ian King) - Clean up general ACPI PM domain handling (Rafael Wysocki) - Fix iomem-related sparse warnings in the APEI EINJ driver (Zaid Alali, Tony Luck) - Add EINJv2 error injection support to the APEI EINJ driver (Zaid Alali) - Fix memory corruption in error_type_set() in the APEI EINJ driver (Dan Carpenter) - Fix less than zero comparison on a size_t variable in the APEI EINJ driver (Colin Ian King) - Fix check and iounmap of an uninitialized pointer in the APEI EINJ driver (Colin Ian King) - Add TAINT_MACHINE_CHECK to the GHES panic path in APEI to improve diagnostics and post-mortem analysis (Breno Leitao) - Update APEI reviewer records and other ACPI-related information in MAINTAINERS as well as the contact information in the ACPI ABI documentation (Rafael Wysocki) - Fix the handling of synchronous uncorrected memory errors in APEI (Shuai Xue) - Remove an AudioDSP-related ID from the ACPI LPSS driver (Andy Shevchenko) - Replace sprintf()/scnprintf() with sysfs_emit() in the ACPI fan driver and update a debug message in fan_get_state_acpi4() (Eslam Khafagy, Abdelrahman Fekry, Sumeet Pawnikar) - Add Intel Wildcat Lake support to the ACPI DPTF driver (Srinivas Pandruvada) - Add more debug information regarding failing firmware updates to the ACPI pfr_update driver (Chen Yu) - Reduce the verbosity of the ACPI PRM (platform runtime mechanism) driver to avoid user confusion (Zhu Qiyu) - Replace sprintf() with sysfs_emit() in the ACPI TAD (time and alarm device) driver (Sukrut Heroorkar) - Enable CONFIG_ACPI_DEBUG by default to make it easier to get ACPI debug messages from OEM platforms (Mario Limonciello) - Fix parent device references in ASL examples in the ACPI documentation and fix spelling and style in the gpio-properties documentation in firmware-guide (Andy Shevchenko) - Fix typos in ACPI documentation and comments (Bjorn Helgaas)" * tag 'acpi-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits) ACPI: Fix typos ACPI/PCI: Remove space before newline ACPI: processor: throttling: Remove space before newline ACPI: processor: perflib: Fix initial _PPC limit application ACPI/PNP: Use my kernel.org address in MAINTAINERS and ABI docs ACPI: TAD: Replace sprintf() with sysfs_emit() ACPI: APEI: handle synchronous exceptions in task work ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered ACPI: APEI: MAINTAINERS: Update reviewers for APEI Documentation: ACPI: Fix parent device references ACPI: fan: Update debug message in fan_get_state_acpi4() ACPI: PRM: Reduce unnecessary printing to avoid user confusion ACPI: fan: Replace sprintf() with sysfs_emit() ACPI: APEI: EINJ: Fix trigger actions ACPI: processor: fix acpi_object initialization ACPI: APEI: GHES: add TAINT_MACHINE_CHECK on GHES panic path ACPI: LPSS: Remove AudioDSP related ID Documentation: firmware-guide: gpio-properties: Spelling and style fixes ACPI: fan: Replace sprintf()/scnprintf() with sysfs_emit() in show() functions ACPI: PM: Set .detach in acpi_general_pm_domain definition ... |
||
![]() |
9f1e8cd0b7 |
mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
In shrink_folio_list(), the hwpoisoned folio may be large folio, which
can't be handled by unmap_poisoned_folio(). For THP, try_to_unmap_one()
must be passed with TTU_SPLIT_HUGE_PMD to split huge PMD first and then
retry. Without TTU_SPLIT_HUGE_PMD, we will trigger null-ptr deref of
pvmw.pte. Even we passed TTU_SPLIT_HUGE_PMD, we will trigger a
WARN_ON_ONCE due to the page isn't in swapcache.
Since UCE is rare in real world, and race with reclaimation is more rare,
just skipping the hwpoisoned large folio is enough. memory_failure() will
handle it if the UCE is triggered again.
This happens when memory reclaim for large folio races with
memory_failure(), and will lead to kernel panic. The race is as
follows:
cpu0 cpu1
shrink_folio_list memory_failure
TestSetPageHWPoison
unmap_poisoned_folio
--> trigger BUG_ON due to
unmap_poisoned_folio couldn't
handle large folio
[tujinjiang@huawei.com: add comment to unmap_poisoned_folio()]
Link: https://lkml.kernel.org/r/69fd4e00-1b13-d5f7-1c82-705c7d977ea4@huawei.com
Link: https://lkml.kernel.org/r/20250627125747.3094074-2-tujinjiang@huawei.com
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Fixes:
|
||
![]() |
c1f1fda141 |
ACPI: APEI: handle synchronous exceptions in task work
The memory uncorrected error could be signaled by asynchronous interrupt
(specifically, SPI in arm64 platform), e.g. when an error is detected by
a background scrubber, or signaled by synchronous exception
(specifically, data abort exception in arm64 platform), e.g. when a CPU
tries to access a poisoned cache line. Currently, both synchronous and
asynchronous errors use memory_failure_queue() to schedule
memory_failure() to exectute in a kworker context.
As a result, when a user-space process is accessing a poisoned data, a
data abort is taken and the memory_failure() is executed in the kworker
context, which:
- will send wrong si_code by SIGBUS signal in early_kill mode, and
- can not kill the user-space in some cases resulting a synchronous
error infinite loop
Issue 1: send wrong si_code in early_kill mode
Since commit
|
||
![]() |
d4fb4587bd |
mm: rename __PageMovable() to page_has_movable_ops()
Let's make it clearer that we are talking about movable_ops pages. While at it, convert a VM_BUG_ON to a VM_WARN_ON_ONCE_PAGE. Link: https://lkml.kernel.org/r/20250704102524.326966-17-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
d2734f044f |
mm: memory-failure: enhance comments for return value of memory_failure()
The comments for the return value of memory_failure are not complete, supplement the comments. Link: https://lkml.kernel.org/r/20250312112852.82415-4-xueshuai@linux.alibaba.com Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
aaf99ac2ce |
mm/hwpoison: do not send SIGBUS to processes with recovered clean pages
When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. - Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. - Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). - Why user process is killed for instr case Commit |
||
![]() |
38607c62b3 |
fs/dax: properly refcount fs dax pages
Currently fs dax pages are considered free when the refcount drops to one and their refcounts are not increased when mapped via PTEs or decreased when unmapped. This requires special logic in mm paths to detect that these pages should not be properly refcounted, and to detect when the refcount drops to one instead of zero. On the other hand get_user_pages(), etc. will properly refcount fs dax pages by taking a reference and dropping it when the page is unpinned. Tracking this special behaviour requires extra PTE bits (eg. pte_devmap) and introduces rules that are potentially confusing and specific to FS DAX pages. To fix this, and to possibly allow removal of the special PTE bits in future, convert the fs dax page refcounts to be zero based and instead take a reference on the page each time it is mapped as is currently the case for normal pages. This may also allow a future clean-up to remove the pgmap refcounting that is currently done in mm/gup.c. Link: https://lkml.kernel.org/r/c7d886ad7468a20452ef6e0ddab6cfe220874e7c.1740713401.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Asahi Lina <lina@asahilina.net> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Michael "Camp Drill Sergeant" Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b81679b163 |
mm: memory-failure: update ttu flag inside unmap_poisoned_folio
Patch series "mm: memory_failure: unmap poisoned folio during migrate properly", v3. Fix two bugs during folio migration if the folio is poisoned. This patch (of 3): Commit |
||
![]() |
1751f872cc |
treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit
|
||
![]() |
408a8dc623 |
mm/memory-failure: replace sprintf() with sysfs_emit()
As Documentation/filesystems/sysfs.rst suggested, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Link: https://lkml.kernel.org/r/20241029101853.37890-1-zhangguopeng@kylinos.cn Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
68158bfa3d |
mm: mass constification of folio/page pointers
Now that page_pgoff() takes const pointers, we can constify the pointers to a lot of functions. Link: https://lkml.kernel.org/r/20241005200121.3231142-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
713da0b33b |
mm: renovate page_address_in_vma()
This function doesn't modify any of its arguments, so if we make a few other functions take const pointers, we can make page_address_in_vma() take const pointers too. All of its callers have the containing folio already, so pass that in as an argument instead of recalculating it. Also add kernel-doc Link: https://lkml.kernel.org/r/20241005200121.3231142-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f7470591f8 |
mm: convert page_to_pgoff() to page_pgoff()
Patch series "page->index removals in mm", v2. As part of shrinking struct page, we need to stop using page->index. This patchset gets rid of most of the remaining references to page->index in mm, as well as increasing the number of functions which take a const folio/page pointer. It shrinks the text segment of mm by a few hundred bytes in my test config, probably mostly from removing calls to compound_head() in page_to_pgoff(). This patch (of 7): Change the function signature to pass in the folio as all three callers have it. This removes a reference to page->index, which we're trying to get rid of. And add kernel-doc. Link: https://lkml.kernel.org/r/20241005200121.3231142-1-willy@infradead.org Link: https://lkml.kernel.org/r/20241005200121.3231142-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
f1264e9531 |
mm: migrate: add isolate_folio_to_list()
Add isolate_folio_to_list() helper to try to isolate HugeTLB, no-LRU movable and LRU folios to a list, which will be reused by do_migrate_range() from memory hotplug soon, also drop the mf_isolate_folio() since we could directly use new helper in the soft_offline_in_use_page(). Link: https://lkml.kernel.org/r/20240827114728.3212578-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
16038c4fff |
mm: memory-failure: add unmap_poisoned_folio()
Add unmap_poisoned_folio() helper which will be reused by do_migrate_range() from memory hotplug soon. [akpm@linux-foundation.org: whitespace tweak, per Miaohe Lin] Link: https://lkml.kernel.org/r/1f80c7e3-c30d-1ac1-6a36-d1a5f5907f7c@huawei.com Link: https://lkml.kernel.org/r/20240827114728.3212578-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
d75abd0d0b |
mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu
The memory_failure_cpu structure is a per-cpu structure. Access to its
content requires the use of get_cpu_var() to lock in the current CPU and
disable preemption. The use of a regular spinlock_t for locking purpose
is fine for a non-RT kernel.
Since the integration of RT spinlock support into the v5.15 kernel, a
spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping
lock in a preemption disabled context is illegal resulting in the
following kind of warning.
[12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
[12135.732252] preempt_count: 1, expected: 0
[12135.732255] RCU nest depth: 2, expected: 2
:
[12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
[12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
[12135.732433] Call Trace:
[12135.732436] <TASK>
[12135.732450] dump_stack_lvl+0x57/0x81
[12135.732461] __might_resched.cold+0xf4/0x12f
[12135.732479] rt_spin_lock+0x4c/0x100
[12135.732491] memory_failure_queue+0x40/0xe0
[12135.732503] ghes_do_memory_failure+0x53/0x390
[12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0
[12135.732575] ghes_proc+0xf9/0x1a0
[12135.732591] ghes_notify_hed+0x6a/0x150
[12135.732602] notifier_call_chain+0x43/0xb0
[12135.732626] blocking_notifier_call_chain+0x43/0x60
[12135.732637] acpi_ev_notify_dispatch+0x47/0x70
[12135.732648] acpi_os_execute_deferred+0x13/0x20
[12135.732654] process_one_work+0x41f/0x500
[12135.732695] worker_thread+0x192/0x360
[12135.732715] kthread+0x111/0x140
[12135.732733] ret_from_fork+0x29/0x50
[12135.732779] </TASK>
Fix it by using a raw_spinlock_t for locking instead.
Also move the pr_err() out of the lock critical section and after
put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep
with this call.
[longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe]
Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com
Fixes:
|
||
![]() |
8a78882dac |
mm/memory-failure: remove obsolete MF_MSG_DIFFERENT_COMPOUND
The page cannot become compound pages again just after a folio is split as an extra refcnt is held. So the MF_MSG_DIFFERENT_COMPOUND case is obsolete and can be removed to get rid of this false assumption and code burden. But add one WARN_ON() here to keep the situation clear. Link: https://lkml.kernel.org/r/20240708030544.196919-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
e6c0c03245 |
mm: provide mm_struct and address to huge_ptep_get()
On powerpc 8xx huge_ptep_get() will need to know whether the given ptep is a PTE entry or a PMD entry. This cannot be known with the PMD entry itself because there is no easy way to know it from the content of the entry. So huge_ptep_get() will need to know either the size of the page or get the pmd. In order to be consistent with huge_ptep_get_and_clear(), give mm and address to huge_ptep_get(). Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
56374430c5 |
mm/memory-failure: userspace controls soft-offlining pages
Correctable memory errors are very common on servers with large amount of memory, and are corrected by ECC. Soft offline is kernel's additional recovery handling for memory pages having (excessive) corrected memory errors. Impacted page is migrated to a healthy page if inuse; the original page is discarded for any future use. The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in case of an 1G HugeTLB page. Soft-offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace has not acknowledged such behavior, it may be surprised when later failed to mmap hugepages due to lack of hugepages. In case of a transparent hugepage, it will be split into 4K pages as well; userspace will stop enjoying the transparent performance. In addition, discarding the entire 1G HugeTLB page only because of corrected memory errors sounds very costly and kernel better not doing under the hood. But today there are at least 2 such cases doing so: 1. when GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER. 2. RAS Correctable Errors Collector counts correctable errors per PFN and when the counter for a PFN reaches threshold In both cases, userspace has no control of the soft offline performed by kernel's memory failure recovery. This commit gives userspace the control of softofflining any page: kernel only soft offlines raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl at /proc/sys/vm/enable_soft_offline. By default its value is set to 1 to preserve existing behavior in kernel. When set to 0, soft-offline (e.g. MADV_SOFT_OFFLINE) will fail with EOPNOTSUPP. [jiaqiyan@google.com: v7] Link: https://lkml.kernel.org/r/20240628205958.2845610-3-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20240626050818.2277273-3-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
865319f772 |
mm/memory-failure: refactor log format in soft offline code
Patch series "Userspace controls soft-offline pages", v6. Correctable memory errors are very common on servers with large amount of memory, and are corrected by ECC, but with two pain points to users: 1. Correction usually happens on the fly and adds latency overhead 2. Not-fully-proved theory states excessive correctable memory errors can develop into uncorrectable memory error. Soft offline is kernel's additional solution for memory pages having (excessive) corrected memory errors. Impacted page is migrated to healthy page if it is in use, then the original page is discarded for any future use. The actual policy on whether (and when) to soft offline should be maintained by userspace, especially in case of an 1G HugeTLB page. Soft-offline dissolves the HugeTLB page, either in-use or free, into chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace has not acknowledged such behavior, it may be surprised when later mmap hugepages MAP_FAILED due to lack of hugepages. In case of a transparent hugepage, it will be split into 4K pages as well; userspace will stop enjoying the transparent performance. In addition, discarding the entire 1G HugeTLB page only because of corrected memory errors sounds very costly and kernel better not doing under the hood. But today there are at least 2 such cases: 1. GHES driver sees both GHES_SEV_CORRECTED and CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER. 2. RAS Correctable Errors Collector counts correctable errors per PFN and when the counter for a PFN reaches threshold In both cases, userspace has no control of the soft offline performed by kernel's memory failure recovery. This patch series give userspace the control of softofflining any page: kernel only soft offlines raw page / transparent hugepage / HugeTLB hugepage if userspace has agreed to. The interface to userspace is a new sysctl called enable_soft_offline under /proc/sys/vm. By default enable_soft_line is 1 to preserve existing behavior in kernel. This patch (of 4): Logs from soft_offline_page and soft_offline_in_use_page have different formats than majority of the memory failure code: "Memory failure: 0x${pfn}: ${lower_case_message}" Convert them to the following format: "Soft offline: 0x${pfn}: ${lower_case_message}" No functional change in this commit. Link: https://lkml.kernel.org/r/20240626050818.2277273-1-jiaqiyan@google.com Link: https://lkml.kernel.org/r/20240626050818.2277273-2-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Lance Yang <ioworker0@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5cea5666e4 |
mm/memory-failure: refactor log format in unpoison_memory
Logs from memory_failure and other memory-failure.c code follow the format: "Memory failure: 0x{pfn}: ${lower_case_message}" Convert the logs in unpoison_memory to follow similar format: "Unpoison: 0x${pfn}: ${lower_case_message}" For example (from local test): [ 1331.938397] Unpoison: 0x144bc8: page was already unpoisoned No functional change in this commit. Link: https://lkml.kernel.org/r/20240619063355.171313-1-jiaqiyan@google.com Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
e5d896703d |
mm/memory-failure: correct comment in me_swapcache_dirty
Dirty swap cache page could live both in page table (not page cache) and swap cache when freshly swapped in. Correct comment. Link: https://lkml.kernel.org/r/20240612071835.157004-14-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
d49f2366e9 |
mm/memory-failure: remove obsolete comment in kill_proc()
When user sets SIGBUS to SIG_IGN, it won't cause loop now. For action required mce error, SIGBUS cannot be blocked. Also when a hwpoisoned page is re-accessed, kill_accessing_process() will be called to kill the process. Link: https://lkml.kernel.org/r/20240612071835.157004-13-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b71340ef56 |
mm/memory-failure: fix comment of get_hwpoison_page()
When return value is 0, it could also means the page is free hugetlb page or free buddy page. Fix the corresponding comment. Link: https://lkml.kernel.org/r/20240612071835.157004-12-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
28eab7d4e7 |
mm/memory-failure: remove obsolete comment in unpoison_memory()
Since commit
|
||
![]() |
96e13a4ea2 |
mm/memory-failure: use helper macro task_pid_nr()
Use helper macro task_pid_nr() to get the pid of a task. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-9-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5a8b01be4f |
mm/memory-failure: don't export hwpoison_filter() when !CONFIG_HWPOISON_INJECT
When CONFIG_HWPOISON_INJECT is not enabled, there is no user of the hwpoison_filter() outside memory-failure. So there is no need to export it in that case. Link: https://lkml.kernel.org/r/20240612071835.157004-8-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202406070136.hGQwVbsv-lkp@intel.com/ Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
4d64ab2f40 |
mm/memory-failure: remove confusing initialization to count
It's meaningless and confusing to init local variable count to 1. Remove it. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-7-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
7f8de2065d |
mm/memory-failure: remove unneeded empty string
Remove unneeded empty string in definition of macro pr_fmt. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b7c3afba24 |
mm/memory-failure: save some page_folio() calls
Use local variable folio directly to save a page_folio() call. Also use folio_mapped() to save more page_folio() calls. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
babde18650 |
mm/memory-failure: add macro GET_PAGE_MAX_RETRY_NUM
Add helper macro GET_PAGE_MAX_RETRY_NUM to replace magic number 3. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ceb32d6aa9 |
mm/memory-failure: remove MF_MSG_SLAB
Since commit
|
||
![]() |
1611753274 |
mm/memory-failure: simplify put_ref_page()
Patch series "Some cleanups for memory-failure", v3. This series contains a few cleanup patches to avoid exporting unused function, add helper macro, fix some obsolete comments and so on. More details can be found in the respective changelogs. This patch (of 13): Remove unneeded page != NULL check. pfn_to_page() won't return NULL. No functional change intended. Link: https://lkml.kernel.org/r/20240612071835.157004-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20240612071835.157004-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
29e9412b25 |
mm/memory-failure: stop setting the folio error flag
Nobody checks the error flag any more, so setting it accomplishes nothing. Remove the obsolete parts of this comment; it hasn't been true since errseq_t was used to track writeback errors in 2017. Link: https://lkml.kernel.org/r/20240531032938.2712870-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
1e25501dbc |
mm/memory-failure: use helper llist_for_each_entry()
Change the llist_for_each_entry_safe function to the llist_for_each_entry function and delete the next variable. Because the linked list is not modified,the llist_for_each_entry_safe function is not required. No functional changes are intended. Link: https://lkml.kernel.org/r/20240513075830.2611-1-liyifei28@huawei.com Signed-off-by: Yifei Li <liyifei28@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Jiaqi Yan <jiaqiyan@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
1a3798dece |
mm/memory-failure: send SIGBUS in the event of thp split fail
While handling hwpoison in a THP page, it is possible that try_to_split_thp_page() fails. For example, when the THP page has been RDMA pinned. At this point, the kernel cannot isolate the poisoned THP page, all it could do is to send a SIGBUS to the user process with meaningful payload to give user-level recovery a chance. Link: https://lkml.kernel.org/r/20240524215306.2705454-6-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9b0ab153d7 |
mm/memory-failure: move hwpoison_filter() higher up
Move hwpoison_filter() higher up as there is no need to spend a lot cycles only to find out later that the page is supposed to be skipped from hwpoison handling. Link: https://lkml.kernel.org/r/20240524215306.2705454-5-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b8b9488d50 |
mm/memory-failure: improve memory failure action_result messages
Added two explicit MF_MSG messages describing failure in get_hwpoison_page. Attemped to document the definition of various action names, and made a few adjustment to the action_result() calls. Link: https://lkml.kernel.org/r/20240524215306.2705454-4-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
aa298fdf53 |
mm/memory-failure: try to send SIGBUS even if unmap failed
Patch series "Enhance soft hwpoison handling and injection", v4. This series is aimed at the following enhancements: - Let one hwpoison injector, that is, madvise(MADV_HWPOISON) to behave more like as if a real UE occurred. Because the other two injectors such as hwpoison-inject and the 'einj' on x86 can't, and it seems to me we need a better simulation to real UE scenario. - For years, if the kernel is unable to unmap a hwpoisoned page, it send a SIGKILL instead of SIGBUS to prevent user process from potentially accessing the page again. But in doing so, the user process also lose important information: vaddr, for recovery. Fortunately, the kernel already has code to kill process re-accessing a hwpoisoned page, so remove the '!unmap_success' check. - Right now, if a thp page under GUP longterm pin is hwpoisoned, and kernel cannot split the thp page, memory-failure simply ignores the UE and returns. That's not ideal, it could deliver a SIGBUS with useful information for userspace recovery. This patch (of 5): For years when it comes down to kill a process due to hwpoison, a SIGBUS is delivered only if unmap has been successful. Otherwise, a SIGKILL is delivered. And the reason for that is to prevent the involved process from accessing the hwpoisoned page again. Since then a lot has changed, a hwpoisoned page is marked and upon being re-accessed, the memory-failure handler invokes kill_accessing_process() to kill the process immediately. So let's take out the '!unmap_success' factor and try to deliver SIGBUS if possible. Link: https://lkml.kernel.org/r/20240524215306.2705454-1-jane.chu@oracle.com Link: https://lkml.kernel.org/r/20240524215306.2705454-2-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <oalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
8cf360b9d6 |
mm/memory-failure: fix handling of dissolved but not taken off from buddy pages
When I did memory failure tests recently, below panic occurs:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page))
------------[ cut here ]------------
kernel BUG at include/linux/page-flags.h:1009!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
RIP: 0010:__del_page_from_free_list+0x151/0x180
RSP: 0018:ffffa49c90437998 EFLAGS: 00000046
RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0
RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69
R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80
R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009
FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0
Call Trace:
<TASK>
__rmqueue_pcplist+0x23b/0x520
get_page_from_freelist+0x26b/0xe40
__alloc_pages_noprof+0x113/0x1120
__folio_alloc_noprof+0x11/0xb0
alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130
__alloc_fresh_hugetlb_folio+0xe7/0x140
alloc_pool_huge_folio+0x68/0x100
set_max_huge_pages+0x13d/0x340
hugetlb_sysctl_handler_common+0xe8/0x110
proc_sys_call_handler+0x194/0x280
vfs_write+0x387/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xc2/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7ff916114887
RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887
RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003
RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0
R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004
R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00
</TASK>
Modules linked in: mce_inject hwpoison_inject
---[ end trace 0000000000000000 ]---
And before the panic, there had an warning about bad page state:
BUG: Bad page state in process page-types pfn:8cee00
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00
flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff)
page_type: 0xffffff7f(buddy)
raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000
raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000
page dumped because: nonzero mapcount
Modules linked in: mce_inject hwpoison_inject
CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22
Call Trace:
<TASK>
dump_stack_lvl+0x83/0xa0
bad_page+0x63/0xf0
free_unref_page+0x36e/0x5c0
unpoison_memory+0x50b/0x630
simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110
debugfs_attr_write+0x42/0x60
full_proxy_write+0x5b/0x80
vfs_write+0xcd/0x550
ksys_write+0x64/0xe0
do_syscall_64+0xc2/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f189a514887
RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887
RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003
RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2
R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8
R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040
</TASK>
The root cause should be the below race:
memory_failure
try_memory_failure_hugetlb
me_huge_page
__page_handle_poison
dissolve_free_hugetlb_folio
drain_all_pages -- Buddy page can be isolated e.g. for compaction.
take_page_off_buddy -- Failed as page is not in the buddy list.
-- Page can be putback into buddy after compaction.
page_ref_inc -- Leads to buddy page with refcnt = 1.
Then unpoison_memory() can unpoison the page and send the buddy page back
into buddy list again leading to the above bad page state warning. And
bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy
page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to
allocate this page.
Fix this issue by only treating __page_handle_poison() as successful when
it returns 1.
Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com
Fixes:
|
||
![]() |
fe6f86f4b4 |
mm/huge_memory: don't unpoison huge_zero_folio
When I did memory failure tests recently, below panic occurs:
kernel BUG at include/linux/mm.h:1135!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14
RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
Call Trace:
<TASK>
do_shrink_slab+0x14f/0x6a0
shrink_slab+0xca/0x8c0
shrink_node+0x2d0/0x7d0
balance_pgdat+0x33a/0x720
kswapd+0x1f3/0x410
kthread+0xd5/0x100
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
Modules linked in: mce_inject hwpoison_inject
---[ end trace 0000000000000000 ]---
RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0
RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246
RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0
RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492
R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00
FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0
The root cause is that HWPoison flag will be set for huge_zero_folio
without increasing the folio refcnt. But then unpoison_memory() will
decrease the folio refcnt unexpectedly as it appears like a successfully
hwpoisoned folio leading to VM_BUG_ON_PAGE(page_ref_count(page) == 0) when
releasing huge_zero_folio.
Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue.
We're not prepared to unpoison huge_zero_folio yet.
Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com
Fixes:
|
||
![]() |
89f5c54b22 |
memory-failure: remove calls to page_mapping()
This is mostly just inlining page_mapping() into the two callers. Link: https://lkml.kernel.org/r/20240423225552.4113447-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Eric Biggers <ebiggers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b650e1d2ae |
mm/memory-failure: pass the folio to collect_procs_ksm()
We've already calculated it, so pass it in instead of recalculating it in collect_procs_ksm(). Link: https://lkml.kernel.org/r/20240412193510.2356957-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
0edb5b282a |
mm/memory-failure: use folio functions throughout collect_procs()
Saves a couple of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
ee299e9849 |
mm/memory-failure: add some folio conversions to unpoison_memory
Some of these folio APIs didn't exist when the unpoison_memory() conversion was done originally. Link: https://lkml.kernel.org/r/20240412193510.2356957-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
03468a0f52 |
mm/memory-failure: convert hwpoison_user_mappings to take a folio
Pass the folio from the callers, and use it throughout instead of hpage. Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
5dba5c356a |
mm/memory-failure: convert memory_failure() to use a folio
Saves dozens of calls to compound_head(). Link: https://lkml.kernel.org/r/20240412193510.2356957-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |