linux-loongson

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson synced 2025-08-27 06:50:37 +00:00

Author	SHA1	Message	Date
Joshua Kinard	bb5b0b4317	rtc: ds1685: Update Joshua Kinard's email address. I am switching my address to a personal domain, so need to update the driver's files and the entry in MAINTAINERS. Signed-off-by: Joshua Kinard <kumba@gentoo.org> Link: https://lore.kernel.org/r/20250721170051.32407-1-kumba@gentoo.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 03:28:52 +02:00
Brian Masney	35d6aae85b	rtc: rv3032: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-15-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:06 +02:00
Brian Masney	c4253b0914	rtc: rv3028: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-14-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:06 +02:00
Brian Masney	e6f1af719e	rtc: pcf8563: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-13-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:06 +02:00
Brian Masney	ad853657d7	rtc: pcf85063: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-12-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	1251d043f7	rtc: nct3018y: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-11-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	9e0dfc7962	rtc: max31335: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-10-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	e05d81b75e	rtc: m41t80: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-9-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	394a4b920a	rtc: hym8563: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-8-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	31b5fea399	rtc: ds1307: convert from round_rate() to determine_rate() The round_rate() clk ops is deprecated, so migrate this driver from round_rate() to determine_rate() using the Coccinelle semantic patch on the cover letter of this series. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-7-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	b574acb3cf	rtc: rv3028: fix incorrect maximum clock rate handling When rv3028_clkout_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `f583c341a5` ("rtc: rv3028: add clkout support") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-6-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	906726a5ef	rtc: pcf8563: fix incorrect maximum clock rate handling When pcf8563_clkout_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `a39a6405d5` ("rtc: pcf8563: add CLKOUT to common clock framework") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-5-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	186ae18698	rtc: pcf85063: fix incorrect maximum clock rate handling When pcf85063_clkout_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `8c229ab604` ("rtc: pcf85063: Add pcf85063 clkout control to common clock framework") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-4-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	437c59e4b2	rtc: nct3018y: fix incorrect maximum clock rate handling When nct3018y_clkout_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `5adbaed16c` ("rtc: Add NCT3018Y real time clock driver") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-3-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	d0a518eb0a	rtc: hym8563: fix incorrect maximum clock rate handling When hym8563_clkout_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `dcaf038493` ("rtc: add hym8563 rtc-driver") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-2-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Brian Masney	cf6eb547a2	rtc: ds1307: fix incorrect maximum clock rate handling When ds3231_clk_sqw_round_rate() is called with a requested rate higher than the highest supported rate, it currently returns 0, which disables the clock. According to the clk API, round_rate() should instead return the highest supported rate. Update the function to return the maximum supported rate in this case. Fixes: `6c6ff145b3` ("rtc: ds1307: add clock provider support for DS3231") Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://lore.kernel.org/r/20250710-rtc-clk-round-rate-v1-1-33140bb2278e@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2025-08-03 02:57:05 +02:00
Helge Deller	e4fc307d8e	Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()" This reverts commit `864f9963ec`. The patch is wrong as it checks vc_origin against vc_screenbuf, while in text mode it should compare against vga_vram_base. As such it broke VGA text scrolling, which can be reproduced like this: (1) boot a kernel that is configured to use text mode VGA-console (2) type commands: ls -l /usr/bin \| less -S (3) scroll up/down with cursor-down/up keys Reported-by: Jari Ruusu <jariruusu@protonmail.com> Cc: stable@vger.kernel.org Cc: Yi Yang <yiyang13@huawei.com> Cc: GONG Ruiqi <gongruiqi1@huawei.com> Signed-off-by: Helge Deller <deller@gmx.de>	2025-08-02 21:47:33 +02:00
Sravan Kumar Gundu	af0db3c1f8	fbdev: Fix vmalloc out-of-bounds write in fast_imageblit This issue triggers when a userspace program does an ioctl FBIOPUT_CON2FBMAP by passing console number and frame buffer number. Ideally this maps console to frame buffer and updates the screen if console is visible. As part of mapping it has to do resize of console according to frame buffer info. if this resize fails and returns from vc_do_resize() and continues further. At this point console and new frame buffer are mapped and sets display vars. Despite failure still it continue to proceed updating the screen at later stages where vc_data is related to previous frame buffer and frame buffer info and display vars are mapped to new frame buffer and eventully leading to out-of-bounds write in fast_imageblit(). This bheviour is excepted only when fg_console is equal to requested console which is a visible console and updates screen with invalid struct references in fbcon_putcs(). Reported-and-tested-by: syzbot+c4b7aa0513823e2ea880@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c4b7aa0513823e2ea880 Signed-off-by: Sravan Kumar Gundu <sravankumarlpu@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>	2025-08-02 21:47:32 +02:00
Linus Torvalds	186f3edfdd	Pin control changes for v6.17 Core changes: - Open code PINCTRL_FUNCTION_DESC() instead of defining a complex macro only used in one place. - Add pinmux_generic_add_pinfunction() helper and use this in a few drivers. New drivers: - Amlogic S7, S7D and S6 pin control support. - Eswin EIC7700 pin control support. - Qualcomm PMIV0104, PM7550 and Milos pin control support. Because of unhelpful numbering schemes, the Qualcomm driver now needs to start to rely on SoC codenames. - STM32 HDP pin control support. - Mediatek MT8189 pin control support. Improvements: - Switch remaining pin control drivers over to the new GPIO set callback that provides a return value. - Support RSVD (reserved) pins in the STM32 driver. - Move many fixed assignments over to pinctrl_desc definitions. - Handle multiple TLMM regions in the Qualcomm driver. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEElDRnuGcz/wPCXQWMQRCzN7AZXXMFAmiN/SkACgkQQRCzN7AZ XXPeGw/7BMBf6Uuhs39qHjnLUUgp/H2yzRV7JB3Q99AZh++7mK0z4MchsZfjvXmv Ql2ADPHzmP9AJwSor/Ssvn4SrPwvC62IFBznB4eqPIL4UgWuIEYSJQNFMbZniFex kd8+7GAK7K5R5ReIWfUCs3xusO4+MShXZNKkWVaQZT+603kVznADGANBbEkOnXxY 06JKEo++QuChvLMckGOzyW8zAOV68YM2VYaZkuxxCIaIwKoNzGPKDt8NpPvaIijE S6EhrhRiM595Jt+qAC6lWtwGnFL5DI69Au2IDzaOSyamNLBoA/bmUu9UWB6/HxW2 yOhDW3DbXOB2xhUORlwCBtGsDyxLB9cIyBMjr6JantwPHdz8dzetxaTrwpuNdBQ+ +BgTodEuZf+TXroUQZ5sPRycEKZm1rtO7ctiZ5bG+CtP8qXcc+enMmC8BSCNCWzl bMOLsvP4ZMOuVU2ryOvhqKnbWxLS2RV5nHChtTF2JoE4ZX0dN/dhvGOe/A4dINhG 3Nb+ETmyEnid9PIPARYNy/7BkT92eEUQJlbI9qeU1AojGmRRQLS3+mJD9VcFSe1F /sjp5OYL2M/7SUpqBtlapLXN014gSAVV7zzQThndOYf8RJgohQkOWZsZUx7jyieA 4VYQzLWKAfP/IdOnKzM/8mAHw6VT9gJiWtNsc8ZdeYMwhYGSbbM= =hEj7 -----END PGP SIGNATURE----- Merge tag 'pinctrl-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control updates from Linus Walleij: "Nothing stands out, apart from maybe the interesting Eswin EIC7700, a RISC-V SoC I've never seen before. Core changes: - Open code PINCTRL_FUNCTION_DESC() instead of defining a complex macro only used in one place - Add pinmux_generic_add_pinfunction() helper and use this in a few drivers New drivers: - Amlogic S7, S7D and S6 pin control support - Eswin EIC7700 pin control support - Qualcomm PMIV0104, PM7550 and Milos pin control support Because of unhelpful numbering schemes, the Qualcomm driver now needs to start to rely on SoC codenames - STM32 HDP pin control support - Mediatek MT8189 pin control support Improvements: - Switch remaining pin control drivers over to the new GPIO set callback that provides a return value - Support RSVD (reserved) pins in the STM32 driver - Move many fixed assignments over to pinctrl_desc definitions - Handle multiple TLMM regions in the Qualcomm driver" * tag 'pinctrl-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (105 commits) pinctrl: mediatek: Add pinctrl driver for mt8189 dt-bindings: pinctrl: mediatek: Add support for mt8189 pinctrl: aspeed-g6: Add PCIe RC PERST pin group pinctrl: ingenic: use pinmux_generic_add_pinfunction() pinctrl: keembay: use pinmux_generic_add_pinfunction() pinctrl: mediatek: moore: use pinmux_generic_add_pinfunction() pinctrl: airoha: use pinmux_generic_add_pinfunction() pinctrl: equilibrium: use pinmux_generic_add_pinfunction() pinctrl: provide pinmux_generic_add_pinfunction() pinctrl: pinmux: open-code PINCTRL_FUNCTION_DESC() pinctrl: ma35: use new GPIO line value setter callbacks MAINTAINERS: add Clément Le Goffic as STM32 HDP maintainer pinctrl: stm32: Introduce HDP driver dt-bindings: pinctrl: stm32: Introduce HDP pinctrl: qcom: Add Milos pinctrl driver dt-bindings: pinctrl: document the Milos Top Level Mode Multiplexer pinctrl: qcom: spmi: Add PM7550 dt-bindings: pinctrl: qcom,pmic-gpio: Add PM7550 support pinctrl: qcom: spmi: Add PMIV0104 dt-bindings: pinctrl: qcom,pmic-gpio: Add PMIV0104 support ...	2025-08-02 12:07:09 -07:00
Yadan Fan	a2152fef29	mm: mempool: fix crash in mempool_free() for zero-minimum pools The mempool wake-up fix introduced in commit `a5867a218d` ("mm: mempool: fix wake-up edge case bug for zero-minimum pools") inlined the add_element() logic in mempool_free() to return the element to the zero-minimum pool: pool->elements[pool->curr_nr++] = element; This causes crash, because mempool_init_node() does not initialize with real allocation for zero-minimum pool, it only returns ZERO_SIZE_PTR to the elements array which is unable to be dereferenced, and the pre-allocation of this array never happened since the while test: while (pool->curr_nr < pool->min_nr) can never be satisfied as min_nr is zero, so the pool does not actually reserve any buffer, the only way so far is to call alloc_fn() to get buffer from SLUB, but if the memory is under high pressure the alloc_fn() could never get any buffer, the waiting thread would be in an indefinite loop of wake-sleep in a period until there is free memory to get. This patch changes mempool_init_node() to allocate 1 element for the elements array of zero-minimum pool, so that the pool will have reserved buffer to use. This will fix the crash issue and let the waiting thread can get the reserved element when alloc_fn() failed to get buffer under high memory pressure. Also modify add_element() to support zero-minimum pool with simplifying codes of zero-minimum handling in mempool_free(). Link: https://lkml.kernel.org/r/e01f00f3-58d9-4ca7-af54-bfa42fec9527@suse.com Fixes: `a5867a218d` ("mm: mempool: fix wake-up edge case bug for zero-minimum pools") Signed-off-by: Yadan Fan <ydfan@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:13 -07:00
Lorenzo Stoakes	f04fd85f15	mm: correct type for vmalloc vm_flags fields Several functions refer to the unfortunately named 'vm_flags' field when referencing vmalloc flags, which happens to be the precise same name used for VMA flags. As a result these were erroneously changed to use the vm_flags_t type (which currently is a typedef equivalent to unsigned long). Currently this has no impact, but in future when vm_flags_t changes this will result in issues, so change the type to unsigned long to account for this. [lorenzo.stoakes@oracle.com: fixup very disguised vmalloc flags parameter] Link: https://lkml.kernel.org/r/e74dd8de-7e60-47ab-8a45-2c851f3c5d26@lucifer.local Link: https://lkml.kernel.org/r/20250729114906.55347-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Harry Yoo <harry.yoo@oracle.com> Closes: https://lore.kernel.org/all/aIgSpAnU8EaIcqd9@hyeyoo/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:13 -07:00
Kairui Song	de55be4237	mm/shmem, swap: fix major fault counting If the swapin failed, don't update the major fault count. There is a long existing comment for doing it this way, now with previous cleanups, we can finally fix it. Link: https://lkml.kernel.org/r/20250728075306.12704-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:13 -07:00
Kairui Song	93c0476e70	mm/shmem, swap: rework swap entry and index calculation for large swapin Instead of calculating the swap entry differently in different swapin paths, calculate it early before the swap cache lookup and use that for the lookup and later swapin. And after swapin have brought a folio, simply round it down against the size of the folio. This is simple and effective enough to verify the swap value. A folio's swap entry is always aligned by its size. Any kind of parallel split or race is acceptable because the final shmem_add_to_page_cache ensures that all entries covered by the folio are correct, and thus there will be no data corruption. This also prevents false positive cache lookup. If a shmem read request's index points to the middle of a large swap entry, previously, shmem will try the swap cache lookup using the large swap entry's starting value (which is the first sub swap entry of this large entry). This will lead to false positive lookup results if only the first few swap entries are cached but the actual requested swap entry pointed by the index is uncached. This is not a rare event, as swap readahead always tries to cache order 0 folios when possible. And this shouldn't cause any increased repeated faults. Instead, no matter how the shmem mapping is split in parallel, as long as the mapping still contains the right entries, the swapin will succeed. The final object size and stack usage are also reduced due to simplified code: ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145) Function old new delta shmem_swapin_folio 4056 3911 -145 Total: Before=33242, After=33097, chg -0.44% Stack usage (Before vs After): mm/shmem.c:2314:12:shmem_swapin_folio 264 static mm/shmem.c:2314:12:shmem_swapin_folio 256 static And while at it, round down the index too if swap entry is round down. The index is used either for folio reallocation or confirming the mapping content. In either case, it should be aligned with the swap folio. Link: https://lkml.kernel.org/r/20250728075306.12704-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:13 -07:00
Kairui Song	1326359f22	mm/shmem, swap: simplify swapin path and result handling Slightly tidy up the different handling of swap in and error handling for SWP_SYNCHRONOUS_IO and non-SWP_SYNCHRONOUS_IO devices. Now swapin will always use either shmem_swap_alloc_folio or shmem_swapin_cluster, then check the result. Simplify the control flow and avoid a redundant goto label. Link: https://lkml.kernel.org/r/20250728075306.12704-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:13 -07:00
Kairui Song	69805ea79d	mm/shmem, swap: never use swap cache and readahead for SWP_SYNCHRONOUS_IO For SWP_SYNCHRONOUS_IO devices, if a cache bypassing THP swapin failed due to reasons like memory pressure, partially conflicting swap cache or ZSWAP enabled, shmem will fallback to cached order 0 swapin. Right now the swap cache still has a non-trivial overhead, and readahead is not helpful for SWP_SYNCHRONOUS_IO devices, so we should always skip the readahead and swap cache even if the swapin falls back to order 0. So handle the fallback logic without falling back to the cached read. Link: https://lkml.kernel.org/r/20250728075306.12704-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Kairui Song	91ab656ece	mm/shmem, swap: tidy up swap entry splitting Instead of keeping different paths of splitting the entry before the swap in start, move the entry splitting after the swapin has put the folio in swap cache (or set the SWAP_HAS_CACHE bit). This way we only need one place and one unified way to split the large entry. Whenever swapin brought in a folio smaller than the shmem swap entry, split the entry and recalculate the entry and index for verification. This removes duplicated codes and function calls, reduces LOC, and the split is less racy as it's guarded by swap cache now. So it will have a lower chance of repeated faults due to raced split. The compiler is also able to optimize the coder further: bloat-o-meter results with GCC 14: With DEBUG_SECTION_MISMATCH (-fno-inline-functions-called-once): ./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-143 (-143) Function old new delta shmem_swapin_folio 2358 2215 -143 Total: Before=32933, After=32790, chg -0.43% With !DEBUG_SECTION_MISMATCH: add/remove: 0/1 grow/shrink: 1/0 up/down: 1069/-749 (320) Function old new delta shmem_swapin_folio 2871 3940 +1069 shmem_split_large_entry.isra 749 - -749 Total: Before=32806, After=33126, chg +0.98% Since shmem_split_large_entry is only called in one place now. The compiler will either generate more compact code, or inlined it for better performance. Link: https://lkml.kernel.org/r/20250728075306.12704-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Kairui Song	c262ffd72c	mm/shmem, swap: tidy up THP swapin checks Move all THP swapin related checks under CONFIG_TRANSPARENT_HUGEPAGE, so they will be trimmed off by the compiler if not needed. And add a WARN if shmem sees a order > 0 entry when CONFIG_TRANSPARENT_HUGEPAGE is disabled, that should never happen unless things went very wrong. There should be no observable feature change except the new added WARN. Link: https://lkml.kernel.org/r/20250728075306.12704-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Kairui Song	0cfc0e7e3d	mm/shmem, swap: avoid redundant Xarray lookup during swapin Patch series "mm/shmem, swap: bugfix and improvement of mTHP swap in", v6. The current THP swapin path have several problems. It may potentially hang, may cause redundant faults due to false positive swap cache lookup, and it issues redundant Xarray walks. !CONFIG_TRANSPARENT_HUGEPAGE builds may also contain unnecessary THP checks. This series fixes all of the mentioned issues, the code should be more robust and prepared for the swap table series. Now 4 walks is reduced to 3 (get order & confirm, confirm, insert folio), !CONFIG_TRANSPARENT_HUGEPAGE build overhead is also minimized, and comes with a sanity check now. The performance is slightly better after this series, sequential swap in of 24G data from ZRAM, using transparent_hugepage_tmpfs=always (24 samples each): Before: avg: 10.66s, stddev: 0.04 After patch 1: avg: 10.58s, stddev: 0.04 After patch 2: avg: 10.65s, stddev: 0.05 After patch 3: avg: 10.65s, stddev: 0.04 After patch 4: avg: 10.67s, stddev: 0.04 After patch 5: avg: 9.79s, stddev: 0.04 After patch 6: avg: 9.79s, stddev: 0.05 After patch 7: avg: 9.78s, stddev: 0.05 After patch 8: avg: 9.79s, stddev: 0.04 Several patches improve the performance by a little, which is about ~8% faster in total. Build kernel test showed very slightly improvement, testing with make -j48 with defconfig in a 768M memcg also using ZRAM as swap, and transparent_hugepage_tmpfs=always (6 test runs): Before: avg: 3334.66s, stddev: 43.76 After patch 1: avg: 3349.77s, stddev: 18.55 After patch 2: avg: 3325.01s, stddev: 42.96 After patch 3: avg: 3354.58s, stddev: 14.62 After patch 4: avg: 3336.24s, stddev: 32.15 After patch 5: avg: 3325.13s, stddev: 22.14 After patch 6: avg: 3285.03s, stddev: 38.95 After patch 7: avg: 3287.32s, stddev: 26.37 After patch 8: avg: 3295.87s, stddev: 46.24 This patch (of 7): Currently shmem calls xa_get_order to get the swap radix entry order, requiring a full tree walk. This can be easily combined with the swap entry value checking (shmem_confirm_swap) to avoid the duplicated lookup and abort early if the entry is gone already. Which should improve the performance. Link: https://lkml.kernel.org/r/20250728075306.12704-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250728075306.12704-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Mike Rapoport (Microsoft)	5d79c2be50	x86/ftrace: enable EXECMEM_ROX_CACHE for ftrace allocations For the most part ftrace uses text poking and can handle ROX memory. The only place that requires writable memory is create_trampoline() that updates the allocated memory and in the end makes it ROX. Use execmem_alloc_rw() in x86::ftrace::alloc_tramp() and enable ROX cache for EXECMEM_FTRACE when configuration and CPU features allow that. Link: https://lkml.kernel.org/r/20250713071730.4117334-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Mike Rapoport (Microsoft)	36de1e4238	x86/kprobes: enable EXECMEM_ROX_CACHE for kprobes allocations x86::alloc_insn_page() always allocates ROX memory. Instead of overriding this method, add EXECMEM_KPROBES entry in execmem_info with pgprot set to PAGE_KERNEL_ROX and use ROX cache when configuration and CPU features allow it. Link: https://lkml.kernel.org/r/20250713071730.4117334-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Mike Rapoport (Microsoft)	ab674b6871	execmem: drop writable parameter from execmem_fill_trapping_insns() After update of execmem_cache_free() that made memory writable before updating it, there is no need to update read only memory, so the writable parameter to execmem_fill_trapping_insns() is not needed. Drop it. Link: https://lkml.kernel.org/r/20250713071730.4117334-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:12 -07:00
Mike Rapoport (Microsoft)	3bd4e0ac61	execmem: add fallback for failures in vmalloc(VM_ALLOW_HUGE_VMAP) When execmem populates ROX cache it uses vmalloc(VM_ALLOW_HUGE_VMAP). Although vmalloc falls back to allocating base pages if high order allocation fails, it may happen that it still cannot allocate enough memory. Right now ROX cache is only used by modules and in majority of cases the allocations happen at boot time when there's plenty of free memory, but upcoming enabling ROX cache for ftrace and kprobes would mean that execmem allocations can happen when the system is under memory pressure and a failure to allocate large page worth of memory becomes more likely. Fallback to regular vmalloc() if vmalloc(VM_ALLOW_HUGE_VMAP) fails. Link: https://lkml.kernel.org/r/20250713071730.4117334-6-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Mike Rapoport (Microsoft)	888b5a847b	execmem: move execmem_force_rw() and execmem_restore_rox() before use to avoid static declarations. Link: https://lkml.kernel.org/r/20250713071730.4117334-5-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Mike Rapoport (Microsoft)	187fd8521d	execmem: rework execmem_cache_free() Currently execmem_cache_free() ignores potential allocation failures that may happen in execmem_cache_add(). Besides, it uses text poking to fill the memory with trapping instructions before returning it to cache although it would be more efficient to make that memory writable, update it using memcpy and then restore ROX protection. Rework execmem_cache_free() so that in case of an error it will defer freeing of the memory to a delayed work. With this the happy fast path will now change permissions to RW, fill the memory with trapping instructions using memcpy, restore ROX permissions, add the memory back to the free cache and clear the relevant entry in busy_areas. If any step in the fast path fails, the entry in busy_areas will be marked as pending_free. These entries will be handled by a delayed work and freed asynchronously. To make the fast path faster, use __GFP_NORETRY for memory allocations and let asynchronous handler try harder with GFP_KERNEL. Link: https://lkml.kernel.org/r/20250713071730.4117334-4-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Mike Rapoport (Microsoft)	838955f64a	execmem: introduce execmem_alloc_rw() Some callers of execmem_alloc() require the memory to be temporarily writable even when it is allocated from ROX cache. These callers use execemem_make_temp_rw() right after the call to execmem_alloc(). Wrap this sequence in execmem_alloc_rw() API. Link: https://lkml.kernel.org/r/20250713071730.4117334-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Mike Rapoport (Microsoft)	fcd90ad31e	execmem: drop unused execmem_update_copy() Patch series "x86: enable EXECMEM_ROX_CACHE for ftrace and kprobes", v3. These patches enable use of EXECMEM_ROX_CACHE for ftrace and kprobes allocations on x86. They also include some ground work in execmem. Since the execmem model for caching large ROX pages changed from the initial assumption that the memory that is allocated from ROX cache is always ROX to the current state where memory can be temporarily made RW and then restored to ROX, we can stop using text poking to update it. This also saves the hassle of trying lock text_mutex in execmem_cache_free() when kprobes already hold that mutex. This patch (of 8): The execmem_update_copy() that used text poking was required when memory allocated from ROX cache was always read-only. Since now its permissions can be switched to read-write there is no need in a function that updates memory with text poking. Remove it. Link: https://lkml.kernel.org/r/20250713071730.4117334-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20250713071730.4117334-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Suren Baghdasaryan	9bbffee67f	mm: fix a UAF when vma->mm is freed after vma->vm_refcnt got dropped By inducing delays in the right places, Jann Horn created a reproducer for a hard to hit UAF issue that became possible after VMAs were allowed to be recycled by adding SLAB_TYPESAFE_BY_RCU to their cache. Race description is borrowed from Jann's discovery report: lock_vma_under_rcu() looks up a VMA locklessly with mas_walk() under rcu_read_lock(). At that point, the VMA may be concurrently freed, and it can be recycled by another process. vma_start_read() then increments the vma->vm_refcnt (if it is in an acceptable range), and if this succeeds, vma_start_read() can return a recycled VMA. In this scenario where the VMA has been recycled, lock_vma_under_rcu() will then detect the mismatching ->vm_mm pointer and drop the VMA through vma_end_read(), which calls vma_refcount_put(). vma_refcount_put() drops the refcount and then calls rcuwait_wake_up() using a copy of vma->vm_mm. This is wrong: It implicitly assumes that the caller is keeping the VMA's mm alive, but in this scenario the caller has no relation to the VMA's mm, so the rcuwait_wake_up() can cause UAF. The diagram depicting the race: T1 T2 T3 == == == lock_vma_under_rcu mas_walk <VMA gets removed from mm> mmap <the same VMA is reallocated> vma_start_read __refcount_inc_not_zero_limited_acquire munmap __vma_enter_locked refcount_add_not_zero vma_end_read vma_refcount_put __refcount_dec_and_test rcuwait_wait_event <finish operation> rcuwait_wake_up [UAF] Note that rcuwait_wait_event() in T3 does not block because refcount was already dropped by T1. At this point T3 can exit and free the mm causing UAF in T1. To avoid this we move vma->vm_mm verification into vma_start_read() and grab vma->vm_mm to stabilize it before vma_refcount_put() operation. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20250729145709.2731370-1-surenb@google.com Link: https://lkml.kernel.org/r/20250728175355.2282375-1-surenb@google.com Fixes: `3104138517` ("mm: make vma cache SLAB_TYPESAFE_BY_RCU") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/all/CAG48ez0-deFbVH=E3jbkWx=X3uVbd8nWeo6kbJPQ0KoUD+m2tA@mail.gmail.com/ Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Jann Horn	a222439e1e	mm/rmap: add anon_vma lifetime debug check If an anon folio is mapped into userspace, its anon_vma must be alive, otherwise rmap walks can hit UAF. There have been syzkaller reports a few months ago[1][2] of UAF in rmap walks that seems to indicate that there can be pages with elevated mapcount whose anon_vma has already been freed, but I think we never figured out what the cause is; and syzkaller only hit these UAFs when memory pressure randomly caused reclaim to rmap-walk the affected pages, so it of course didn't manage to create a reproducer. Add a VM_WARN_ON_FOLIO() when we add/remove mappings of anonymous folios to hopefully catch such issues more reliably. [1] https://lore.kernel.org/r/67abaeaf.050a0220.110943.0041.GAE@google.com [2] https://lore.kernel.org/r/67a76f33.050a0220.3d72c.0028.GAE@google.com Link: https://lkml.kernel.org/r/20250725-anonvma-uaf-debug-v2-1-bc3c7e5ba5b1@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Harry Yoo <harry.yoo@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:11 -07:00
Lorenzo Stoakes	9a4f90e246	mm: remove mm/io-mapping.c This is dead code, which was used from commit `b739f125e4` ("i915: use io_mapping_map_user") but reverted a month later by commit `0e4fe0c9f2` ("Revert "i915: use io_mapping_map_user"") back in 2021. Since then nobody has used it, so remove it. [akpm@linux-foundation.org: update Documentation/core-api/mm-api.rst, per Vlastimil] Link: https://lkml.kernel.org/r/20250725142901.81502-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
Dev Jain	22d0229093	khugepaged: optimize collapse_pte_mapped_thp() by PTE batching Use PTE batching to batch process PTEs mapping the same large folio. An improvement is expected due to batching mapcount manipulation on the folios, and for arm64 which supports contig mappings, the number of TLB flushes is also reduced. Note that we do not need to make a change to the check "if (folio_page(folio, i) != page)"; if i'th page of the folio is equal to the first page of our batch, then i + 1, .... i + nr_batch_ptes - 1 pages of the folio will be equal to the corresponding pages of our batch mapping consecutive pages. Link: https://lkml.kernel.org/r/20250724052301.23844-4-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
Dev Jain	4ea3594a47	khugepaged: optimize __collapse_huge_page_copy_succeeded() by PTE batching Use PTE batching to batch process PTEs mapping the same large folio. An improvement is expected due to batching refcount-mapcount manipulation on the folios, and for arm64 which supports contig mappings, the number of TLB flushes is also reduced. Link: https://lkml.kernel.org/r/20250724052301.23844-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
David Hildenbrand	3dfde97800	mm: add get_and_clear_ptes() and clear_ptes() Patch series "Optimizations for khugepaged", v4. If the underlying folio mapped by the ptes is large, we can process those ptes in a batch using folio_pte_batch(). For arm64 specifically, this results in a 16x reduction in the number of ptep_get() calls, since on a contig block, ptep_get() on arm64 will iterate through all 16 entries to collect a/d bits. Next, ptep_clear() will cause a TLBI for every contig block in the range via contpte_try_unfold(). Instead, use clear_ptes() to only do the TLBI at the first and last contig block of the range. For split folios, there will be no pte batching; the batch size returned by folio_pte_batch() will be 1. For pagetable split folios, the ptes will still point to the same large folio; for arm64, this results in the optimization described above, and for other arches, a minor improvement is expected due to a reduction in the number of function calls and batching atomic operations. This patch (of 3): Let's add variants to be used where "full" does not apply -- which will be the majority of cases in the future. "full" really only applies if we are about to tear down a full MM. Use get_and_clear_ptes() in existing code, clear_ptes() users will be added next. Link: https://lkml.kernel.org/r/20250724052301.23844-2-dev.jain@arm.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
Jinjiang Tu	1623717b05	mm/mincore: hold PTL in mincore_hugetlb Hold PTL in mincore_hugetlb() to avoid operating on stale page, as mincore_pte_range() have done. Link: https://lkml.kernel.org/r/20250724090958.455887-4-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joern Engel <joern@logfs.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
Jinjiang Tu	9109bd5255	mm/memory-failure: hold PTL in hwpoison_hugetlb_range Hold PTL in hwpoison_hugetlb_range() to avoid operating on stale page, as hwpoison_pte_range() have done. This change is not known to address any issues which users have experienced. Link: https://lkml.kernel.org/r/20250725033112.2690158-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brahmajit Das <brahmajit.xyz@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joern Engel <joern@logfs.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:10 -07:00
Lorenzo Stoakes	6c2da14ae1	mm/mseal: rework mseal apply logic The logic can be simplified - firstly by renaming the inconsistently named apply_mm_seal() to mseal_apply(). We then wrap mseal_fixup() into the main loop as the logic is simple enough to not require it, equally it isn't a hugely pleasant pattern in mprotect() etc. so it's not something we want to perpetuate. We eliminate the need for invoking vma_iter_end() on each loop by directly determining if the VMA was merged - the only thing we need concern ourselves with is whether the start/end of the (gapless) range are offset into VMAs. This refactoring also avoids the rather horrid 'pass pointer to prev around' pattern used in mprotect() et al. No functional change intended. Link: https://lkml.kernel.org/r/ddfa4376ce29f19a589d7dc8c92cb7d4f7605a4c.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00
Lorenzo Stoakes	530e090964	mm/mseal: simplify and rename VMA gap check The check_mm_seal() function is doing something general - checking whether a range contains only VMAs (or rather that it does NOT contain any unmapped regions). So rename this function to range_contains_unmapped(). Additionally simplify the logic, we are simply checking whether the last vma->vm_end has either a VMA starting after it or ends before the end parameter. This check is rather dubious, so it is sensible to keep it local to mm/mseal.c as at a later stage it may be removed, and we don't want any other mm code to perform such a check. No functional change intended. [lorenzo.stoakes@oracle.com: add comment explaining why we disallow gaps on mseal()] Link: https://lkml.kernel.org/r/d85b3d55-09dc-43ba-8204-b48267a96751@lucifer.local Link: https://lkml.kernel.org/r/dd50984eff1e242b5f7f0f070a3360ef760e06b8.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00
Lorenzo Stoakes	8b2914162a	mm/mseal: small cleanups Drop the wholly unnecessary set_vma_sealed() helper(), which is used only once, and place VMA_ITERATOR() declarations in the correct place. Retain vma_is_sealed(), and use it instead of the confusingly named can_modify_vma(), so it's abundantly clear what's being tested, rather then a nebulous sense of 'can the VMA be modified'. No functional change intended. Link: https://lkml.kernel.org/r/98cf28d04583d632a6eb698e9ad23733bb6af26b.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Xu <jeffxu@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00
Lorenzo Stoakes	d0b47a6866	mm/mseal: update madvise() logic The madvise() logic is inexplicably performed in mm/mseal.c - this ought to be located in mm/madvise.c. Additionally can_modify_vma_madv() is inconsistently named and, in combination with is_ro_anon(), is very confusing logic. Put a static function in mm/madvise.c instead - can_madvise_modify() - that spells out exactly what's happening. Also explicitly check for an anon VMA. Also add commentary to explain what's going on. Essentially - we disallow discarding of data in mseal()'d mappings in instances where the user couldn't otherwise write to that data. We retain the existing behaviour here regarding MAP_PRIVATE mappings of file-backed mappings, which entails some complexity - while this, strictly speaking - appears to violate mseal() semantics, it may interact badly with users which expect to be able to madvise(MADV_DONTNEED) .text mappings for instance. We may revisit this at a later date. No functional change intended. Link: https://lkml.kernel.org/r/492a98d9189646e92c8f23f4cce41ed323fe01df.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00
Lorenzo Stoakes	f225b34f1e	mm/mseal: always define VM_SEALED Patch series "mseal cleanups", v4. Perform a number of cleanups to the mseal logic. Firstly, VM_SEALED is treated differently from every other VMA flag, it really doesn't make sense to do this, so we start by making this consistent with everything else. Next we place the madvise logic where it belongs - in mm/madvise.c. It really makes no sense to abstract this elsewhere. In doing so, we go to great lengths to explain very clearly the previously very confusing logic as to what sealed mappings are impacted here. In doing so, we retain existing logic regarding treatment of madvise() discard operations for a sealed, read-only MAP_PRIVATE file-backed mapping. This is something we likely need to revisit. We then abstract out and explain the 'are there are any gaps in this range in the mm?' check being performed as a prerequisite to mseal being performed. Finally, we simplify the actual mseal logic which is really quite straightforward. No functional change is intended. This patch (of 4): There is no reason to treat VM_SEALED in a special way, in each other case in which a VMA flag is unavailable due to configuration, we simply assign that flag to VM_NONE, so make VM_SEALED consistent with all other VMA flags in this respect. Additionally, use the next available bit for VM_SEALED, 42, rather than arbitrarily putting it at 63 and update the declaration to match all other VMA flags. No functional change intended. Link: https://lkml.kernel.org/r/cover.1753431105.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/aeb398a77029b6e7377cd944328bc9bbc3c90537.1753431105.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00
Bijan Tabatabai	dee3ab621f	mm/damon/vaddr: skip isolating folios already in destination nid damos_va_migrate_dests_add() determines the node a folio should be in based on the struct damos_migrate_dests associated with the migration scheme and adds the folio to the linked list corresponding to that node so it can be migrated later. Currently, folios are isolated and added to the list even if they are already in the node they should be in. In using damon weighted interleave more, I've found that the overhead of needlessly adding these folios to the migration lists can be quite high. The overhead comes from isolating folios and placing them in the migration lists inside of damos_va_migrate_dests_add(), as well as the cost of handling those folios in damon_migrate_pages(). This patch eliminates that overhead by simply avoiding the addition of folios that are already in their intended location to the migration list. To show the benefit of this patch, we start the test workload and start a DAMON instance attached to that workload with a migrate_hot scheme that has one dest field sending data to the local node. This way, we are only measuring the overheads of the scheme, and not the cost of migrating pages, since data will be allocated to the local node by default. I tested with two workloads: the embedding reduction workload used in [1] and a microbenchmark that allocates 20GB of data then sleeps, which is similar to the memory usage of the embedding reduction workload. The time taken in damos_va_migrate_dests_add() and damon_migrate_pages() each aggregation interval is shown below. Before this patch: damos_va_migrate_dests_add damon_migrate_pages microbenchmark ~2ms ~3ms embedding reduction ~1s ~3s After this patch: damos_va_migrate_dests_add damon_migrate_pages microbenchmark 0us ~40us embedding reduction 0us ~100us I did not do an in depth analysis for why things are much slower in the embedding reduction workload than the microbenchmark. However, I assume it's because the embedding reduction workload oversaturates the bandwidth of the local memory node, increasing the memory access latency, and in turn making the pointer chasing involved in iterating through a linked list much slower. Regardless of that, this patch results in a significant speedup. [1] https://lore.kernel.org/damon/20250709005952.17776-1-bijan311@gmail.com/ Link: https://lkml.kernel.org/r/20250725163300.4602-1-bijan311@gmail.com Fixes: `19c1dc15c8` ("mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-08-02 12:06:09 -07:00

... 9 10 11 12 13 ...

1381929 Commits