Enable the boot_printk() function to print decimal values. Add the 'd',
'i', and 'u' conversion specifiers support.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Enhance boot_printk() to support field width and padding across all
argument types for better formatting.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Add support for the 'l', 'h', 'hh', and 'z' length modifiers.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Add "%%" support for the boot_printk() format string.
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
For KASAN shadow mappings, switch from physmem_alloc_or_die() to
physmem_alloc() and return INVALID_PHYS_ADDR on allocation failure. This
allows gracefully falling back from large pages to smaller pages (1MB
or 4KB) if the allocation of 2GB size/aligned pages cannot be fulfilled.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Add physmem_alloc() as a variant of physmem_alloc_or_die() that can return
an error instead of triggering a panic on OOM. This allows callers to
implement alternative fallback paths.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The new name better reflects the function's behavior, emphasizing that
it will terminate execution if allocation fails.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Commit c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address
spaces") introduced a large_allowed() helper that restricts which mapping
modes can use large pages. This change unintentionally prevented KASAN
shadow mappings from using large pages, despite there being no reason
to avoid them. In fact, large pages are preferred for performance.
Since commit d8073dc6bc ("s390/mm: Allow large pages only for aligned
physical addresses"), both can_large_pud() and can_large_pmd() call _pa()
to check if large page physical addresses are aligned. However, _pa()
has a side effect: it allocates memory in POPULATE_KASAN_MAP_SHADOW
mode.
Rename large_allowed() to large_page_mapping_allowed() and add
POPULATE_KASAN_MAP_SHADOW to the allowed list to restore large page
mappings for KASAN shadows.
While large_page_mapping_allowed() isn't strictly necessary with current
mapping modes since disallowed modes either don't map anything or fail
alignment and size checks, keep it for clarity.
Rename _pa() to resolve_pa_may_alloc() for clarity and to emphasize
existing side effect. Rework can_large_pud()/can_large_pmd() to take
the side effect into consideration and actually return physical address
instead of just checking conditions.
Fixes: c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
- Select config option KASAN_VMALLOC if KASAN is enabled
- Select config option VMAP_STACK unconditionally
- Implement arch_atomic_inc() / arch_atomic_dec() functions
which result in a single instruction if compiled for z196
or newer architectures
- Make layering between atomic.h and atomic_ops.h consistent
- Comment s390 preempt_count implementation
- Remove pre MARCH_HAS_Z196_FEATURES preempt count implementation
- GCC uses the number of lines of an inline assembly to calculate
number of instructions and decide on inlining. Therefore remove
superfluous new lines from a couple of inline assemblies.
- Provide arch_atomic_*_and_test() implementations that allow the
compiler to generate slightly better code.
- Optimize __preempt_count_dec_and_test()
- Remove __bootdata annotations from declarations in header files
- Add missing include of <linux/smp.h> in abs_lowcore.h to provide
declarations for get_cpu() and put_cpu() used in the code
- Fix suboptimal kernel image base when running make kasan.config
- Remove huge_pte_none() and huge_pte_none_mostly() as are identical
to the generic variants
- Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC,
and REGION3_KERNEL_EXEC defines
- Simplify noexec page protection handling and change the page,
segment and region3 protection definitions automatically if the
instruction execution-protection facility is not available
- Save one instruction and prefer EXRL instruction over EX in
string, xor_*(), amode31 and other functions
- Create /dev/diag misc device to fetch diagnose specific
information from the kernel and provide it to userspace
- Retrieve electrical power readings using DIAGNOSE 0x324 ioctl
- Make ccw_device_get_ciw() consistent and use array indices
instead of pointer arithmetic
* s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into one place
- The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that in s390 code
- Add missing TLB range adjustment in pud_free_tlb()
- Improve topology setup by adding early polarization detection
- Fix length checks in codepage_convert() function
- The generic bitops implementation is nearly identical to the s390 one.
Switch to the generic variant and decrease a bit the kernel image size
- Provide an optimized arch_test_bit() implementation which makes use of
flag output constraint. This generates slightly better code
- Provide memory topology information obtanied with DIAGNOSE 0x310
using ioctl.
- Various other small improvements, fixes, and cleanups
These changes were added with a merge of 'pci-device-recovery' branch
- Add PCI error recovery status mechanism
- Simplify and document debug_next_entry() logic
- Split private data allocation and freeing out of debug file
open() and close() operations
- Add debug_dump() function that gets a textual representation
of a debug info (e.g. PCI recovery hardware error logs)
- Add formatted content of pci_debug_msg_id to the PCI report
-----BEGIN PGP SIGNATURE-----
iI0EABYKADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZ4pkVxccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8GulAQDg/7pCj1fXH5XKN9W16972OYQD
pNwfCekw8suO8HUBCgEAkzdgTJC6/thifrnUt+Gj8HqASh//Qzw/6Q2Jk6595gk=
=lE3V
-----END PGP SIGNATURE-----
Merge tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Select config option KASAN_VMALLOC if KASAN is enabled
- Select config option VMAP_STACK unconditionally
- Implement arch_atomic_inc() / arch_atomic_dec() functions which
result in a single instruction if compiled for z196 or newer
architectures
- Make layering between atomic.h and atomic_ops.h consistent
- Comment s390 preempt_count implementation
- Remove pre MARCH_HAS_Z196_FEATURES preempt count implementation
- GCC uses the number of lines of an inline assembly to calculate
number of instructions and decide on inlining. Therefore remove
superfluous new lines from a couple of inline assemblies.
- Provide arch_atomic_*_and_test() implementations that allow the
compiler to generate slightly better code.
- Optimize __preempt_count_dec_and_test()
- Remove __bootdata annotations from declarations in header files
- Add missing include of <linux/smp.h> in abs_lowcore.h to provide
declarations for get_cpu() and put_cpu() used in the code
- Fix suboptimal kernel image base when running make kasan.config
- Remove huge_pte_none() and huge_pte_none_mostly() as are identical to
the generic variants
- Remove unused PAGE_KERNEL_EXEC, SEGMENT_KERNEL_EXEC, and
REGION3_KERNEL_EXEC defines
- Simplify noexec page protection handling and change the page, segment
and region3 protection definitions automatically if the instruction
execution-protection facility is not available
- Save one instruction and prefer EXRL instruction over EX in string,
xor_*(), amode31 and other functions
- Create /dev/diag misc device to fetch diagnose specific information
from the kernel and provide it to userspace
- Retrieve electrical power readings using DIAGNOSE 0x324 ioctl
- Make ccw_device_get_ciw() consistent and use array indices instead of
pointer arithmetic
- s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into
one place
- The sysfs core now allows instances of 'struct bin_attribute' to be
moved into read-only memory. Make use of that in s390 code
- Add missing TLB range adjustment in pud_free_tlb()
- Improve topology setup by adding early polarization detection
- Fix length checks in codepage_convert() function
- The generic bitops implementation is nearly identical to the s390
one. Switch to the generic variant and decrease a bit the kernel
image size
- Provide an optimized arch_test_bit() implementation which makes use
of flag output constraint. This generates slightly better code
- Provide memory topology information obtanied with DIAGNOSE 0x310
using ioctl.
- Various other small improvements, fixes, and cleanups
Also, some changes came in through a merge of 'pci-device-recovery'
branch:
- Add PCI error recovery status mechanism
- Simplify and document debug_next_entry() logic
- Split private data allocation and freeing out of debug file open()
and close() operations
- Add debug_dump() function that gets a textual representation of a
debug info (e.g. PCI recovery hardware error logs)
- Add formatted content of pci_debug_msg_id to the PCI report
* tag 's390-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (48 commits)
s390/futex: Fix FUTEX_OP_ANDN implementation
s390/diag: Add memory topology information via diag310
s390/bitops: Provide optimized arch_test_bit()
s390/bitops: Switch to generic bitops
s390/ebcdic: Fix length decrement in codepage_convert()
s390/ebcdic: Fix length check in codepage_convert()
s390/ebcdic: Use exrl instead of ex
s390/amode31: Use exrl instead of ex
s390/stackleak: Use exrl instead of ex in __stackleak_poison()
s390/lib: Use exrl instead of ex in xor functions
s390/topology: Improve topology detection
s390/tlb: Add missing TLB range adjustment
s390/pkey: Constify 'struct bin_attribute'
s390/sclp: Constify 'struct bin_attribute'
s390/pci: Constify 'struct bin_attribute'
s390/ipl: Constify 'struct bin_attribute'
s390/crypto/cpacf: Constify 'struct bin_attribute'
s390/qdio: Move memory alloc/pointer arithmetic for slib and sl into one place
s390/cio: Use array indices instead of pointer arithmetic
s390/qdio: Rename feature flag aif_osa to aif_qdio
...
By default page protection definitions like PAGE_RX have the _PAGE_NOEXEC
bit set. For older machines without the instruction execution protection
facility this bit is not allowed to be used in page table entries, and
therefore must be removed.
This is done at a couple of page table walkers, but also at some but not
all page table modification functions like ptep_modify_prot_commit(). Avoid
all of this and change the page, segment and region3 protection definitions
so that the noexec bit is masked out automatically if the instruction
execution-protection facility is not available. This is similar to what
also various other architectures do which had to solve the same problem.
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The calculation determining whether to use three- or four-level paging
didn't account for KMSAN modules metadata. Include this metadata in the
virtual memory size calculation to ensure correct paging mode selection
and avoiding potentially unnecessary physical memory size limitations.
Fixes: 65ca73f9fb ("s390/mm: define KMSAN metadata for vmalloc and modules")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reduce the number of to be considered config options and select
KASAN_VMALLOC if KASAN is enabled.
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
With uncoupling of physical and virtual address spaces population of
the identity mapping was changed to use the type POPULATE_IDENTITY
instead of POPULATE_DIRECT. This breaks DirectMap accounting:
> cat /proc/meminfo
DirectMap4k: 55296 kB
DirectMap1M: 18446744073709496320 kB
Adjust all locations of update_page_count() in vmem.c to use
POPULATE_IDENTITY instead of POPULATE_DIRECT as well. With this
accounting is correct again:
> cat /proc/meminfo
DirectMap4k: 54264 kB
DirectMap1M: 8334336 kB
Fixes: c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")
Cc: stable@vger.kernel.org
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use flag output macros in inline asm to allow for better code generation if
the compiler has support for the flag output constraint.
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
To support memory devices under QEMU/KVM, such as virtio-mem,
we have to prepare our kernel virtual address space accordingly and
have to know the highest possible physical memory address we might see
later: the storage limit. The good old SCLP interface is not suitable for
this use case.
In particular, memory owned by memory devices has no relationship to
storage increments, it is always detected using the device driver, and
unaware OSes (no driver) must never try making use of that memory.
Consequently this memory is located outside of the "maximum storage
increment"-indicated memory range.
Let's use our new diag500 STORAGE_LIMIT subcode to query this storage
limit that can exceed the "maximum storage increment", and use the
existing interfaces (i.e., SCLP) to obtain information about the initial
memory that is not owned+managed by memory devices.
If a hypervisor does not support such memory devices, the address exposed
through diag500 STORAGE_LIMIT will correspond to the maximum storage
increment exposed through SCLP.
To teach kdump on s390 to include memory owned by memory devices, there
will be ways to query the relevant memory ranges from the device via a
driver running in special kdump mode (like virtio-mem already implements
to filter /proc/vmcore access so we don't end up reading from unplugged
device blocks).
Update setup_ident_map_size(), to clarify that there can be more than
just online and standby memory.
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20241025141453.1210600-4-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reflect the updated content in the query information UVC to the sysfs at
/sys/firmware/query
* new UV-query sysfs entry for the maximum number of retrievable
secrets the UV can store for one secure guest.
* new UV-query sysfs entry for the maximum number of association
secrets the UV can store for one secure guest.
* max_secrets contains the sum of max association and max retrievable
secrets.
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241024062638.1465970-7-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add a define for the UVC rc 0x0100 that indicates that a UV-call was
successful but may serve more data if called with a larger buffer
again.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20241024062638.1465970-2-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Optimize ftrace and kprobes code patching and avoid stop machine for
kprobes if sequential instruction fetching facility is available
- Add hiperdispatch feature to dynamically adjust CPU capacity in
vertical polarization to improve scheduling efficiency and overall
performance. Also add infrastructure for handling warning track
interrupts (WTI), allowing for graceful CPU preemption
- Rework crypto code pkey module and split it into separate, independent
modules for sysfs, PCKMO, CCA, and EP11, allowing modules to load only
when the relevant hardware is available
- Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
utilizing message-security assist extensions (MSA) 10 and 11. It
introduces new shash implementations for HMAC-SHA224/256/384/512 and
registers the hardware-accelerated AES-XTS cipher as the preferred
option. Also add clear key token support
- Add MSA 10 and 11 processor activity instrumentation counters to perf
and update PAI Extension 1 NNPA counters
- Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
statements
- Add support for SHA3 performance enhancements introduced with MSA 12
- Add support for the query authentication information feature of
MSA 13 and introduce the KDSA CPACF instruction. Provide query and query
authentication information in sysfs, enabling tools like cpacfinfo to
present this data in a human-readable form
- Update kernel disassembler instructions
- Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
kpatch compatibility
- Add missing warning handling and relocated lowcore support to the
early program check handler
- Optimize ftrace_return_address() and avoid calling unwinder
- Make modules use kernel ftrace trampolines
- Strip relocs from the final vmlinux ELF file to make it roughly 2
times smaller
- Dump register contents and call trace for early crashes to the console
- Generate ptdump address marker array dynamically
- Fix rcu_sched stalls that might occur when adding or removing large
amounts of pages at once to or from the CMM balloon
- Fix deadlock caused by recursive lock of the AP bus scan mutex
- Unify sync and async register save areas in entry code
- Cleanup debug prints in crypto code
- Various cleanup and sanitizing patches for the decompressor
- Various small ftrace cleanups
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmbsZawACgkQjYWKoQLX
FBg+Ogf+NiKPfvI14NcTwnOHB6qz8ApPdGfN9bNVtQxtK3epeAvtj0cMonAuKpRg
xckTRRd8y0guhCT7Q2+WitSgA5eYDn+u9/Ux5YuKUdUdXolQ0D64BJNtVeEFkmJj
s+Lesb8cVI9T2VBZOpuF9lJigfsDALBkFroqN4MDudDeahS+qy33bAc0OoqYNXHo
S6OwPK1/tEG9O/oTN2V4mN+aP0B3/dl7Msezb0gfAXQJA+WUAyMNK0RHvoG9uzaa
BWAyWWYABj6woGZEAQAzXcbzkQiRPixTqZVe6e4YndXhIlEnB/Z2AQFdTpT9V7En
eOmmve3QuJa0hkF9q4H/anvOMPntTg==
=Xagq
-----END PGP SIGNATURE-----
Merge tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Optimize ftrace and kprobes code patching and avoid stop machine for
kprobes if sequential instruction fetching facility is available
- Add hiperdispatch feature to dynamically adjust CPU capacity in
vertical polarization to improve scheduling efficiency and overall
performance. Also add infrastructure for handling warning track
interrupts (WTI), allowing for graceful CPU preemption
- Rework crypto code pkey module and split it into separate,
independent modules for sysfs, PCKMO, CCA, and EP11, allowing modules
to load only when the relevant hardware is available
- Add hardware acceleration for HMAC modes and the full AES-XTS cipher,
utilizing message-security assist extensions (MSA) 10 and 11. It
introduces new shash implementations for HMAC-SHA224/256/384/512 and
registers the hardware-accelerated AES-XTS cipher as the preferred
option. Also add clear key token support
- Add MSA 10 and 11 processor activity instrumentation counters to perf
and update PAI Extension 1 NNPA counters
- Cleanup cpu sampling facility code and rework debug/WARN_ON_ONCE
statements
- Add support for SHA3 performance enhancements introduced with MSA 12
- Add support for the query authentication information feature of MSA
13 and introduce the KDSA CPACF instruction. Provide query and query
authentication information in sysfs, enabling tools like cpacfinfo to
present this data in a human-readable form
- Update kernel disassembler instructions
- Always enable EXPOLINE_EXTERN if supported by the compiler to ensure
kpatch compatibility
- Add missing warning handling and relocated lowcore support to the
early program check handler
- Optimize ftrace_return_address() and avoid calling unwinder
- Make modules use kernel ftrace trampolines
- Strip relocs from the final vmlinux ELF file to make it roughly 2
times smaller
- Dump register contents and call trace for early crashes to the
console
- Generate ptdump address marker array dynamically
- Fix rcu_sched stalls that might occur when adding or removing large
amounts of pages at once to or from the CMM balloon
- Fix deadlock caused by recursive lock of the AP bus scan mutex
- Unify sync and async register save areas in entry code
- Cleanup debug prints in crypto code
- Various cleanup and sanitizing patches for the decompressor
- Various small ftrace cleanups
* tag 's390-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (84 commits)
s390/crypto: Display Query and Query Authentication Information in sysfs
s390/crypto: Add Support for Query Authentication Information
s390/crypto: Rework RRE and RRF CPACF inline functions
s390/crypto: Add KDSA CPACF Instruction
s390/disassembler: Remove duplicate instruction format RSY_RDRU
s390/boot: Move boot_printk() code to own file
s390/boot: Use boot_printk() instead of sclp_early_printk()
s390/boot: Rename decompressor_printk() to boot_printk()
s390/boot: Compile all files with the same march flag
s390: Use MARCH_HAS_*_FEATURES defines
s390: Provide MARCH_HAS_*_FEATURES defines
s390/facility: Disable compile time optimization for decompressor code
s390/boot: Increase minimum architecture to z10
s390/als: Remove obsolete comment
s390/sha3: Fix SHA3 selftests failures
s390/pkey: Add AES xts and HMAC clear key token support
s390/cpacf: Add MSA 10 and 11 new PCKMO functions
s390/mm: Add cond_resched() to cmm_alloc/free_pages()
s390/pai_ext: Update PAI extension 1 counters
s390/pai_crypto: Add support for MSA 10 and 11 pai counters
...
Keep the printk code separate from the program check code and move
boot_printk() and helper functions to own printk.c file.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Consistently use boot_printk() everywhere instead of sclp_early_printk() at
some places. For some places it was required (e.g. als.c), in order to stay
in code compiled for the same architecture level, for other places it is
not obvious why sclp_early_printk() was used instead of
decompressor_printk().
Given that the whole decompressor code is compiled for the same
architecture level, there is no requirement left to use different
printk functions.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Rename decompressor_printk() to boot_printk() just to have a shorter
function name, which also makes the code more readable.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Only a couple of files of the decompressor are compiled with the
minimum architecture level. This is problematic for potential function
calls between compile units, especially if a target function is within
a compile until compiled for a higher architecture level, since that
may lead to an unexpected operation exception.
Therefore compile all files of the decompressor for the same (minimum)
architecture level.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The decompressor code is partially compiled with march=z900 so it is
possible to print an error message in case a kernel is booted on a
machine which misses facilities to execute the kernel.
Given that the decompressor code also includes header files from the
core kernel this causes problems for inline assemblies and other code
where the minimum assumed architecture level is set to z10 in the
meantime. If such code is also used in the decompressor (e.g. inline
functions) z900 support must be implemented again.
In order to avoid this and to keep things simple just raise the
minimum architecture level to z10 for the decompressor just like for
the kernel.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The bss section of the decompressor is part of the compressed kernel image
since commit 980d5f9ab3 ("s390/boot: enable .bss section for compressed
kernel").
Remove a now incorrect comment that states that the bss section must not be
accessed.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the past two save areas existed because interrupt handlers
and system call / program check handlers where entered with
interrupts enabled. To prevent a handler from overwriting the
save areas from the previous handler, interrupts used the async
save area, while system call and program check handler used the
sync save area.
Since the removal of critical section cleanup from entry.S, handlers are
entered with interrupts disabled. When the interrupts are re-enabled,
the save area is no longer need. Therefore merge both save areas into one.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since commit 778666df60 ("s390: compile relocatable kernel without
-fPIE") the kernel vmlinux ELF file is linked with --emit-relocs to
preserve all relocations, so that all absolute relocations can be
extracted using the 'relocs' tool to adjust them during boot.
Port and adapt Petr Pavlu's x86 commit 9d9173e9ce ("x86/build: Avoid
relocation information in final vmlinux") to s390 to strip all
relocations from the final vmlinux ELF file to optimize its size.
Following is his original commit message with minor adaptions for s390:
The Linux build process on s390 roughly consists of compiling all input
files, statically linking them into a vmlinux ELF file, and then taking
and turning this file into an actual bzImage bootable file.
vmlinux has in this process two main purposes:
1) It is an intermediate build target on the way to produce the final
bootable image.
2) It is a file that is expected to be used by debuggers and standard
ELF tooling to work with the built kernel.
For the second purpose, a vmlinux file is typically collected by various
package build recipes, such as distribution spec files, including the
kernel's own tar-pkg target.
When building the kernel vmlinux contains also relocation information
produced by using the --emit-relocs linker option. This is utilized by
subsequent build steps to create relocs.S and produce a relocatable
image. However, the information is not needed by debuggers and other
standard ELF tooling.
The issue is then that the collected vmlinux file and hence distribution
packages end up unnecessarily large because of this extra data. The
following is a size comparison of vmlinux v6.10 with and without the
relocation information:
| Configuration | With relocs | Stripped relocs |
| defconfig | 696 MB | 320 MB |
| -CONFIG_DEBUG_INFO | 48 MB | 32 MB |
Optimize a resulting vmlinux by adding a postlink step that splits the
relocation information into relocs.S and then strips it from the vmlinux
binary.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Symbol offsets to the KASLR base do not match symbol address in
the vmlinux image. That is the result of setting the KASLR base
to the beginning of .text section as result of an optimization.
Revert that optimization and allocate virtual memory for the
whole kernel image including __START_KERNEL bytes as per the
linker script. That allows keeping the semantics of the KASLR
base offset in sync with other architectures.
Rename __START_KERNEL to TEXT_OFFSET, since it represents the
offset of the .text section within the kernel image, rather than
a virtual address.
Still skip mapping TEXT_OFFSET bytes to save memory on pgtables
and provoke exceptions in case an attempt to access this area is
made, as no kernel symbol may reside there.
In case CONFIG_KASAN is enabled the location counter might exceed
the value of TEXT_OFFSET, while the decompressor linker script
forcefully resets it to TEXT_OFFSET, which leads to a sections
overlap link failure. Use MAX() expression to avoid that.
Reported-by: Omar Sandoval <osandov@osandov.com>
Closes: https://lore.kernel.org/linux-s390/ZnS8dycxhtXBZVky@telecaster.dhcp.thefacebook.com/
Fixes: 56b1069c40 ("s390/boot: Rework deployment of the kernel image")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When physical memory for the kernel image is allocated it does not
consider extra memory required for offsetting the image start to
match it with the lower 20 bits of KASLR virtual base address. That
might lead to kernel access beyond its memory range.
Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
Fixes: 693d41f7c9 ("s390/mm: Restore mapping of kernel image using large pages")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
SIE instruction performs faster when the virtual address of
SIE block matches the physical one. Pin the identity mapping
base to zero for the benefit of SIE and other instructions
that have similar performance impact. Still, randomize the
base when DEBUG_VM kernel configuration option is enabled.
Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Fix KMSAN build breakage caused by the conflict between s390 and
mm-stable trees
- Add KMSAN page markers for ptdump
- Add runtime constant support
- Fix __pa/__va for modules under non-GPL licenses by exporting necessary
vm_layout struct with EXPORT_SYMBOL to prevent linkage problems
- Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
Counter Facility code when the counter set size is zero
- Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
its functionality by default
- Support allocation of multiple MSI interrupts per device and improve
logging of architecture-specific limitations
- Add support for lowcore relocation as a debugging feature to catch
all null ptr dereferences in the kernel address space, improving
detection beyond the current implementation's limited write access
protection
- Clean up and rework CPU alternatives to allow for callbacks and early
patching for the lowcore relocation
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmajrGUACgkQjYWKoQLX
FBgZjwf/X8suh1Gm2qO47hdGURusOKmEa6GjYaihKzOi2I5yAWVXGsAYX7QtXI4X
fxbKyuGJIvq7LOrIojN1JOCPGkDRztgMddqDJI7WljuRiw6dcd4L5tbaPgCKEv3Q
AQcoq+Aeg1L5xnuNFPdQXl6+Fy2lTFqJCkUl+uW05pGAn2R212dYG3HB41TpwOtJ
Sv2R5+yD9TQCKnHyuCQqaGf7d6SQTcVeBj8zrqVmcyduNK+BYYMOwlJ/UTRzeZEX
3DmQg/TdAkxXf0jZ+vrNILEfHlIvwDAhFjdoAXXL0TX4lx2cHLx9AiqNxYhUprsG
0gutc/nLq2FxhqofoJ0z9TCdb1Ef7w==
=skva
-----END PGP SIGNATURE-----
Merge tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix KMSAN build breakage caused by the conflict between s390 and
mm-stable trees
- Add KMSAN page markers for ptdump
- Add runtime constant support
- Fix __pa/__va for modules under non-GPL licenses by exporting
necessary vm_layout struct with EXPORT_SYMBOL to prevent linkage
problems
- Fix an endless loop in the CF_DIAG event stop in the CPU Measurement
Counter Facility code when the counter set size is zero
- Remove the PROTECTED_VIRTUALIZATION_GUEST config option and enable
its functionality by default
- Support allocation of multiple MSI interrupts per device and improve
logging of architecture-specific limitations
- Add support for lowcore relocation as a debugging feature to catch
all null ptr dereferences in the kernel address space, improving
detection beyond the current implementation's limited write access
protection
- Clean up and rework CPU alternatives to allow for callbacks and early
patching for the lowcore relocation
* tag 's390-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390: Remove protvirt and kvm config guards for uv code
s390/boot: Add cmdline option to relocate lowcore
s390/kdump: Make kdump ready for lowcore relocation
s390/entry: Make system_call() ready for lowcore relocation
s390/entry: Make ret_from_fork() ready for lowcore relocation
s390/entry: Make __switch_to() ready for lowcore relocation
s390/entry: Make restart_int_handler() ready for lowcore relocation
s390/entry: Make mchk_int_handler() ready for lowcore relocation
s390/entry: Make int handlers ready for lowcore relocation
s390/entry: Make pgm_check_handler() ready for lowcore relocation
s390/entry: Add base register to CHECK_VMAP_STACK/CHECK_STACK macro
s390/entry: Add base register to SIEEXIT macro
s390/entry: Add base register to MBEAR macro
s390/entry: Make __sie64a() ready for lowcore relocation
s390/head64: Make startup code ready for lowcore relocation
s390: Add infrastructure to patch lowcore accesses
s390/atomic_ops: Disable flag outputs constraint for GCC versions below 14.2.0
s390/entry: Move SIE indicator flag to thread info
s390/nmi: Simplify ptregs setup
s390/alternatives: Remove alternative facility list
...
- Remove tristate choice support from Kconfig
- Stop using the PROVIDE() directive in the linker script
- Reduce the number of links for the combination of CONFIG_DEBUG_INFO_BTF
and CONFIG_KALLSYMS
- Enable the warning for symbol reference to .exit.* sections by default
- Fix warnings in RPM package builds
- Improve scripts/make_fit.py to generate a FIT image with separate base
DTB and overlays
- Improve choice value calculation in Kconfig
- Fix conditional prompt behavior in choice in Kconfig
- Remove support for the uncommon EMAIL environment variable in Debian
package builds
- Remove support for the uncommon "name <email>" form for the DEBEMAIL
environment variable
- Raise the minimum supported GNU Make version to 4.0
- Remove stale code for the absolute kallsyms
- Move header files commonly used for host programs to scripts/include/
- Introduce the pacman-pkg target to generate a pacman package used in
Arch Linux
- Clean up Kconfig
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmagBLUVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGmoUQAJ8pnURs0g+Rcyk6bdY/qtXBYkS+
nXpIK1ssFgRRgAQdeszYtvBqLFzb0wRCSie87G1AriD/JkVVTjCCY1For1y+vs0u
a7HfxitHhZpPyZW/T+WMQ3LViNccpkx+DFAcoRH8xOY/XPEJKVUby332jOIXMuyg
+NKIELQJVsLhcDofTUGb5VfIQektw219n5c4jKjXdNk4ZtE24xCRM5X528ZebwWJ
RZhMvJ968PyIH1IRXvNt6dsKBxoGIwPP8IO6yW9hzHaNsBqt7MGSChSel7r1VKpk
iwCNApJvEiVBe5wvTSVOVro7/8p/AZ70CQAqnMJV+dNnRqtGqW7NvL6XAjZRJgJJ
Uxe5NSrXgQd3FtqfcbXLetBgp9zGVt328nHm1HXHR5rFsvoOiTvO7hHPbhA+OoWJ
fs+jHzEXdAMRgsNrczPWU5Svq6MgGe4v8HBf0m8N1Uy65t/O+z9ti2QAw7kIFlbu
/VSFNjw4CHmNxGhnH0khCMsy85FwVIt9Ux+2d6IEc0gP8S1Qa1HgHGAoVI4U51eS
9dxEPVJNPOugaIVHheuS3wimEO6wzaJcQHn4IXaasMA7P6Yo4G/jiGoy4cb9qPTM
Hb+GaOltUy7vDoG4D2LSym8zR8rdKwbIf/5psdZrq/IWVKq5p+p7KWs3aOykSoM7
o6Hb532Ioalhm8je
=BYu7
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove tristate choice support from Kconfig
- Stop using the PROVIDE() directive in the linker script
- Reduce the number of links for the combination of CONFIG_KALLSYMS and
CONFIG_DEBUG_INFO_BTF
- Enable the warning for symbol reference to .exit.* sections by
default
- Fix warnings in RPM package builds
- Improve scripts/make_fit.py to generate a FIT image with separate
base DTB and overlays
- Improve choice value calculation in Kconfig
- Fix conditional prompt behavior in choice in Kconfig
- Remove support for the uncommon EMAIL environment variable in Debian
package builds
- Remove support for the uncommon "name <email>" form for the DEBEMAIL
environment variable
- Raise the minimum supported GNU Make version to 4.0
- Remove stale code for the absolute kallsyms
- Move header files commonly used for host programs to scripts/include/
- Introduce the pacman-pkg target to generate a pacman package used in
Arch Linux
- Clean up Kconfig
* tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (65 commits)
kbuild: doc: gcc to CC change
kallsyms: change sym_entry::percpu_absolute to bool type
kallsyms: unify seq and start_pos fields of struct sym_entry
kallsyms: add more original symbol type/name in comment lines
kallsyms: use \t instead of a tab in printf()
kallsyms: avoid repeated calculation of array size for markers
kbuild: add script and target to generate pacman package
modpost: use generic macros for hash table implementation
kbuild: move some helper headers from scripts/kconfig/ to scripts/include/
Makefile: add comment to discourage tools/* addition for kernel builds
kbuild: clean up scripts/remove-stale-files
kconfig: recursive checks drop file/lineno
kbuild: rpm-pkg: introduce a simple changelog section for kernel.spec
kallsyms: get rid of code for absolute kallsyms
kbuild: Create INSTALL_PATH directory if it does not exist
kbuild: Abort make on install failures
kconfig: remove 'e1' and 'e2' macros from expression deduplication
kconfig: remove SYMBOL_CHOICEVAL flag
kconfig: add const qualifiers to several function arguments
kconfig: call expr_eliminate_yn() at least once in expr_eliminate_dups()
...
Removing the CONFIG_PROTECTED_VIRTUALIZATION_GUEST ifdefs and config
option as well as CONFIG_KVM ifdefs in uv files.
Having this configurable has been more of a pain than a help.
It's time to remove the ifdefs and the config option.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Now that everything has been converted, add the option
'relocate_lowcore' to enable relocating the lowcore.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The s390 architecture defines two special per-CPU data pages
called the "prefix area". In s390-linux terminology this is usually
called "lowcore". This memory area contains system configuration
data like old/new PSW's for system call/interrupt/machine check
handlers and lots of other data. It is normally mapped to logical
address 0. This area can only be accessed when in supervisor mode.
This means that kernel code can dereference NULL pointers, because
accesses to address 0 are allowed. Parts of lowcore can be write
protected, but read accesses and write accesses outside of the write
protected areas are not caught.
To remove this limitation for debugging and testing, remap lowcore to
another address and define a function get_lowcore() which simply
returns the address where lowcore is mapped at. This would normally
introduce a pointer dereference (=memory read). As lowcore is used
for several very often used variables, add code to patch this function
during runtime, so we avoid the memory reads.
For C code get_lowcore() has to be used, for assembly code it is
the GET_LC macro. When using this macro/function a reference is added
to alternative patching. All these locations will be patched to the
actual lowcore location when the kernel is booted or a module is loaded.
To make debugging/bisecting problems easier, this patch adds all the
infrastructure but the lowcore address is still hardwired to 0. This
way the code can be converted on a per function basis, and the
functionality is enabled in a patch after all the functions have
been converted.
Note that this requires at least z16 because the old lpsw instruction
only allowed a 12 bit displacement. z16 introduced lpswey which allows
20 bits (signed), so the lowcore can effectively be mapped from
address 0 - 0x7e000. To use 0x7e000 as address, a 6 byte lgfi
instruction would have to be used in the alternative. To save two
bytes, llilh can be used, but this only allows to set bits 16-31 of
the address. In order to use the llilh instruction, use 0x70000 as
alternative lowcore address. This is still large enough to catch
NULL pointer dereferences into large arrays.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add the required code to patch alternatives early in the decompressor.
This is required for the upcoming lowcore relocation changes, where
alternatives for facility 193 need to get patched before lowcore
alternatives.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When allocating a random memory range for .amode31 sections
the minimal randomization address is 0. That does not lead
to a possible overlap with the decompressor image (which also
starts from 0) since by that time the image range is already
reserved.
Do not assume the decompressor range is reserved and always
provide the minimal randomization address for .amode31
sections beyond the decompressor. That is a prerequisite
for moving the lowcore memory address from NULL elsewhere.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
walkers") is known to cause a performance regression
(https://lore.kernel.org/all/3acefad9-96e5-4681-8014-827d6be71c7a@linux.ibm.com/T/#mfa809800a7862fb5bdf834c6f71a3a5113eb83ff).
Yu has a fix which I'll send along later via the hotfixes branch.
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability of
cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of the
zeropage in MAP_SHARED mappings. I don't see any runtime effects here -
more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of
higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the
series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has
simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Is anyone reading this stuff? If so, email me!
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large folio
userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's self
testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are
"mm: memcg: separate legacy cgroup v1 code and put under config
option" and
"mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of excessive
correctable memory errors. In order to permit userspace to monitor and
handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from migrate
folio" teaches the kernel to appropriately handle migration from
poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare
refcount increments. So these paes can first be moved aside if they
reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps
for much faster reading of vma information. The series is "query VMAs
from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance Yang
improves the kernel's presentation of developer information related to
multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and not
very useful feature from slab fault injection.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2C+QAKCRDdBJ7gKXxA
joTkAQDvjqOoFStqk4GU3OXMYB7WCU/ZQMFG0iuu1EEwTVDZ4QEA8CnG7seek1R3
xEoo+vw0sWWeLV3qzsxnCA1BJ8cTJA8=
=z0Lf
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- In the series "mm: Avoid possible overflows in dirty throttling" Jan
Kara addresses a couple of issues in the writeback throttling code.
These fixes are also targetted at -stable kernels.
- Ryusuke Konishi's series "nilfs2: fix potential issues related to
reserved inodes" does that. This should actually be in the
mm-nonmm-stable tree, along with the many other nilfs2 patches. My
bad.
- More folio conversions from Kefeng Wang in the series "mm: convert to
folio_alloc_mpol()"
- Kemeng Shi has sent some cleanups to the writeback code in the series
"Add helper functions to remove repeated code and improve readability
of cgroup writeback"
- Kairui Song has made the swap code a little smaller and a little
faster in the series "mm/swap: clean up and optimize swap cache
index".
- In the series "mm/memory: cleanly support zeropage in
vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David
Hildenbrand has reworked the rather sketchy handling of the use of
the zeropage in MAP_SHARED mappings. I don't see any runtime effects
here - more a cleanup/understandability/maintainablity thing.
- Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling
of higher addresses, for aarch64. The (poorly named) series is
"Restructure va_high_addr_switch".
- The core TLB handling code gets some cleanups and possible slight
optimizations in Bang Li's series "Add update_mmu_tlb_range() to
simplify code".
- Jane Chu has improved the handling of our
fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in
the series "Enhance soft hwpoison handling and injection".
- Jeff Johnson has sent a billion patches everywhere to add
MODULE_DESCRIPTION() to everything. Some landed in this pull.
- In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang
has simplified migration's use of hardware-offload memory copying.
- Yosry Ahmed performs more folio API conversions in his series "mm:
zswap: trivial folio conversions".
- In the series "large folios swap-in: handle refault cases first",
Chuanhua Han inches us forward in the handling of large pages in the
swap code. This is a cleanup and optimization, working toward the end
objective of full support of large folio swapin/out.
- In the series "mm,swap: cleanup VMA based swap readahead window
calculation", Huang Ying has contributed some cleanups and a possible
fixlet to his VMA based swap readahead code.
- In the series "add mTHP support for anonymous shmem" Baolin Wang has
taught anonymous shmem mappings to use multisize THP. By default this
is a no-op - users must opt in vis sysfs controls. Dramatic
improvements in pagefault latency are realized.
- David Hildenbrand has some cleanups to our remaining use of
page_mapcount() in the series "fs/proc: move page_mapcount() to
fs/proc/internal.h".
- David also has some highmem accounting cleanups in the series
"mm/highmem: don't track highmem pages manually".
- Build-time fixes and cleanups from John Hubbard in the series
"cleanups, fixes, and progress towards avoiding "make headers"".
- Cleanups and consolidation of the core pagemap handling from Barry
Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers
and utilize them".
- Lance Yang's series "Reclaim lazyfree THP without splitting" has
reduced the latency of the reclaim of pmd-mapped THPs under fairly
common circumstances. A 10x speedup is seen in a microbenchmark.
It does this by punting to aother CPU but I guess that's a win unless
all CPUs are pegged.
- hugetlb_cgroup cleanups from Xiu Jianfeng in the series
"mm/hugetlb_cgroup: rework on cftypes".
- Miaohe Lin's series "Some cleanups for memory-failure" does just that
thing.
- Someone other than SeongJae has developed a DAMON feature in Honggyu
Kim's series "DAMON based tiered memory management for CXL memory".
This adds DAMON features which may be used to help determine the
efficiency of our placement of CXL/PCIe attached DRAM.
- DAMON user API centralization and simplificatio work in SeongJae
Park's series "mm/damon: introduce DAMON parameters online commit
function".
- In the series "mm: page_type, zsmalloc and page_mapcount_reset()"
David Hildenbrand does some maintenance work on zsmalloc - partially
modernizing its use of pageframe fields.
- Kefeng Wang provides more folio conversions in the series "mm: remove
page_maybe_dma_pinned() and page_mkclean()".
- More cleanup from David Hildenbrand, this time in the series
"mm/memory_hotplug: use PageOffline() instead of PageReserved() for
!ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline()
pages" and permits the removal of some virtio-mem hacks.
- Barry Song's series "mm: clarify folio_add_new_anon_rmap() and
__folio_add_anon_rmap()" is a cleanup to the anon folio handling in
preparation for mTHP (multisize THP) swapin.
- Kefeng Wang's series "mm: improve clear and copy user folio"
implements more folio conversions, this time in the area of large
folio userspace copying.
- The series "Docs/mm/damon/maintaier-profile: document a mailing tool
and community meetup series" tells people how to get better involved
with other DAMON developers. From SeongJae Park.
- A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does
that.
- David Hildenbrand sends along more cleanups, this time against the
migration code. The series is "mm/migrate: move NUMA hinting fault
folio isolation + checks under PTL".
- Jan Kara has found quite a lot of strangenesses and minor errors in
the readahead code. He addresses this in the series "mm: Fix various
readahead quirks".
- SeongJae Park's series "selftests/damon: test DAMOS tried regions and
{min,max}_nr_regions" adds features and addresses errors in DAMON's
self testing code.
- Gavin Shan has found a userspace-triggerable WARN in the pagecache
code. The series "mm/filemap: Limit page cache size to that supported
by xarray" addresses this. The series is marked cc:stable.
- Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations
and cleanup" cleans up and slightly optimizes KSM.
- Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of
code motion. The series (which also makes the memcg-v1 code
Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put
under config option" and "mm: memcg: put cgroup v1-specific memcg
data under CONFIG_MEMCG_V1"
- Dan Schatzberg's series "Add swappiness argument to memory.reclaim"
adds an additional feature to this cgroup-v2 control file.
- The series "Userspace controls soft-offline pages" from Jiaqi Yan
permits userspace to stop the kernel's automatic treatment of
excessive correctable memory errors. In order to permit userspace to
monitor and handle this situation.
- Kefeng Wang's series "mm: migrate: support poison recover from
migrate folio" teaches the kernel to appropriately handle migration
from poisoned source folios rather than simply panicing.
- SeongJae Park's series "Docs/damon: minor fixups and improvements"
does those things.
- In the series "mm/zsmalloc: change back to per-size_class lock"
Chengming Zhou improves zsmalloc's scalability and memory
utilization.
- Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for
pinning memfd folios" makes the GUP code use FOLL_PIN rather than
bare refcount increments. So these paes can first be moved aside if
they reside in the movable zone or a CMA block.
- Andrii Nakryiko has added a binary ioctl()-based API to
/proc/pid/maps for much faster reading of vma information. The series
is "query VMAs from /proc/<pid>/maps".
- In the series "mm: introduce per-order mTHP split counters" Lance
Yang improves the kernel's presentation of developer information
related to multisize THP splitting.
- Michael Ellerman has developed the series "Reimplement huge pages
without hugepd on powerpc (8xx, e500, book3s/64)". This permits
userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault
injection calls" Vlastimil Babka removes a performance-affecting and
not very useful feature from slab fault injection.
* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
mm/mglru: fix ineffective protection calculation
mm/zswap: fix a white space issue
mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
mm/hugetlb: fix possible recursive locking detected warning
mm/gup: clear the LRU flag of a page before adding to LRU batch
mm/numa_balancing: teach mpol_to_str about the balancing mode
mm: memcg1: convert charge move flags to unsigned long long
alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
lib: reuse page_ext_data() to obtain codetag_ref
lib: add missing newline character in the warning message
mm/mglru: fix overshooting shrinker memory
mm/mglru: fix div-by-zero in vmpressure_calc_level()
mm/kmemleak: replace strncpy() with strscpy()
mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
mm: ignore data-race in __swap_writepage
hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
mm: shmem: rename mTHP shmem counters
mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
mm/migrate: putback split folios when numa hint migration fails
...
Setting '-e' flag tells shells to exit with error exit code immediately
after any of commands fails, and causes make(1) to regard recipes as
failed.
Before this, make will still continue to succeed even after the
installation failed, for example, for insufficient permission or
directory does not exist.
Signed-off-by: Zhang Bingwu <xtexchooser@duck.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during failures,
and adding timeout handling for Store Data requests to address potential
firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during topology
updates, as they are unused and also not present on other architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from gaccess.c
to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and swap"
wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of zero
page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore() function
instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmaYGegACgkQjYWKoQLX
FBjCwwf/aRYoLIXCa9/nHGWFiUjZm6xBgVwZh55bXjfNG9TI2J9UZSsYlOFGUJKl
gvD2Ym+LqAejK8R4EUHkfD6ftaKMQuIxNDoedxhwuSpfOQ2mZ5teu0MxTh8QcUAx
4Y2w5XEeCuqE3SuoZ4SJa58K4rGl4cFpPsKNa8ofdzH1ZLFNe8Wqzis4kh0htqLb
FtPj6nsgfzQ5kg14rVkGxCa4CqoFxonXgsA6nH6xZLbxKUInyq8uV44UBQ+aJq5v
dsdzZ5XuAJHN2FpBuuOYQYZYw3XIy/kka7o4EjffORi5SGCRMWO4Zt0P6HXaNkh6
xV8EEO8myeo7rV8dnrk1V4yGjGJmfA==
=3IGY
-----END PGP SIGNATURE-----
Merge tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU
bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during
failures, and adding timeout handling for Store Data requests to
address potential firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during
topology updates, as they are unused and also not present on other
architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output
constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from
gaccess.c to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and
swap" wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of
zero page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore()
function instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
* tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()
s390/kvm: Move bitfields for dat tables
s390/entry: Pass the asce as parameter to sie64a()
s390/sthyi: Use cached data when diag is busy
s390/sthyi: Move diag operations
s390/hypfs_diag: Diag204 busy loop
s390/diag: Add busy-indication-facility requirements
s390/diag: Diag204 add busy return errno
s390/diag: Return errno's from diag204
s390/sclp: Diag204 busy indication facility detection
s390/atomic_ops: Make use of flag output constraint
s390/atomic_ops: Improve __atomic_set() for small values
s390/atomic_ops: Use symbolic names
s390/smp: Switch to GENERIC_CPU_DEVICES
s390/hwcaps: Add documentation for HWCAP flags
s390/pgtable: Make crdte() and cspg() return a value
s390/topology: Remove CPU KOBJ_CHANGE uevents
s390/sclp: Add timeout to Store Data requests
s390/sclp: Prevent release of buffer in I/O
s390/sclp: Suppress unnecessary Store Data warning
...
Add KMSAN support for the s390 implementations of the string functions.
Do this similar to how it's already done for KASAN, except that the
optimized memset{16,32,64}() functions need to be disabled: it's important
for KMSAN to know that they initialized something.
The way boot code is built with regard to string functions is problematic,
since most files think it's configured with sanitizers, but boot/string.c
doesn't. This creates various problems with the memset64() definitions,
depending on whether the code is built with sanitizers or fortify. This
should probably be streamlined, but in the meantime resolve the issues by
introducing the IN_BOOT_STRING_C macro, similar to the existing
IN_ARCH_STRING_C macro.
Link: https://lkml.kernel.org/r/20240621113706.315500-33-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pages for the KMSAN metadata associated with most kernel mappings are
taken from memblock by the common code. However, vmalloc and module
metadata needs to be defined by the architectures.
Be a little bit more careful than x86: allocate exactly MODULES_LEN for
the module shadow and origins, and then take 2/3 of vmalloc for the
vmalloc shadow and origins. This ensures that users passing small
vmalloc= values on the command line do not cause module metadata
collisions.
Link: https://lkml.kernel.org/r/20240621113706.315500-32-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It should be possible to have inline functions in the s390 header files,
which call kmsan_unpoison_memory(). The problem is that these header
files might be included by the decompressor, which does not contain KMSAN
runtime, causing linker errors.
Not compiling these calls if __SANITIZE_MEMORY__ is not defined - either
by changing kmsan-checks.h or at the call sites - may cause unintended
side effects, since calling these functions from an uninstrumented code
that is linked into the kernel is valid use case.
One might want to explicitly distinguish between the kernel and the
decompressor. Checking for a decompressor-specific #define is quite
heavy-handed, and will have to be done at all call sites.
A more generic approach is to provide a dummy kmsan_unpoison_memory()
definition. This produces some runtime overhead, but only when building
with CONFIG_KMSAN. The benefit is that it does not disturb the existing
KMSAN build logic and call sites don't need to be changed.
Link: https://lkml.kernel.org/r/20240621113706.315500-25-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All other sanitizers are disabled for boot as well. While at it, add a
comment explaining why we need this.
Link: https://lkml.kernel.org/r/20240621113706.315500-23-iii@linux.ibm.com
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <kasan-dev@googlegroups.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Since commit 778666df60 ("s390: compile relocatable kernel without
-fPIE") and commit 00cda11d3b ("s390: Compile kernel with -fPIC and
link with -no-pie") the kernel on s390x may have a Global Offset Table
(GOT) whose entries are adjusted for KASLR in kaslr_adjust_got().
The GOT may contain entries for undefined weak symbols that resolved to
zero. That is the resulting GOT entry value is zero. Adjusting those
entries unconditionally in kaslr_adjust_got() is wrong. Otherwise the
following sample code would erroneously assume foo to be defined, due to
the adjustment changing the zero-value to a non-zero one:
extern int foo __attribute__((weak));
if (*foo)
/* foo is defined [or undefined and erroneously adjusted] */
The vmlinux build at commit 00cda11d3b ("s390: Compile kernel with
-fPIC and link with -no-pie") with defconfig actually had two GOT
entries for the undefined weak symbols __start_BTF and __stop_BTF:
$ objdump -tw vmlinux | grep -F "*UND*"
0000000000000000 w *UND* 0000000000000000 __stop_BTF
0000000000000000 w *UND* 0000000000000000 __start_BTF
$ readelf -rw vmlinux | grep -E "R_390_GOTENT +0{16}"
000000345760 2776a0000001a R_390_GOTENT 0000000000000000 __stop_BTF + 2
000000345766 2d5480000001a R_390_GOTENT 0000000000000000 __start_BTF + 2
The s390-specific vmlinux linker script sets the section start to
__START_KERNEL, which is currently defined as 0x100000 on s390x. Access
to lowcore is performed via a pointer of 0 and not a symbol in a section
starting at 0. The first 64K are reserved for the loader on s390x. Thus
it is safe to assume that __START_KERNEL will never be 0. As a result
there cannot be any defined symbols resolving to zero in the kernel.
Note that the first three GOT entries are reserved for the dynamic
loader on s390x. [1] In the kernel they are zero. Therefore no extra
handling is required to skip these.
Skip adjusting GOT entries with a value of zero in kaslr_adjust_got().
While at it update the comment when a GOT exists on s390x. Since commit
00cda11d3b ("s390: Compile kernel with -fPIC and link with -no-pie")
it no longer only exists when compiling with Clang, but also with GCC.
[1]: s390x ELF ABI, section "Global Offset Table",
https://github.com/IBM/s390x-abi/releases
Fixes: 778666df60 ("s390: compile relocatable kernel without -fPIE")
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Replace all S390_lowcore usages in arch/s390/boot by get_lowcore().
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since physical and virtual kernel address spaces are uncoupled
the kernel image is not mapped using large segment pages anymore,
which is a regression.
Put the kernel image at the same large segment page offset in
physical memory as in virtual memory. Such approach preserves
the existing number of bits of entropy used for randomization
of the kernel location in virtual memory when KASLR is on.
As result, the kernel is mapped using large segment pages.
Fixes: c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Do not allow creation of large pages against physical addresses,
which itself are not aligned on the correct boundary. Failure to
do so might lead to referencing wrong memory as result of the way
DAT works.
Fixes: c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
It is nowhere used in the decompressor, therefore remove it.
Fixes: 17e89e1340 ("s390/facilities: move stfl information from lowcore to global data")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When the kernel is built with CONFIG_PIE_BUILD option enabled it
uses dynamic symbols, for which the linker does not allow more
than 64K number of entries. This can break features like kpatch.
Hence, whenever possible the kernel is built with CONFIG_PIE_BUILD
option disabled. For that support of unaligned symbols generated by
linker scripts in the compiler is necessary.
However, older compilers might lack such support. In that case the
build process resorts to CONFIG_PIE_BUILD option-enabled build.
Compile object files with -fPIC option and then link the kernel
binary with -no-pie linker option.
As result, the dynamic symbols are not generated and not only kpatch
feature succeeds, but also the whole CONFIG_PIE_BUILD option-enabled
code could be dropped.
[ agordeev: Reworded the commit message ]
Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The .vmlinux.relocs section is moved in front of the compressed
kernel. The interim section rescue step is avoided as result.
Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework deployment of kernel image for both compressed and
uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED
kernel configuration variable.
In case CONFIG_KERNEL_UNCOMPRESSED is disabled avoid uncompressing
the kernel to a temporary buffer and copying it to the target
address. Instead, uncompress it directly to the target destination.
In case CONFIG_KERNEL_UNCOMPRESSED is enabled avoid moving the
kernel to default 0x100000 location when KASLR is disabled or
failed. Instead, use the uncompressed kernel image directly.
In case KASLR is disabled or failed .amode31 section location in
memory is not randomized and precedes the kernel image. In case
CONFIG_KERNEL_UNCOMPRESSED is disabled that location overlaps the
area used by the decompression algorithm. That is fine, since that
area is not used after the decompression finished and the size of
.amode31 section is not expected to exceed BOOT_HEAP_SIZE ever.
There is no decompression in case CONFIG_KERNEL_UNCOMPRESSED is
enabled. Therefore, rename decompress_kernel() to deploy_kernel(),
which better describes both uncompressed and compressed cases.
Introduce AMODE31_SIZE macro to avoid immediate value of 0x3000
(the size of .amode31 section) in the decompressor linker script.
Modify the vmlinux linker script to force the size of .amode31
section to AMODE31_SIZE (the value of (_eamode31 - _samode31)
could otherwise differ as result of compiler options used).
Introduce __START_KERNEL macro that defines the kernel ELF image
entry point and set it to the currrent value of 0x100000.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Since kernel virtual and physical address spaces are
uncoupled the kernel is mapped at the top of the virtual
address space in case KASLR is disabled.
That does not pose any issue with regard to the kernel
booting and operation, but makes it difficult to use a
generated vmlinux with some debugging tools (e.g. gdb),
because the exact location of the kernel image in virtual
memory is unknown. Make that location known and introduce
CONFIG_KERNEL_IMAGE_BASE configuration option.
A custom CONFIG_KERNEL_IMAGE_BASE value that would break
the virtual memory layout leads to a build error.
The kernel image size is defined by KERNEL_IMAGE_SIZE
macro and set to 512 MB, by analogy with x86.
Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The uncoupling physical vs virtual address spaces brings
the following benefits to s390:
- virtual memory layout flexibility;
- closes the address gap between kernel and modules, it
caused s390-only problems in the past (e.g. 'perf' bugs);
- allows getting rid of trampolines used for module calls
into kernel;
- allows simplifying BPF trampoline;
- minor performance improvement in branch prediction;
- kernel randomization entropy is magnitude bigger, as it is
derived from the amount of available virtual, not physical
memory;
The whole change could be described in two pictures below:
before and after the change.
Some aspects of the virtual memory layout setup are not
clarified (number of page levels, alignment, DMA memory),
since these are not a part of this change or secondary
with regard to how the uncoupling itself is implemented.
The focus of the pictures is to explain why __va() and __pa()
macros are implemented the way they are.
Memory layout in V==R mode:
| Physical | Virtual |
+- 0 --------------+- 0 --------------+ identity mapping start
| | S390_lowcore | Low-address memory
| +- 8 KB -----------+
| | |
| | identity | phys == virt
| | mapping | virt == phys
| | |
+- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
|.amode31 text/data|.amode31 text/data|
+- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt start
| | |
| | |
+- __kaslr_offset, __kaslr_offset_phys| kernel rand. phys/virt start
| | |
| kernel text/data | kernel text/data | phys == kvirt
| | |
+------------------+------------------+ kernel phys/virt end
| | |
| | |
| | |
| | |
+- ident_map_size -+- ident_map_size -+ identity mapping end
| |
| ... unused gap |
| |
+---- vmemmap -----+ 'struct page' array start
| |
| virtually mapped |
| memory map |
| |
+- __abs_lowcore --+
| |
| Absolute Lowcore |
| |
+- __memcpy_real_area
| |
| Real Memory Copy|
| |
+- VMALLOC_START --+ vmalloc area start
| |
| vmalloc area |
| |
+- MODULES_VADDR --+ modules area start
| |
| modules area |
| |
+------------------+ UltraVisor Secure Storage limit
| |
| ... unused gap |
| |
+KASAN_SHADOW_START+ KASAN shadow memory start
| |
| KASAN shadow |
| |
+------------------+ ASCE limit
Memory layout in V!=R mode:
| Physical | Virtual |
+- 0 --------------+- 0 --------------+
| | S390_lowcore | Low-address memory
| +- 8 KB -----------+
| | |
| | |
| | ... unused gap |
| | |
+- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start
|.amode31 text/data|.amode31 text/data|
+- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB)
| | |
| | |
+- __kaslr_offset_phys | kernel rand. phys start
| | |
| kernel text/data | |
| | |
+------------------+ | kernel phys end
| | |
| | |
| | |
| | |
+- ident_map_size -+ |
| |
| ... unused gap |
| |
+- __identity_base + identity mapping start (>= 2GB)
| |
| identity | phys == virt - __identity_base
| mapping | virt == phys + __identity_base
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
+---- vmemmap -----+ 'struct page' array start
| |
| virtually mapped |
| memory map |
| |
+- __abs_lowcore --+
| |
| Absolute Lowcore |
| |
+- __memcpy_real_area
| |
| Real Memory Copy|
| |
+- VMALLOC_START --+ vmalloc area start
| |
| vmalloc area |
| |
+- MODULES_VADDR --+ modules area start
| |
| modules area |
| |
+- __kaslr_offset -+ kernel rand. virt start
| |
| kernel text/data | phys == (kvirt - __kaslr_offset) +
| | __kaslr_offset_phys
+- kernel .bss end + kernel rand. virt end
| |
| ... unused gap |
| |
+------------------+ UltraVisor Secure Storage limit
| |
| ... unused gap |
| |
+KASAN_SHADOW_START+ KASAN shadow memory start
| |
| KASAN shadow |
| |
+------------------+ ASCE limit
Unused gaps in the virtual memory layout could be present
or not - depending on how partucular system is configured.
No page tables are created for the unused gaps.
The relative order of vmalloc, modules and kernel image in
virtual memory is defined by following considerations:
- start of the modules area and end of the kernel should reside
within 4GB to accommodate relative 32-bit jumps. The best way
to achieve that is to place kernel next to modules;
- vmalloc and module areas should locate next to each other
to prevent failures and extra reworks in user level tools
(makedumpfile, crash, etc.) which treat vmalloc and module
addresses similarily;
- kernel needs to be the last area in the virtual memory
layout to easily distinguish between kernel and non-kernel
virtual addresses. That is needed to (again) simplify
handling of addresses in user level tools and make __pa()
macro faster (see below);
Concluding the above, the relative order of the considered
virtual areas in memory is: vmalloc - modules - kernel.
Therefore, the only change to the current memory layout is
moving kernel to the end of virtual address space.
With that approach the implementation of __pa() macro is
straightforward - all linear virtual addresses less than
kernel base are considered identity mapping:
phys == virt - __identity_base
All addresses greater than kernel base are kernel ones:
phys == (kvirt - __kaslr_offset) + __kaslr_offset_phys
By contrast, __va() macro deals only with identity mapping
addresses:
virt == phys + __identity_base
.amode31 section is mapped separately and is not covered by
__pa() macro. In fact, it could have been handled easily by
checking whether a virtual address is within the section or
not, but there is no need for that. Thus, let __pa() code
do as little machine cycles as possible.
The KASAN shadow memory is located at the very end of the
virtual memory layout, at addresses higher than the kernel.
However, that is not a linear mapping and no code other than
KASAN instrumentation or API is expected to access it.
When KASLR mode is enabled the kernel base address randomized
within a memory window that spans whole unused virtual address
space. The size of that window depends from the amount of
physical memory available to the system, the limit imposed by
UltraVisor (if present) and the vmalloc area size as provided
by vmalloc= kernel command line parameter.
In case the virtual memory is exhausted the minimum size of
the randomization window is forcefully set to 2GB, which
amounts to in 15 bits of entropy if KASAN is enabled or 17
bits of entropy in default configuration.
The default kernel offset 0x100000 is used as a magic value
both in the decompressor code and vmlinux linker script, but
it will be removed with a follow-up change.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Currently __kaslr_offset is the kernel offset in both
physical memory on boot and in virtual memory after DAT
mode is enabled.
Uncouple these offsets and rename the physical address
space variant to __kaslr_offset_phys while keep the name
__kaslr_offset for the offset in virtual address space.
Do not use __kaslr_offset_phys after DAT mode is enabled
just yet, but still make it a persistent boot variable
for later use.
Use __kaslr_offset and __kaslr_offset_phys offsets in
proper contexts and alter handle_relocs() function to
distinguish between the two.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Put virtual memory layout information into a structure
to improve code generation when accessing the structure
members, which are currently only ident_map_size and
__kaslr_offset.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a preparatory rework to allow uncoupling virtual
and physical addresses spaces.
Currently the order of virtual memory areas is (the lowcore
and .amode31 section are skipped, as it is irrelevant):
identity mapping (the kernel is contained within)
vmemmap
vmalloc
modules
Absolute Lowcore
Real Memory Copy
In the future the kernel will be mapped separately and placed
to the end of the virtual address space, so the layout would
turn like this:
identity mapping
vmemmap
vmalloc
modules
Absolute Lowcore
Real Memory Copy
kernel
However, the distance between kernel and modules needs to be as
little as possible, ideally - none. Thus, the Absolute Lowcore
and Real Memory Copy areas would stay in the way and therefore
need to be moved as well:
identity mapping
vmemmap
Absolute Lowcore
Real Memory Copy
vmalloc
modules
kernel
To facilitate such layout swap the vmalloc and Absolute Lowcore
together with Real Memory Copy areas. As result, the current
layout turns into:
identity mapping (the kernel is contained within)
vmemmap
Absolute Lowcore
Real Memory Copy
vmalloc
modules
This will allow to locate the kernel directly next to the
modules once it gets mapped separately.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In case vmemmap array could overlap with vmalloc area on
virtual memory layout setup, the size of vmalloc area
is decreased. That could result in less memory than user
requested with vmalloc= kernel command line parameter.
Instead, reduce the size of identity mapping (and the
size of vmemmap array as result) to avoid such overlap.
Further, currently the virtual memmory allocation "rolls"
from top to bottom and it is only VMALLOC_START that could
get increased due to the overlap. Change that to decrease-
only, which makes the whole allocation algorithm more easy
to comprehend.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The maximum mappable physical address (as returned by
arch_get_mappable_range() callback) is limited by the
value of (1UL << MAX_PHYSMEM_BITS).
The maximum physical address available to a DCSS segment
is 512GB.
In case the available online or offline memory size is less
than the DCSS limit arch_get_mappable_range() would include
never used [512GB..(1UL << MAX_PHYSMEM_BITS)] range.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
vmemmap is forcefully set to start at MAX_PHYSMEM_BITS at most.
That could be needed in the past to limit ident_map_size to
MAX_PHYSMEM_BITS. However since commit 75eba6ec0de1 ("s390:
unify identity mapping limits handling") ident_map_size is
limited in setup_ident_map_size() function, which is called
earlier.
Another reason to limit vmemmap start to MAX_PHYSMEM_BITS is
because it was returned by arch_get_mappable_range() as the
maximum mappable physical address. Since commit f641679dfe55
("s390/mm: rework arch_get_mappable_range() callback") that
is not required anymore.
As result, there is no neccessity to limit vmemmap starting
address with MAX_PHYSMEM_BITS.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Common pattern in non-verbose build output for quiet commands is that the
shorthand of a command including whitespace contains at least eight
characters. Adjust this for the RELOCS command, which comes only with seven
characters.
Before:
SORTTAB vmlinux
CC arch/s390/boot/version.o
RELOCS arch/s390/boot/relocs.S
OBJCOPY arch/s390/boot/info.bin
After:
SORTTAB vmlinux
CC arch/s390/boot/version.o
RELOCS arch/s390/boot/relocs.S
OBJCOPY arch/s390/boot/info.bin
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm:
zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is hotplugged
as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving policy
wherein we allocate memory across nodes in a weighted fashion rather
than uniformly. This is beneficial in heterogeneous memory environments
appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the process
has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown situations.
The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan
Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's series
"Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page faults.
He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test",
Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in
our handling of DAX on archiecctures which have virtually aliasing data
caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic
improvements in worst-case mmap_lock hold times during certain
userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements
in his series "Mitigate a vmap lock contention". It realizes a 12x
improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging of
large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages() to
an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which are
configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA
joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx
TMNhHfyiHYDTx/GAV9NXW84tasJSDgA=
=TG55
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series
"implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
"Convert memcontrol charge moving to use folios"
"mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing
significant reductions in system time and modest but measurable
reductions in overall runtimes. The series is "mm/zswap: optimize the
scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap
lru list" which provides measurable runtime benefits in some
swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap:
optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series
"mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has
contributed several DAX cleanups as well as adding a sysfs tunable to
control the memmap_on_memory setting when the dax device is
hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups",
which does that.
- More DAMON work from SeongJae Park in the series
"mm/damon: make DAMON debugfs interface deprecation unignorable"
"selftests/damon: add more tests for core functionalities and corner cases"
"Docs/mm/damon: misc readability improvements"
"mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs
extension" Rakie Kim has developed a new mempolicy interleaving
policy wherein we allocate memory across nodes in a weighted fashion
rather than uniformly. This is beneficial in heterogeneous memory
environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work
against the ARM pagetable dumping code in the series "mm: ptdump:
Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the
series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its
human-readable output conform to the TAP ("Test Anything Protocol")
format. Amongst other things, this opens up the use of third-party
tools to parse and process out selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the
series "mm/memory: optimize fork() with PTE-mapped THP". Mainly
targeted at arm64, this significantly speeds up fork() when the
process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his
series "mm/memory: optimize unmap/zap with PTE-mapped THP". It
implements batching during munmap() and other pte teardown
situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings"
Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte
mappings"). Kernel build times on arm64 improved nicely. Ryan's
series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has
fixed an obscure hugetlb race which was causing unnecessary page
faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction
test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and
refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend
zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX
regression" Mathieu Desnoyers has cleaned up and fixed rather a mess
in our handling of DAX on archiecctures which have virtually aliasing
data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides
dramatic improvements in worst-case mmap_lock hold times during
certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador
in his series
"page_owner: print stacks and their outstanding allocations"
"page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability
improvements in his series "Mitigate a vmap lock contention". It
realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split
crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
"mm/zsmalloc: fix and optimize objects/page migration"
"mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than
order=0. This a step along the path to implementaton of the merging
of large anonymous folios. The series is named "Enable >0 order folio
memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the
pagecache writeback code in his series "convert write_cache_pages()
to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series
"Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages
into sizes other than order-0 to better utilize large folios. The
series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove
total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory
freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot"
provides large improvements in bootup times on large machines which
are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that
also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series
"mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his
series "selftests/mm: Improve Hugepage Test Handling in MM
Selftests".
- Also, of course, many singleton patches to many things. Please see
the individual changelogs for details.
* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
mm/zswap: remove the memcpy if acomp is not sleepable
crypto: introduce: acomp_is_async to expose if comp drivers might sleep
memtest: use {READ,WRITE}_ONCE in memory scanning
mm: prohibit the last subpage from reusing the entire large folio
mm: recover pud_leaf() definitions in nopmd case
selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
selftests/mm: skip uffd hugetlb tests with insufficient hugepages
selftests/mm: dont fail testsuite due to a lack of hugepages
mm/huge_memory: skip invalid debugfs new_order input for folio split
mm/huge_memory: check new folio order when split a folio
mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
mm: fix list corruption in put_pages_list
mm: remove folio from deferred split list before uncharging it
filemap: avoid unnecessary major faults in filemap_fault()
mm,page_owner: drop unnecessary check
mm,page_owner: check for null stack_record before bumping its refcount
mm: swap: fix race between free_swap_and_cache() and swapoff()
mm/treewide: align up pXd_leaf() retval across archs
mm/treewide: drop pXd_large()
...
pud_large() is always defined as pud_leaf(). Merge their usages. Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.
Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
pmd_large() is always defined as pmd_leaf(). Merge their usages. Chose
pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.
Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The relocation table is not expected to contain a zero-termination
entry. The existing check is likely a left-over from similar x86
code that uses zero-entries as delimiters. s390 does not have ones
and therefore the check could be avoided.
Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make the type of __vmlinux_relocs_64_start|end symbols as
char array, just like it is done for all other sections.
Function rescue_relocs() is simplified as result.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Do not use vmlinux.image_size within kaslr_adjust_relocs() function
to calculate the upper relocation table boundary. Instead, make both
lower and upper boundaries the function input parameters.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The end of GOT is calculated dynamically on boot. The size of GOT
is calculated on build from the start and end of GOT. Avoid both
calculations and use the end of GOT directly.
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Naresh reported this build error on linux-next:
s390x-linux-gnu-ld: Unexpected GOT/PLT entries detected!
make[3]: *** [/builds/linux/arch/s390/boot/Makefile:87:
arch/s390/boot/vmlinux.syms] Error 1
make[3]: Target 'arch/s390/boot/bzImage' not remade because of errors.
The reason for the build error is an incorrect/incomplete assertion which
checks the size of the .got.plt section. Similar to x86 the size is either
zero or 24 bytes (three entries).
See commit 262b5cae67 ("x86/boot/compressed: Move .got.plt entries out of
the .got section") for more details. The three reserved/additional entries
for s390 are described in chapter 3.2.2 of the s390x ABI [1] (thanks to
Andreas Krebbel for pointing this out!).
[1] https://github.com/IBM/s390x-abi/releases/download/v1.6.1/lzsabi_s390x.pdf
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYvWp8TY-fMEvc3UhoVtoR_eM5VsfHj3+n+kexcfJJ+Cvw@mail.gmail.com
Fixes: 30226853d6 ("s390: vmlinux.lds.S: explicitly handle '.got' and '.plt' sections")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with OBJDUMP=llvm-objdump, there are a series of warnings
from the section comparisons that arch/s390/boot/Makefile performs
between vmlinux and arch/s390/boot/vmlinux:
llvm-objdump: warning: section '.boot.preserved.data' mentioned in a -j/--section option, but not found in any input file
llvm-objdump: warning: section '.boot.data' mentioned in a -j/--section option, but not found in any input file
llvm-objdump: warning: section '.boot.preserved.data' mentioned in a -j/--section option, but not found in any input file
llvm-objdump: warning: section '.boot.data' mentioned in a -j/--section option, but not found in any input file
The warning is a little misleading, as these sections do exist in the
input files. It is really pointing out that llvm-objdump does not match
GNU objdump's behavior of respecting '-j' / '--section' in combination
with '-t' / '--syms':
$ s390x-linux-gnu-objdump -t -j .boot.data vmlinux.full
vmlinux.full: file format elf64-s390
SYMBOL TABLE:
0000000001951000 l O .boot.data 0000000000003000 sclp_info_sccb
00000000019550e0 l O .boot.data 0000000000000001 sclp_info_sccb_valid
00000000019550e2 g O .boot.data 0000000000001000 early_command_line
...
$ llvm-objdump -t -j .boot.data vmlinux.full
vmlinux.full: file format elf64-s390
SYMBOL TABLE:
0000000000100040 l O .text 0000000000000010 dw_psw
0000000000000000 l df *ABS* 0000000000000000 main.c
00000000001001b0 l F .text 00000000000000c6 trace_event_raw_event_initcall_level
0000000000100280 l F .text 0000000000000100 perf_trace_initcall_level
...
It may be possible to change llvm-objdump's behavior to match GNU
objdump's behavior but the difficulty of that task has not yet been
explored. The combination of '$(OBJDUMP) -t -j' is not common in the
kernel tree on a whole, so workaround this tool difference by grepping
for the sections in the full symbol table output in a similar manner to
the sed invocation. This results in no visible change for GNU objdump
users while fixing the warnings for OBJDUMP=llvm-objdump, further
enabling use of LLVM=1 for ARCH=s390 with versions of LLVM that have
support for s390 in ld.lld and llvm-objcopy.
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Closes: https://lore.kernel.org/20240219113248.16287-C-hca@linux.ibm.com/
Link: https://github.com/ClangBuiltLinux/linux/issues/859
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240220-s390-work-around-llvm-objdump-t-j-v1-1-47bb0366a831@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
On s390, currently kernel uses the '-fPIE' compiler flag for compiling
vmlinux. This has a few problems:
- It uses dynamic symbols (.dynsym), for which the linker refuses to
allow more than 64k sections. This can break features which use
'-ffunction-sections' and '-fdata-sections', including kpatch-build
[1] and Function Granular KASLR.
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses.
Instead of using '-fPIE', resolve all the relocations at link time and
then manually adjust any absolute relocations (R_390_64) during boot.
This is done by first telling the linker to preserve all relocations
during the vmlinux link. (Note this is harmless: they are later
stripped in the vmlinux.bin link.)
Then use the 'relocs' tool to find all absolute relocations (R_390_64)
which apply to allocatable sections. The offsets of those relocations
are saved in a special section which is then used to adjust the
relocations during boot.
(Note: For some reason, Clang occasionally creates a GOT reference, even
without '-fPIE'. So Clang-compiled kernels have a GOT, which needs to
be adjusted.)
On my mostly-defconfig kernel, this reduces kernel text size by ~1.3%.
[1] https://github.com/dynup/kpatch/issues/1284
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622872.html
[3] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/625986.html
Compiler consideration:
Gcc recently implemented an optimization [2] for loading symbols without
explicit alignment, aligning with the IBM Z ELF ABI. This ABI mandates
symbols to reside on a 2-byte boundary, enabling the use of the larl
instruction. However, kernel linker scripts may still generate unaligned
symbols. To address this, a new -munaligned-symbols option has been
introduced [3] in recent gcc versions. This option has to be used with
future gcc versions.
Older Clang lacks support for handling unaligned symbols generated
by kernel linker scripts when the kernel is built without -fPIE. However,
future versions of Clang will include support for the -munaligned-symbols
option. When the support is unavailable, compile the kernel with -fPIE
to maintain the existing behavior.
In addition to it:
move vmlinux.relocs to safe relocation
When the kernel is built with CONFIG_KERNEL_UNCOMPRESSED, the entire
uncompressed vmlinux.bin is positioned in the bzImage decompressor
image at the default kernel LMA of 0x100000, enabling it to be executed
in-place. However, the size of .vmlinux.relocs could be large enough to
cause an overlap with the uncompressed kernel at the address 0x100000.
To address this issue, .vmlinux.relocs is positioned after the
.rodata.compressed in the bzImage. Nevertheless, in this configuration,
vmlinux.relocs will overlap with the .bss section of vmlinux.bin. To
overcome that, move vmlinux.relocs to a safe location before clearing
.bss and handling relocs.
Compile warning fix from Sumanth Korikkar:
When kernel is built with CONFIG_LD_ORPHAN_WARN and -fno-PIE, there are
several warnings:
ld: warning: orphan section `.rela.iplt' from
`arch/s390/kernel/head64.o' being placed in section `.rela.dyn'
ld: warning: orphan section `.rela.head.text' from
`arch/s390/kernel/head64.o' being placed in section `.rela.dyn'
ld: warning: orphan section `.rela.init.text' from
`arch/s390/kernel/head64.o' being placed in section `.rela.dyn'
ld: warning: orphan section `.rela.rodata.cst8' from
`arch/s390/kernel/head64.o' being placed in section `.rela.dyn'
Orphan sections are sections that exist in an object file but don't have
a corresponding output section in the final executable. ld raises a
warning when it identifies such sections.
Eliminate the warning by placing all .rela orphan sections in .rela.dyn
and raise an error when size of .rela.dyn is greater than zero. i.e.
Dont just neglect orphan sections.
This is similar to adjustment performed in x86, where kernel is built
with -fno-PIE.
commit 5354e84598 ("x86/build: Add asserts for unwanted sections")
[sumanthk@linux.ibm.com: rebased Josh Poimboeuf patches and move
vmlinux.relocs to safe location]
[hca@linux.ibm.com: merged compile warning fix from Sumanth]
Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Link: https://lore.kernel.org/r/20240219132734.22881-4-sumanthk@linux.ibm.com
Link: https://lore.kernel.org/r/20240219132734.22881-5-sumanthk@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When attempting to boot a kernel compiled with OBJCOPY=llvm-objcopy,
there is a crash right at boot:
Out of memory allocating 6d7800 bytes 8 aligned in range 0:20000000
Reserved memory ranges:
0000000000000000 a394c3c30d90cdaf DECOMPRESSOR
Usable online memory ranges (info source: sclp read info [3]):
0000000000000000 0000000020000000
Usable online memory total: 20000000 Reserved: a394c3c30d90cdaf Free: 0
Call Trace:
(sp:0000000000033e90 [<0000000000012fbc>] physmem_alloc_top_down+0x5c/0x104)
sp:0000000000033f00 [<0000000000011d56>] startup_kernel+0x3a6/0x77c
sp:0000000000033f60 [<00000000000100f4>] startup_normal+0xd4/0xd4
GNU objcopy does not have any issues. Looking at differences between the
object files in each build reveals info.bin does not get properly
populated with llvm-objcopy, which results in an empty .vmlinux.info
section.
$ file {gnu,llvm}-objcopy/arch/s390/boot/info.bin
gnu-objcopy/arch/s390/boot/info.bin: data
llvm-objcopy/arch/s390/boot/info.bin: empty
$ llvm-readelf --section-headers {gnu,llvm}-objcopy/arch/s390/boot/vmlinux | rg 'File:|\.vmlinux\.info|\.decompressor\.syms'
File: gnu-objcopy/arch/s390/boot/vmlinux
[12] .vmlinux.info PROGBITS 0000000000034000 035000 000078 00 WA 0 0 1
[13] .decompressor.syms PROGBITS 0000000000034078 035078 000b00 00 WA 0 0 1
File: llvm-objcopy/arch/s390/boot/vmlinux
[12] .vmlinux.info PROGBITS 0000000000034000 035000 000000 00 WA 0 0 1
[13] .decompressor.syms PROGBITS 0000000000034000 035000 000b00 00 WA 0 0 1
Ulrich points out that llvm-objcopy only copies sections marked as alloc
with a binary output target, whereas the .vmlinux.info section is only
marked as load. Add 'alloc' in addition to 'load', so that both objcopy
implementations work properly:
$ file {gnu,llvm}-objcopy/arch/s390/boot/info.bin
gnu-objcopy/arch/s390/boot/info.bin: data
llvm-objcopy/arch/s390/boot/info.bin: data
$ llvm-readelf --section-headers {gnu,llvm}-objcopy/arch/s390/boot/vmlinux | rg 'File:|\.vmlinux\.info|\.decompressor\.syms'
File: gnu-objcopy/arch/s390/boot/vmlinux
[12] .vmlinux.info PROGBITS 0000000000034000 035000 000078 00 WA 0 0 1
[13] .decompressor.syms PROGBITS 0000000000034078 035078 000b00 00 WA 0 0 1
File: llvm-objcopy/arch/s390/boot/vmlinux
[12] .vmlinux.info PROGBITS 0000000000034000 035000 000078 00 WA 0 0 1
[13] .decompressor.syms PROGBITS 0000000000034078 035078 000b00 00 WA 0 0 1
Closes: https://github.com/ClangBuiltLinux/linux/issues/1996
Link: 3c02cb7492
Suggested-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240216-s390-fix-boot-with-llvm-objcopy-v1-1-0ac623daf42b@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there are several series of warnings
from the various discardable sections that the kernel adds for build
purposes that are not needed at runtime:
s390-linux-ld: warning: orphan section `.export_symbol' from `arch/s390/boot/decompressor.o' being placed in section `.export_symbol'
s390-linux-ld: warning: orphan section `.discard.addressable' from `arch/s390/boot/decompressor.o' being placed in section `.discard.addressable'
s390-linux-ld: warning: orphan section `.modinfo' from `arch/s390/boot/decompressor.o' being placed in section `.modinfo'
include/asm-generic/vmlinux.lds.h has a macro for easily discarding
these sections across the kernel named COMMON_DISCARDS, use it to clear
up the warnings.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-9-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there is a warning around the '.comment'
section for each file in arch/s390/boot
s390-linux-ld: warning: orphan section `.comment' from `arch/s390/boot/als.o' being placed in section `.comment'
s390-linux-ld: warning: orphan section `.comment' from `arch/s390/boot/startup.o' being placed in section `.comment'
s390-linux-ld: warning: orphan section `.comment' from `arch/s390/boot/physmem_info.o' being placed in section `.comment'
include/asm-generic/vmlinux.lds.h has a macro for required ELF sections
not related to debugging named ELF_DETAILS, use it to clear up the
warnings.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-8-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there are several series of warnings for
each file in arch/s390/boot due to the boot linker script not handling
the DWARF debug sections:
s390-linux-ld: warning: orphan section `.debug_line' from `arch/s390/boot/head.o' being placed in section `.debug_line'
s390-linux-ld: warning: orphan section `.debug_info' from `arch/s390/boot/head.o' being placed in section `.debug_info'
s390-linux-ld: warning: orphan section `.debug_abbrev' from `arch/s390/boot/head.o' being placed in section `.debug_abbrev'
s390-linux-ld: warning: orphan section `.debug_aranges' from `arch/s390/boot/head.o' being placed in section `.debug_aranges'
s390-linux-ld: warning: orphan section `.debug_str' from `arch/s390/boot/head.o' being placed in section `.debug_str'
include/asm-generic/vmlinux.lds.h has a macro for DWARF debug sections
named DWARF_DEBUG, use it to clear up the warnings.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-7-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there are several warnings from
arch/s390/boot/head.o due to the unhandled presence of '.rela' sections:
s390-linux-ld: warning: orphan section `.rela.iplt' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.head.text' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.got' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.data' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.data.rel.ro' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.iplt' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.head.text' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.got' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.data' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
s390-linux-ld: warning: orphan section `.rela.data.rel.ro' from `arch/s390/boot/head.o' being placed in section `.rela.dyn'
These sections are unneeded for the decompressor and they are not
emitted in the binary currently. In a manner similar to other
architectures, coalesce the sections into '.rela.dyn' and ensure it is
zero sized, which is a safe/tested approach versus full discard.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-6-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there is a warning about the presence of
an '.init.text' section in arch/s390/boot:
s390-linux-ld: warning: orphan section `.init.text' from `arch/s390/boot/sclp_early_core.o' being placed in section `.init.text'
arch/s390/boot/sclp_early_core.c includes a file from the main kernel
build, which picks up a usage of '__init' somewhere. For the
decompressed image, this section can just be coalesced into '.text'.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-5-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When building with CONFIG_LD_ORPHAN_WARN after selecting
CONFIG_ARCH_HAS_LD_ORPHAN_WARN, there are a lot of warnings around the
GOT and PLT sections:
s390-linux-ld: warning: orphan section `.plt' from `arch/s390/kernel/head64.o' being placed in section `.plt'
s390-linux-ld: warning: orphan section `.got' from `arch/s390/kernel/head64.o' being placed in section `.got'
s390-linux-ld: warning: orphan section `.got.plt' from `arch/s390/kernel/head64.o' being placed in section `.got.plt'
s390-linux-ld: warning: orphan section `.iplt' from `arch/s390/kernel/head64.o' being placed in section `.iplt'
s390-linux-ld: warning: orphan section `.igot.plt' from `arch/s390/kernel/head64.o' being placed in section `.igot.plt'
s390-linux-ld: warning: orphan section `.iplt' from `arch/s390/boot/head.o' being placed in section `.iplt'
s390-linux-ld: warning: orphan section `.igot.plt' from `arch/s390/boot/head.o' being placed in section `.igot.plt'
s390-linux-ld: warning: orphan section `.got' from `arch/s390/boot/head.o' being placed in section `.got'
Currently, only the '.got' section is actually emitted in the final
binary. In a manner similar to other architectures, put the '.got'
section near the '.data' section and coalesce the PLT sections,
checking that the final section is zero sized, which is a safe/tested
approach versus full discard.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-3-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
arch/s390/boot/vmlinux uses a different linker script and build rules
than the main vmlinux, so the '--orphan-handling' flag is not applied to
it. Add support for '--orphan-handling' so that all sections are
properly described in the linker script, which helps eliminate bugs
between linker implementations having different orphan section
heuristics.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240207-s390-lld-and-orphan-warn-v1-1-8a665b3346ab@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The size of vmalloc area depends from various factors
on boot and could be set to:
1. Default size as determined by VMALLOC_DEFAULT_SIZE macro;
2. One half of the virtual address space not occupied by
modules and fixed mappings;
3. The size provided by user with vmalloc= kernel command
line parameter;
In cases [1] and [2] the vmalloc area base address is aligned
on Region3 table type boundary, while in case [3] in might get
aligned on page boundary.
Limit the waste of page tables and always align vmalloc area
size and base address on segment boundary.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework the way physical pages are set no-dat / dat:
The old way is:
- Rely on that all pages are initially marked "dat"
- Allocate page tables for the kernel mapping
- Enable dat
- Walk the whole kernel mapping and set PG_arch_1 bit in all struct pages
that belong to pages of kernel page tables
- Walk all struct pages and test and clear the PG_arch_1 bit. If the bit is
not set, set the page state to no-dat
- For all subsequent page table allocations, set the page state to dat
(remove the no-dat state) on allocation time
Change this rather complex logic to a simpler approach:
- Set the whole physical memory (all pages) to "no-dat"
- Explicitly set those page table pages to "dat" which are part of the
kernel image (e.g. swapper_pg_dir)
- For all subsequent page table allocations, set the page state to dat
(remove the no-dat state) on allocation time
In result the code is simpler, and this also allows to get rid of one
odd usage of the PG_arch_1 bit.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The "cmma=" kernel command line parameter needs to be parsed early for
upcoming changes. Therefore move the parsing code.
Note that EX_TABLE handling of cmma_test_essa() needs to be open-coded,
since the early boot code doesn't have infrastructure for handling expected
exceptions.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Get rid of private VM_FAULT flags
- Add word-at-a-time implementation
- Add DCACHE_WORD_ACCESS support
- Cleanup control register handling
- Disallow CPU hotplug of CPU 0 to simplify its handling complexity,
following a similar restriction in x86
- Optimize pai crypto map allocation
- Update the list of crypto express EP11 coprocessor operation modes
- Fixes and improvements for secure guests AP pass-through
- Several fixes to address incorrect page marking for address translation
with the "cmma no-dat" feature, preventing potential incorrect guest
TLB flushes
- Fix early IPI handling
- Several virtual vs physical address confusion fixes
- Various small fixes and improvements all over the code
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmVFLkYACgkQjYWKoQLX
FBgxRwf9FSNFwLcbYbG1x94rUUHnbaiyJWCezp3/ypr+m+qDvQatLYc75SxwrH0y
ocSygqvtVryVkWAKKvOHF1Kg5R2Fedmzf5wuVTXglfPqE1ZgMGdwS/LtknIoz556
twZJIpFzUFt5xaljpTCZJanLMvy/npl0bilezhNGl6v7N5rsWLbfK6vsPMDm+TTZ
yscapOsk8Z16NjXq0FETS5JHG65jjj9rkRfb0qD8SOFhti0fR9MSP2xeRXrDMDZE
IWXog5usx2DS6VX2HnxA8O7z1hhuTccJ1K1+rYqbb0Fwccqi7QaGZXEvocYEvlvy
lVe3/jbyn27hUoypHcfVCAVxdoOrnw==
=SMOp
-----END PGP SIGNATURE-----
Merge tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Get rid of private VM_FAULT flags
- Add word-at-a-time implementation
- Add DCACHE_WORD_ACCESS support
- Cleanup control register handling
- Disallow CPU hotplug of CPU 0 to simplify its handling complexity,
following a similar restriction in x86
- Optimize pai crypto map allocation
- Update the list of crypto express EP11 coprocessor operation modes
- Fixes and improvements for secure guests AP pass-through
- Several fixes to address incorrect page marking for address
translation with the "cmma no-dat" feature, preventing potential
incorrect guest TLB flushes
- Fix early IPI handling
- Several virtual vs physical address confusion fixes
- Various small fixes and improvements all over the code
* tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (74 commits)
s390/cio: replace deprecated strncpy with strscpy
s390/sclp: replace deprecated strncpy with strtomem
s390/cio: fix virtual vs physical address confusion
s390/cio: export CMG value as decimal
s390: delete the unused store_prefix() function
s390/cmma: fix handling of swapper_pg_dir and invalid_pg_dir
s390/cmma: fix detection of DAT pages
s390/sclp: handle default case in sclp memory notifier
s390/pai_crypto: remove per-cpu variable assignement in event initialization
s390/pai: initialize event count once at initialization
s390/pai_crypto: use PERF_ATTACH_TASK define for per task detection
s390/mm: add missing arch_set_page_dat() call to gmap allocations
s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc()
s390/cmma: fix initial kernel address space page table walk
s390/diag: add missing virt_to_phys() translation to diag224()
s390/mm,fault: move VM_FAULT_ERROR handling to do_exception()
s390/mm,fault: remove VM_FAULT_BADMAP and VM_FAULT_BADACCESS
s390/mm,fault: remove VM_FAULT_SIGNAL
s390/mm,fault: remove VM_FAULT_BADCONTEXT
s390/mm,fault: simplify kfence fault handling
...
When physical memory is defined under z/VM using DEF STOR CONFIG, there
may be memory holes that are not hotpluggable memory. In such cases,
DCSS mapping could be placed in one of these memory holes. Subsequently,
attempting memory access to such DCSS mapping would result in a kasan
failure because there is no shadow memory mapping for it.
To maintain consistency with cases where DCSS mapping is positioned after
the kernel identity mapping, which is then covered by kasan zero shadow
mapping, handle the scenario above by populating zero shadow mapping
for memory holes where DCSS mapping could potentially be placed.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use control register bit defines instead of plain numbers where
possible.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Convert all single control register usages of __local_ctl_load() and
__local_ctl_store() to local_ctl_load() and local_ctl_store().
This also requires to change the type of some struct lowcore members
from __u64 to unsigned long.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add local and system prefix to some functions to clarify they change
control register contents on either the local CPU or the on all CPUs.
This results in the following API:
Two defines which load and save multiple control registers.
The defines correlate with the following C prototypes:
void __local_ctl_load(unsigned long *, unsigned int cr_low, unsigned int cr_high);
void __local_ctl_store(unsigned long *, unsigned int cr_low, unsigned int cr_high);
Two functions which locally set or clear one bit for a specified
control register:
void local_ctl_set_bit(unsigned int cr, unsigned int bit);
void local_ctl_clear_bit(unsigned int cr, unsigned int bit);
Two functions which set or clear one bit for a specified control
register on all CPUs:
void system_ctl_set_bit(unsigned int cr, unsigned int bit);
void system_ctl_clear_bit(unsigend int cr, unsigned int bit);
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename ctl_reg.h to ctlreg.h so it matches not only ctlreg.c but also
other control register related function, union, and structure names,
which all come with a ctlreg prefix.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Control register handling has nothing to do with low level SMP code.
Move it to a separate file.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The kernel mapping is setup in two stages: in the decompressor map all
pages with RWX permissions, and within the kernel change all mappings to
their final permissions, where most of the mappings are changed from RWX to
RWNX.
Change this and map all pages RWNX from the beginning, however without
enabling noexec via control register modification. This means that
effectively all pages are used with RWX permissions like before. When the
final permissions have been applied to the kernel mapping enable noexec via
control register modification.
This allows to remove quite a bit of non-obvious code.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Do the same like x86 with commit 76ea0025a2 ("x86/cpu: Remove "noexec"")
and remove the "noexec" kernel command line option.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make multi-line comment style consistent across the source.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Real Memory Copy and (absolute) Lowcore areas are
not accounted when virtual memory layout is set up.
Fixes: 4df29d2b90 ("s390/smp: rework absolute lowcore access")
Fixes: 2f0e8aae26 ("s390/mm: rework memcpy_real() to avoid DAT-off mode")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make Real Memory Copy area size and mask explicit.
This does not bring any functional change and only
needed for clarity.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The separate vmalloc area size check against _REGION2_SIZE
is needed in case user provided insanely large value using
vmalloc= kernel command line parameter. That could lead to
overflow and selecting 3 page table levels instead of 4.
Use size_add() for the overflow check and get rid of the
extra vmalloc area check.
With the current values of CONFIG_MAX_PHYSMEM_BITS and
PAGES_PER_SECTION the sum of maximal possible size of
identity mapping and vmemmap area (derived from these
macros) plus modules area size MODULES_LEN can not
overflow. Thus, that sum is used as first addend while
vmalloc area size is second addend for size_add().
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There are no users of VMEM_MAX_PHYS macro left, remove it.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As per description in mm/memory_hotplug.c platforms should define
arch_get_mappable_range() that provides maximum possible addressable
physical memory range for which the linear mapping could be created.
The current implementation uses VMEM_MAX_PHYS macro as the maximum
mappable physical address and it is simply a cast to vmemmap. Since
the address is in physical address space the natural upper limit of
MAX_PHYSMEM_BITS is honoured:
vmemmap_start = min(vmemmap_start, 1UL << MAX_PHYSMEM_BITS);
Further, to make sure the identity mapping would not overlay with
vmemmap, the size of identity mapping could be stripped like this:
ident_map_size = min(ident_map_size, vmemmap_start);
Similarily, any other memory that could be added (e.g DCSS segment)
should not overlay with vmemmap as well and that is prevented by
using vmemmap (VMEM_MAX_PHYS macro) as the upper limit.
However, while the use of VMEM_MAX_PHYS brings the desired result
it actually poses two issues:
1. As described, vmemmap is handled as a physical address, although
it is actually a pointer to struct page in virtual address space.
2. As vmemmap is a virtual address it could have been located
anywhere in the virtual address space. However, the desired
necessity to honour MAX_PHYSMEM_BITS limit prevents that.
Rework arch_get_mappable_range() callback in a way it does not
use VMEM_MAX_PHYS macro and does not confuse the notion of virtual
vs physical address spacees as result. That paves the way for moving
vmemmap elsewhere and optimizing the virtual address space layout.
Introduce max_mappable preserved boot variable and let function
setup_kernel_memory_layout() set it up. As result, the rest of the
code is does not need to know the virtual memory layout specifics.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>