Commit Graph

143 Commits

Author SHA1 Message Date
Heiko Carstens
65c9a9f925 s390: Explicitly include <linux/export.h>
Explicitly include <linux/export.h> in files which contain an
EXPORT_SYMBOL().

See commit a934a57a42 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") for more details.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2025-06-17 18:18:02 +02:00
Claudio Imbrenda
200197908d KVM: s390: Refactor and split some gmap helpers
Refactor some gmap functions; move the implementation into a separate
file with only helper functions. The new helper functions work on vm
addresses, leaving all gmap logic in the gmap functions, which mostly
become just wrappers.

The whole gmap handling is going to be moved inside KVM soon, but the
helper functions need to touch core mm functions, and thus need to
stay in the core of kernel.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-4-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-4-imbrenda@linux.ibm.com>
2025-05-28 17:48:04 +02:00
Claudio Imbrenda
af941f3dd8 s390: Remove unneeded includes
Many files don't need to include asm/tlb.h or asm/gmap.h.
On the other hand, asm/tlb.h does need to include asm/gmap.h.

Remove all unneeded includes so that asm/tlb.h is not directly used by
s390 arch code anymore. Remove asm/gmap.h from a few other files as
well, so that now only KVM code, mm/gmap.c, and asm/tlb.h include it.

Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20250528095502.226213-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250528095502.226213-2-imbrenda@linux.ibm.com>
2025-05-28 17:48:04 +02:00
Linus Torvalds
f90f2145b2 s390 updates for 6.15 merge window
- Add sorting of mcount locations at build time
 
 - Rework uaccess functions with C exception handling to shorten inline
   assembly size and enable full inlining. This yields near-optimal code
   for small constant copies with a ~40kb kernel size increase
 
 - Add support for a configurable STRICT_MM_TYPECHECKS which allows to
   generate better code, but also allows to have type checking for
   debug builds
 
 - Optimize get_lowcore() for common callers with alternatives that
   nearly revert to the pre-relocated lowcore code, while also slightly
   reducing syscall entry and exit time
 
 - Convert MACHINE_HAS_* checks for single facility tests into cpu_has_*
   style macros that call test_facility(), and for features with additional
   conditions, add a new ALT_TYPE_FEATURE alternative to provide a static
   branch via alternative patching. Also, move machine feature detection
   to the decompressor for early patching and add debugging functionality
   to easily show which alternatives are patched
 
 - Add exception table support to early boot / startup code to get rid
   of the open coded exception handling
 
 - Use asm_inline for all inline assemblies with EX_TABLE or ALTERNATIVE
   to ensure correct inlining and unrolling decisions
 
 - Remove 2k page table leftovers now that s390 has been switched to
   always allocate 4k page tables
 
 - Split kfence pool into 4k mappings in arch_kfence_init_pool() and
   remove the architecture-specific kfence_split_mapping()
 
 - Use READ_ONCE_NOCHECK() in regs_get_kernel_stack_nth() to silence
   spurious KASAN warnings from opportunistic ftrace argument tracing
 
 - Force __atomic_add_const() variants on s390 to always return void,
   ensuring compile errors for improper usage
 
 - Remove s390's ioremap_wt() and pgprot_writethrough() due to mismatched
   semantics and lack of known users, relying on asm-generic fallbacks
 
 - Signal eventfd in vfio-ap to notify userspace when the guest AP
   configuration changes, including during mdev removal
 
 - Convert mdev_types from an array to a pointer in vfio-ccw and vfio-ap
   drivers to avoid fake flex array confusion
 
 - Cleanup trap code
 
 - Remove references to the outdated linux390@de.ibm.com address
 
 - Other various small fixes and improvements all over the code
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmfmuPwACgkQjYWKoQLX
 FBgTDAgAjKmZ5OYjACRfYepTvKk9SDqa2CBlQZ+BhbAXEVIrxKnv8OkImAXoWNsM
 mFxiCxAHWdcD+nqTrxFsXhkNLsndijlwnj/IqZgvy6R/3yNtBlAYRPLujOmVrsQB
 dWB8Dl38p63Ip1JfAqyabiAOUjfhrclRcM5FX5tgciXA6N/vhY3OM6k0+k7wN4Nj
 Dei/rCrnYRXTrFQgtM4w8JTIrwdnXjeKvaTYCflh4Q5ISJ7TceSF7cqq8HOs5hhK
 o2ciaoTdx212522CIsxeN3Ls3jrn8bCOCoOeSCysc5RL84grAuFnmjSajo1LFide
 S/TQtHXYy78Wuei9xvHi561ogiv/ww==
 =Kxgc
 -----END PGP SIGNATURE-----

Merge tag 's390-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Add sorting of mcount locations at build time

 - Rework uaccess functions with C exception handling to shorten inline
   assembly size and enable full inlining. This yields near-optimal code
   for small constant copies with a ~40kb kernel size increase

 - Add support for a configurable STRICT_MM_TYPECHECKS which allows to
   generate better code, but also allows to have type checking for debug
   builds

 - Optimize get_lowcore() for common callers with alternatives that
   nearly revert to the pre-relocated lowcore code, while also slightly
   reducing syscall entry and exit time

 - Convert MACHINE_HAS_* checks for single facility tests into cpu_has_*
   style macros that call test_facility(), and for features with
   additional conditions, add a new ALT_TYPE_FEATURE alternative to
   provide a static branch via alternative patching. Also, move machine
   feature detection to the decompressor for early patching and add
   debugging functionality to easily show which alternatives are patched

 - Add exception table support to early boot / startup code to get rid
   of the open coded exception handling

 - Use asm_inline for all inline assemblies with EX_TABLE or ALTERNATIVE
   to ensure correct inlining and unrolling decisions

 - Remove 2k page table leftovers now that s390 has been switched to
   always allocate 4k page tables

 - Split kfence pool into 4k mappings in arch_kfence_init_pool() and
   remove the architecture-specific kfence_split_mapping()

 - Use READ_ONCE_NOCHECK() in regs_get_kernel_stack_nth() to silence
   spurious KASAN warnings from opportunistic ftrace argument tracing

 - Force __atomic_add_const() variants on s390 to always return void,
   ensuring compile errors for improper usage

 - Remove s390's ioremap_wt() and pgprot_writethrough() due to
   mismatched semantics and lack of known users, relying on asm-generic
   fallbacks

 - Signal eventfd in vfio-ap to notify userspace when the guest AP
   configuration changes, including during mdev removal

 - Convert mdev_types from an array to a pointer in vfio-ccw and vfio-ap
   drivers to avoid fake flex array confusion

 - Cleanup trap code

 - Remove references to the outdated linux390@de.ibm.com address

 - Other various small fixes and improvements all over the code

* tag 's390-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (78 commits)
  s390: Use inline qualifier for all EX_TABLE and ALTERNATIVE inline assemblies
  s390/kfence: Split kfence pool into 4k mappings in arch_kfence_init_pool()
  s390/ptrace: Avoid KASAN false positives in regs_get_kernel_stack_nth()
  s390/boot: Ignore vmlinux.map
  s390/sysctl: Remove "vm/allocate_pgste" sysctl
  s390: Remove 2k vs 4k page table leftovers
  s390/tlb: Use mm_has_pgste() instead of mm_alloc_pgste()
  s390/lowcore: Use lghi instead llilh to clear register
  s390/syscall: Merge __do_syscall() and do_syscall()
  s390/spinlock: Implement SPINLOCK_LOCKVAL with inline assembly
  s390/smp: Implement raw_smp_processor_id() with inline assembly
  s390/current: Implement current with inline assembly
  s390/lowcore: Use inline qualifier for get_lowcore() inline assembly
  s390: Move s390 sysctls into their own file under arch/s390
  s390/syscall: Simplify syscall_get_arguments()
  s390/vfio-ap: Notify userspace that guest's AP config changed when mdev removed
  s390: Remove ioremap_wt() and pgprot_writethrough()
  s390/mm: Add configurable STRICT_MM_TYPECHECKS
  s390/mm: Convert pgste_val() into function
  s390/mm: Convert pgprot_val() into function
  ...
2025-03-29 11:59:43 -07:00
Heiko Carstens
174cb82a57 s390: Remove 2k vs 4k page table leftovers
Since commit d08d4e7cd6 ("s390/mm: use full 4KB page for 2KB PTE") always
4k page tables are allocated, however there is still some (now) obsolete
code left which deals with switching from 2k to 4k page tables for qemu/kvm
processes.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Remove the not needed code.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-18 17:13:05 +01:00
Claudio Imbrenda
d8dfda5af0 KVM: s390: pv: fix race when making a page secure
Holding the pte lock for the page that is being converted to secure is
needed to avoid races. A previous commit removed the locking, which
caused issues. Fix by locking the pte again.

Fixes: 5cbe24350b ("KVM: s390: move pv gmap functions into kvm")
Reported-by: David Hildenbrand <david@redhat.com>
Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[david@redhat.com: replace use of get_locked_pte() with folio_walk_start()]
Link: https://lore.kernel.org/r/20250312184912.269414-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250312184912.269414-2-imbrenda@linux.ibm.com>
2025-03-14 15:24:19 +01:00
Heiko Carstens
17d3804808 s390/tlb: Convert MACHINE_HAS_TLB_GUEST to machine_has_tlb_guest()
Use static branch(es) to implement and use machine_has_tlb_guest()
instead of a runtime check via MACHINE_HAS_TLB_GUEST.

Also add sclp_early_detect_machine_features() in order to allow for
feature detection from the decompressor.

Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-04 17:18:06 +01:00
Heiko Carstens
949b73c990 s390/cpufeature: Convert MACHINE_HAS_IDTE to cpu_has_idte()
Convert MACHINE_HAS_... to cpu_has_...() which uses test_facility() instead
of testing the machine_flags lowcore member if the feature is present.

test_facility() generates better code since it results in a static branch
without accessing memory. The branch is patched via alternatives by the
decompressor depending on the availability of the required facility.

Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2025-03-04 17:18:05 +01:00
Claudio Imbrenda
84b7387692 KVM: s390: remove the last user of page->index
Shadow page tables use page->index to keep the g2 address of the guest
page table being shadowed.

Instead of keeping the information in page->index, split the address
and smear it over the 16-bit softbits areas of 4 PGSTEs.

This removes the last s390 user of page->index.

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-16-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-16-imbrenda@linux.ibm.com>
2025-01-31 12:03:53 +01:00
Claudio Imbrenda
c27e002626 KVM: s390: remove useless page->index usage
The page->index field for VSIE dat tables is only used for segment
tables.

Stop setting the field for all region tables.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-14-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-14-imbrenda@linux.ibm.com>
2025-01-31 12:03:53 +01:00
Claudio Imbrenda
43656f774a KVM: s390: move gmap_shadow_pgt_lookup() into kvm
Move gmap_shadow_pgt_lookup() from mm/gmap.c into kvm/gaccess.c .

Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-13-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-13-imbrenda@linux.ibm.com>
2025-01-31 12:03:53 +01:00
Claudio Imbrenda
ef0c8ef848 KVM: s390: stop using lists to keep track of used dat tables
Until now, every dat table allocated to map a guest was put in a
linked list. The page->lru field of struct page was used to keep track
of which pages were being used, and when the gmap is torn down, the
list was walked and all pages freed.

This patch gets rid of the usage of page->lru. Page tables are now
freed by recursively walking the dat table tree.

Since s390_unlist_old_asce() becomes useless now, remove it.

Acked-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-12-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-12-imbrenda@linux.ibm.com>
2025-01-31 12:03:53 +01:00
Claudio Imbrenda
37d1b5d8d5 KVM: s390: stop using page->index for non-shadow gmaps
The host_to_guest radix tree will now map userspace addresses to guest
addresses, instead of userspace addresses to segment tables.

When segment tables and page tables are needed, they are found using an
additional gmap_table_walk().

This gets rid of all usage of page->index for non-shadow gmaps.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-11-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-11-imbrenda@linux.ibm.com>
2025-01-31 12:03:53 +01:00
Claudio Imbrenda
c9f721ed8e KVM: s390: move some gmap shadowing functions away from mm/gmap.c
Move some gmap shadowing functions from mm/gmap.c to kvm/kvm-s390.c and
the newly created kvm/gmap-vsie.c

This is a step toward removing gmap from mm.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-10-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-10-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Claudio Imbrenda
d41993f713 KVM: s390: get rid of gmap_translate()
Add gpa_to_hva(), which uses memslots, and use it to replace all uses
of gmap_translate().

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-9-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-9-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Claudio Imbrenda
6eb84e1300 KVM: s390: get rid of gmap_fault()
All gmap page faults are already handled in kvm by the function
kvm_s390_handle_dat_fault(); only few users of gmap_fault remained, all
within kvm.

Convert those calls to use kvm_s390_handle_dat_fault() instead.

Remove gmap_fault() entirely since it has no more users.

Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-8-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-8-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Claudio Imbrenda
3762e905ec KVM: s390: use __kvm_faultin_pfn()
Refactor the existing page fault handling code to use __kvm_faultin_pfn().

This possible now that memslots are always present.

Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-7-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-7-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Claudio Imbrenda
5cbe24350b KVM: s390: move pv gmap functions into kvm
Move gmap related functions from kernel/uv into kvm.

Create a new file to collect gmap-related functions.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
[fixed unpack_one(), thanks mhartmay@linux.ibm.com]
Link: https://lore.kernel.org/r/20250123144627.312456-6-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-6-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Linus Torvalds
509f806f7f more s390 updates for 6.13 merge window
- Add swap entry for hugetlbfs support
 
 - Add PTE_MARKER support for hugetlbs mappings; this fixes a regression
   (possible page fault loop) which was introduced when support for
   UFFDIO_POISON for hugetlbfs was added
 
 - Add ARCH_HAS_PREEMPT_LAZY and PREEMPT_DYNAMIC support
 
 - Mark IRQ entries in entry code, so that stack tracers can filter out the
   non-IRQ parts of stack traces. This fixes stack depot capacity limit
   warnings, since without filtering the number of unique stack traces is
   huge
 
 - In PCI code fix leak of struct zpci_dev object, and fix potential double
   remove of hotplug slot
 
 - Fix pagefault_disable() / pagefault_enable() unbalance in
   arch_stack_user_walk_common()
 
 - A couple of inline assembly optimizations, more cmpxchg() to
   try_cmpxchg() conversions, and removal of usages of xchg() and cmpxchg()
   on one and two byte memory areas
 
 - Various other small improvements and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmdJ3WoACgkQIg7DeRsp
 bsJt1RAAtlkbeN4+eVeYM4vBwHvgfAY/5Ii2wdHO2qwPHqBVkRtsqrmyewE/tVCF
 PZsYBXDDrzyAtLMqjlNGDQ1QexNLn4BELgSIysr45mxwMq1W33BiXvb8I5uK/V/7
 /TcW2s1daJKKrbk+HBA8ZTwna5SeUSoZuh9y/n9SKVC4rRkWdeL7G1RRNQtafDlg
 aELCo17iHDZNoHeoRStOimZqVBwko6IQqQH4DCx2S4+J6nKQBGRyzGWIkLRoUxr6
 MgNLrxekWjkoqAnXM0Ztb7LYg6AS/iOuGbqg/xLi1VJSWNCIf9zLpDs++SdFHoTU
 n4Cj07IHR4OLQ1YB+EX2uPY7rJw0tPt0g/dgmYYi3uP88hJ7VYFOtfJx/UGlid2q
 3l7wXNwtg+CJtw0Ey+21cMdmnOffxH9c3nBPahe7zK5k1GKjXDOfWEcmucG0zW5K
 qYI5m7vAZAX4ve1362DOgJei/1uxGuMQQZsobHpwfhcGXzLZ2AZY45Ls86nQzHua
 KpupybWQe70hQYk9hUw+M/ShChuH8dhnPjx51T0r/0E0BdU6Q20xLPLWx/2jRzUb
 FlFg7WtVw2y45eQCFPbtVsoVzDCpfpfgTw5rrDsjFf/twS0E3ubmTC1rLr4YB+5m
 5cjPys/SYpQWUi3wQFTQ6dL3w0+vWXlQmTi5ChcxTZF2ytwP+yg=
 =cfmM
 -----END PGP SIGNATURE-----

Merge tag 's390-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Heiko Carstens:

 - Add swap entry for hugetlbfs support

 - Add PTE_MARKER support for hugetlbs mappings; this fixes a regression
   (possible page fault loop) which was introduced when support for
   UFFDIO_POISON for hugetlbfs was added

 - Add ARCH_HAS_PREEMPT_LAZY and PREEMPT_DYNAMIC support

 - Mark IRQ entries in entry code, so that stack tracers can filter out
   the non-IRQ parts of stack traces. This fixes stack depot capacity
   limit warnings, since without filtering the number of unique stack
   traces is huge

 - In PCI code fix leak of struct zpci_dev object, and fix potential
   double remove of hotplug slot

 - Fix pagefault_disable() / pagefault_enable() unbalance in
   arch_stack_user_walk_common()

 - A couple of inline assembly optimizations, more cmpxchg() to
   try_cmpxchg() conversions, and removal of usages of xchg() and
   cmpxchg() on one and two byte memory areas

 - Various other small improvements and cleanups

* tag 's390-6.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (27 commits)
  Revert "s390/mm: Allow large pages for KASAN shadow mapping"
  s390/spinlock: Use flag output constraint for arch_cmpxchg_niai8()
  s390/spinlock: Use R constraint for arch_load_niai4()
  s390/spinlock: Generate shorter code for arch_spin_unlock()
  s390/spinlock: Remove condition code clobber from arch_spin_unlock()
  s390/spinlock: Use symbolic names in inline assemblies
  s390: Support PREEMPT_DYNAMIC
  s390/pci: Fix potential double remove of hotplug slot
  s390/pci: Fix leak of struct zpci_dev when zpci_add_device() fails
  s390/mm/hugetlbfs: Add missing includes
  s390/mm: Add PTE_MARKER support for hugetlbfs mappings
  s390/mm: Introduce region-third and segment table swap entries
  s390/mm: Introduce region-third and segment table entry present bits
  s390/mm: Rearrange region-third and segment table entry SW bits
  KVM: s390: Increase size of union sca_utility to four bytes
  KVM: s390: Remove one byte cmpxchg() usage
  KVM: s390: Use try_cmpxchg() instead of cmpxchg() loops
  s390/ap: Replace xchg() with WRITE_ONCE()
  s390/mm: Allow large pages for KASAN shadow mapping
  s390: Add ARCH_HAS_PREEMPT_LAZY support
  ...
2024-11-29 10:40:52 -08:00
Gerald Schaefer
03e6db16b8 s390/mm: Introduce region-third and segment table entry present bits
Introduce region-third and segment table entry present SW bits, and adjust
pmd/pud_present() accordingly.

Also add pmd/pud_present() checks to pmd/pud_leaf(), to return false for
future swap entries. Same logic applies to pmd_trans_huge(), make that
return pmd_leaf() instead of duplicating the same check.

huge_pte_offset() also needs to be adjusted, current code would return
NULL for !pud_present(). Use the same logic as in the generic version,
which allows for !pud_present() swap entries.

Similar to PTE, bit 63 can be used for the new SW present bit in region
and segment table entries. For segment-table entries (PMD) the architecture
says that "Bits 62-63 are available for programming", so they are safe to
use. The same is true for large leaf region-third-table entries (PUD).

However, for non-leaf region-third-table entries, bits 62-63 indicate the
TABLE LENGTH and both must be set to 1. But such entries would always be
considered as present, so it is safe to use bit 63 as PRESENT bit for PUD.
They also should not conflict with bit 62 potentially later used for
preserving SOFT_DIRTY in swap entries, because they are not swap entries.

Valid PMDs / PUDs should always have the present bit set, so add it to
the various pgprot defines, and also _SEGMENT_ENTRY which is OR'ed e.g.
in pmd_populate(). _REGION3_ENTRY wouldn't need any change, as the present
bit is already included in the TABLE LENGTH, but also explicitly add it
there, for completeness, and just in case the bit would ever be changed.

gmap code needs some adjustment, to also OR the _SEGMENT_ENTRY, like it
is already done gmap_shadow_pgt() when creating new PMDs, but not in
__gmap_link(). Otherwise, the gmap PMDs would not be considered present,
e.g. when using pmd_leaf() checks in gmap code. The various WARN_ON
checks in gmap code also need adjustment, to tolerate the new present
bit.

This is a prerequisite for hugetlbfs PTE_MARKER support on s390, which
is needed to fix a regression introduced with commit 8a13897fb0
("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs"). That commit
depends on the availability of swap entries for hugetlbfs, which were
not available for s390 so far.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-11-27 12:55:21 +01:00
Linus Torvalds
0352387523 First step of consolidating the VDSO data page handling:
The VDSO data page handling is architecture specific for historical
   reasons, but there is no real technical reason to do so.
 
   Aside of that VDSO data has become a dump ground for various mechanisms
   and fail to provide a clear separation of the functionalities.
 
   Clean this up by:
 
     * consolidating the VDSO page data by getting rid of architecture
       specific warts especially in x86 and PowerPC.
 
     * removing the last includes of header files which are pulling in other
       headers outside of the VDSO namespace.
 
     * seperating timekeeping and other VDSO data accordingly.
 
   Further consolidation of the VDSO page handling is done in subsequent
   changes scheduled for the next merge window.
 
   This also lays the ground for expanding the VDSO time getters for
   independent PTP clocks in a generic way without making every architecture
   add support seperately.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kyoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVBjD/9awdN2YeCGIM9rlHIktUdNRmRSL2SL
 6av1CPffN5DenONYTXWrDYPkC4yfjUwIs8H57uzFo10yA7RQ/Qfq+O68k5GnuFew
 jvpmmYSZ6TT21AmAaCIhn+kdl9YbEJFvN2AWH85Bl29k9FGB04VzJlQMMjfEZ1a5
 Mhwv+cfYNuPSZmU570jcxW2XgbyTWlLZBByXX/Tuz9bwpmtszba507bvo45x6gIP
 twaWNzrsyJpdXfMrfUnRiChN8jHlDN7I6fgQvpsoRH5FOiVwIFo0Ip2rKbk+ONfD
 W/rcU5oeqRIxRVDHzf2Sv8WPHMCLRv01ZHBcbJOtgvZC3YiKgKYoeEKabu9ZL1BH
 6VmrxjYOBBFQHOYAKPqBuS7BgH5PmtMbDdSZXDfRaAKaCzhCRysdlWW7z48r2R//
 zPufb7J6Tle23AkuZWhFjvlGgSBl4zxnTFn31HYOyQps3TMI4y50Z2DhE/EeU8a6
 DRl8/k1KQVDUZ6udJogS5kOr1J8pFtUPrA2uhR8UyLdx7YKiCzcdO1qWAjtXlVe8
 oNpzinU+H9bQqGe9IyS7kCG9xNaCRZNkln5Q1WfnkTzg5f6ihfaCvIku3l4bgVpw
 3HmcxYiC6RxQB+ozwN7hzCCKT4L9aMhr/457TNOqRkj2Elw3nvJ02L4aI86XAKLE
 jwO9Fkp9qcCxCw==
 =q5eD
 -----END PGP SIGNATURE-----

Merge tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull vdso data page handling updates from Thomas Gleixner:
 "First steps of consolidating the VDSO data page handling.

  The VDSO data page handling is architecture specific for historical
  reasons, but there is no real technical reason to do so.

  Aside of that VDSO data has become a dump ground for various
  mechanisms and fail to provide a clear separation of the
  functionalities.

  Clean this up by:

   - consolidating the VDSO page data by getting rid of architecture
     specific warts especially in x86 and PowerPC.

   - removing the last includes of header files which are pulling in
     other headers outside of the VDSO namespace.

   - seperating timekeeping and other VDSO data accordingly.

  Further consolidation of the VDSO page handling is done in subsequent
  changes scheduled for the next merge window.

  This also lays the ground for expanding the VDSO time getters for
  independent PTP clocks in a generic way without making every
  architecture add support seperately"

* tag 'timers-vdso-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
  x86/vdso: Add missing brackets in switch case
  vdso: Rename struct arch_vdso_data to arch_vdso_time_data
  powerpc: Split systemcfg struct definitions out from vdso
  powerpc: Split systemcfg data out of vdso data page
  powerpc: Add kconfig option for the systemcfg page
  powerpc/pseries/lparcfg: Use num_possible_cpus() for potential processors
  powerpc/pseries/lparcfg: Fix printing of system_active_processors
  powerpc/procfs: Propagate error of remap_pfn_range()
  powerpc/vdso: Remove offset comment from 32bit vdso_arch_data
  x86/vdso: Split virtual clock pages into dedicated mapping
  x86/vdso: Delete vvar.h
  x86/vdso: Access vdso data without vvar.h
  x86/vdso: Move the rng offset to vsyscall.h
  x86/vdso: Access rng vdso data without vvar.h
  x86/vdso: Access timens vdso data without vvar.h
  x86/vdso: Allocate vvar page from C code
  x86/vdso: Access rng data from kernel without vvar
  x86/vdso: Place vdso_data at beginning of vvar page
  x86/vdso: Use __arch_get_vdso_data() to access vdso data
  x86/mm/mmap: Remove arch_vma_name()
  ...
2024-11-19 16:09:13 -08:00
Claudio Imbrenda
1da7090548 s390/mm/gmap: Remove gmap_{en,dis}able()
Remove gmap_enable(), gmap_disable(), and gmap_get_enabled() since they do
not have any users anymore.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-8-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:19 +01:00
Claudio Imbrenda
473aaf52b9 s390/mm/gmap: Fix __gmap_fault() return code
Errors in fixup_user_fault() were masked and -EFAULT was returned for
any error, including out of memory.

Fix this by returning the correct error code. This means that in many
cases the error code will be propagated all the way to userspace.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-5-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:18 +01:00
Claudio Imbrenda
075fd7362c s390/mm/gmap: Refactor gmap_fault() and add support for pfault
When specifying FAULT_FLAG_RETRY_NOWAIT as flag for gmap_fault(), the
gmap fault will be processed only if it can be resolved quickly and
without sleeping. This will be needed for pfault.

Refactor gmap_fault() to improve readability.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20241022120601.167009-4-imbrenda@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-10-29 11:49:18 +01:00
Vincenzo Frascino
6febe0efb2 s390: Remove remaining _PAGE_* macros
The introduction of vdso/page.h made the definition of _PAGE_SHIFT,
_PAGE_SIZE, _PAGE_MASK redundant.

Refactor the code to remove the macros.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241014151340.1639555-4-vincenzo.frascino@arm.com
Closes: https://lore.kernel.org/oe-kbuild-all/202410112106.mvc2U2p0-lkp@intel.com/
2024-10-16 00:13:04 +02:00
Sven Schnelle
208da1d5fc s390: Replace S390_lowcore by get_lowcore()
Replace all S390_lowcore usages in arch/s390/ by get_lowcore().

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2024-06-18 17:01:33 +02:00
David Hildenbrand
1433b36e3a s390/hugetlb: Convert PG_arch_1 code to work on folio->flags
Let's make it clearer that we are always working on folio flags and
never page flags of tail pages by converting remaining PG_arch_1 users
that modify page->flags to modify folio->flags instead.

No functional change intended, because we would always have worked with
the head page (where page->flags corresponds to folio->flags) and never
with tail pages.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-11-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:26 +02:00
David Hildenbrand
7063150650 s390/uv: Convert uv_destroy_owned_page() to uv_destroy_(folio|pte)()
Let's have the following variants for destroying pages:

(1) uv_destroy(): Like uv_pin_shared() and uv_convert_from_secure(),
"low level" helper that operates on paddr and doesn't mess with folios.

(2) uv_destroy_folio(): Consumes a folio to which we hold a reference.

(3) uv_destroy_pte(): Consumes a PTE that holds a reference through the
mapping.

Unfortunately we need uv_destroy_pte(), because pfn_folio() and
friends are not available in pgtable.h.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240508182955.358628-8-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-06-05 17:17:25 +02:00
Linus Torvalds
d65e1a0f30 - Store AP Query Configuration Information in a static buffer
- Rework the AP initialization and add missing cleanups to the error path
 
 - Swap IRQ and AP bus/device registration to avoid race conditions
 
 - Export prot_virt_guest symbol
 
 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus
 
 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus
 
 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description and
   dependency and rename it to CONFIG_AP_DEBUG
 
 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code
 
 - Adjust indentation of RELOCS command build step
 
 - Make crypto performance counters upward compatible
 
 - Convert make_page_secure() and gmap_make_secure() to use folio
 
 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs
 
 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes
 
 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data
 
 - Export measurement data for all channel-measurement-groups (CMG), not
   only for a specific ones. This enables support of new CMG data formats
   in userspace without the need for kernel changes
 
 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available
 
 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that
 
 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS
 
 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up
 
 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap
 
 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This will allow to place the kernel image next to kernel modules
 
 - Move everyting KASLR related from <asm/setup.h> to <asm/page.h>
 
 - Put virtual memory layout information into a structure to improve
   code generation
 
 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling
   of the addresses spaces
 
 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting into __identity_base persistent
   boot variable and use it in proper context
 
 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END
 
 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values
 
 - Store virtual memory layout in OS_INFO. It is read out by makedumpfile,
   crash and other tools
 
 - Store virtual memory layout in VMCORE_INFO. It is read out by crash and
   other tools when /proc/kcore device is used
 
 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when virtual
   and physical memory spaces are uncoupled
 
 - Uncouple physical and virtual address spaces
 
 - Map kernel at fixed location when KASLR mode is disabled. The location is
   defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration value.
 
 - Rework deployment of kernel image for both compressed and uncompressed
   variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel configuration
   value
 
 - Move .vmlinux.relocs section in front of the compressed kernel.
   The interim section rescue step is avoided as result
 
 - Correct modules thunk offset calculation when branch target is more
   than 2GB away
 
 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline thunks,
   make modules use kernel expolines. Also make EXPOLINE_EXTERN the default
   if the compiler supports it
 
 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fallback to allocating a fresh
   zeroed anonymous folio and insert that instead
 
 - Re-enable shared zeropages for non-PV and non-skeys KVM guests
 
 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use
 
 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a vfio-ap
   mediated device in a single operation
 
 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests
 
 - Add write support to ap_config sysfs attribute to allow atomic update
   a vfio-ap mediated device state
 
 - Document ap_config sysfs attribute
 
 - Function os_info_old_init() is expected to be called only from a regular
   kdump kernel. Enable it to be called from a stand-alone dump kernel
 
 - Address gcc -Warray-bounds warning and fix array size in struct os_info
 
 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks
 
 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values
 
 - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
   kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is disabled
 
 - Compile kernel with -fPIC and link with -no-pie to allow kpatch feature
   always succeed and drop the whole CONFIG_PIE_BUILD option-enabled code
 
 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZjkp5xccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8D99AQCEby+KHssuZe9m0NvvikWREYBC
 myqob4EmdU3KdTEbNAEAt2OB7mzSQc90yjawI+Je7vwVyh3uc2Nb4Qg05yO6owI=
 =eOYN
 -----END PGP SIGNATURE-----

Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Alexander Gordeev:

 - Store AP Query Configuration Information in a static buffer

 - Rework the AP initialization and add missing cleanups to the error
   path

 - Swap IRQ and AP bus/device registration to avoid race conditions

 - Export prot_virt_guest symbol

 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus

 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus

 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description
   and dependency and rename it to CONFIG_AP_DEBUG

 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code

 - Adjust indentation of RELOCS command build step

 - Make crypto performance counters upward compatible

 - Convert make_page_secure() and gmap_make_secure() to use folio

 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs

 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes

 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data

 - Export measurement data for all channel-measurement-groups (CMG), not
   only for a specific ones. This enables support of new CMG data
   formats in userspace without the need for kernel changes

 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available

 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that

 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS

 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up

 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap

 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This will allow to place the kernel image next to kernel modules

 - Move everyting KASLR related from <asm/setup.h> to <asm/page.h>

 - Put virtual memory layout information into a structure to improve
   code generation

 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling of
   the addresses spaces

 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting into __identity_base
   persistent boot variable and use it in proper context

 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END

 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values

 - Store virtual memory layout in OS_INFO. It is read out by
   makedumpfile, crash and other tools

 - Store virtual memory layout in VMCORE_INFO. It is read out by crash
   and other tools when /proc/kcore device is used

 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when
   virtual and physical memory spaces are uncoupled

 - Uncouple physical and virtual address spaces

 - Map kernel at fixed location when KASLR mode is disabled. The
   location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration
   value.

 - Rework deployment of kernel image for both compressed and
   uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel
   configuration value

 - Move .vmlinux.relocs section in front of the compressed kernel. The
   interim section rescue step is avoided as result

 - Correct modules thunk offset calculation when branch target is more
   than 2GB away

 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline
   thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN
   the default if the compiler supports it

 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fallback to allocating a fresh
   zeroed anonymous folio and insert that instead

 - Re-enable shared zeropages for non-PV and non-skeys KVM guests

 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use

 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a
   vfio-ap mediated device in a single operation

 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests

 - Add write support to ap_config sysfs attribute to allow atomic update
   a vfio-ap mediated device state

 - Document ap_config sysfs attribute

 - Function os_info_old_init() is expected to be called only from a
   regular kdump kernel. Enable it to be called from a stand-alone dump
   kernel

 - Address gcc -Warray-bounds warning and fix array size in struct
   os_info

 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks

 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values

 - Sections .hash and .gnu.hash are only created when CONFIG_PIE_BUILD
   kernel is enabled. Drop these for the case CONFIG_PIE_BUILD is
   disabled

 - Compile kernel with -fPIC and link with -no-pie to allow kpatch
   feature always succeed and drop the whole CONFIG_PIE_BUILD
   option-enabled code

 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks

* tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
  Revert "s390: Relocate vmlinux ELF data to virtual address space"
  KVM: s390: vsie: Use virt_to_phys for crypto control block
  s390: Relocate vmlinux ELF data to virtual address space
  s390: Compile kernel with -fPIC and link with -no-pie
  s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
  s390/ftrace: Use unwinder instead of __builtin_return_address()
  s390/pci: Drop unneeded reference to CONFIG_DMI
  s390/os_info: Fix array size in struct os_info
  s390/os_info: Initialize old os_info in standalone dump kernel
  docs: Update s390 vfio-ap doc for ap_config sysfs attribute
  s390/vfio-ap: Add write support to sysfs attr ap_config
  s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue
  s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state
  s390/ap: Externalize AP bus specific bitmap reading function
  s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
  mm/userfaultfd: Do not place zeropages when zeropages are disallowed
  s390/expoline: Make modules use kernel expolines
  s390/nospec: Correct modules thunk offset calculation
  s390/boot: Do not rescue .vmlinux.relocs section
  s390/boot: Rework deployment of the kernel image
  ...
2024-05-13 08:33:52 -07:00
Claudio Imbrenda
843c328068 s390/mm: Fix storage key clearing for guest huge pages
The function __storage_key_init_range() expects the end address to be
the first byte outside the range to be initialized. I.e. end - start
should be the size of the area to be initialized.

The current code works because __storage_key_init_range() will still loop
over every page in the range, but it is slower than using sske_frame().

Fixes: 964c2c05c9 ("s390/mm: Clear huge page storage keys on enable_skey")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20240416114220.28489-2-imbrenda@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-22 12:47:47 +02:00
David Hildenbrand
06201e00ee s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to
avoid postcopy hangs") introduced an undesired side effect when combined
with memory ballooning and VM migration: memory part of the inflated
memory balloon will consume memory.

Assuming we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
will consume ~60GiB of memory. If we now trigger a VM migration,
hypervisors like QEMU will read all VM memory. As s390x does not support
the shared zeropage, we'll end up allocating for all previously-inflated
memory part of the memory balloon: 50 GiB. So we might easily
(unexpectedly) crash the VM on the migration source.

Even worse, hypervisors like QEMU optimize for zeropage migration to not
consume memory on the migration destination: when migrating a
"page full of zeroes", on the migration destination they check whether the
target memory is already zero (by reading the destination memory) and avoid
writing to the memory to not allocate memory: however, s390x will also
allocate memory here, implying that also on the migration destination, we
will end up allocating all previously-inflated memory part of the memory
balloon.

This is especially bad if actual memory overcommit was not desired, when
memory ballooning is used for dynamic VM memory resizing, setting aside
some memory during boot that can be added later on demand. Alternatives
like virtio-mem that would avoid this issue are not yet available on
s390x.

There could be ways to optimize some cases in user space: before reading
memory in an anonymous private mapping on the migration source, check via
/proc/self/pagemap if anything is already populated. Similarly check on
the migration destination before reading. While that would avoid
populating tables full of shared zeropages on all architectures, it's
harder to get right and performant, and requires user space changes.

Further, with posctopy live migration we must place a page, so there,
"avoid touching memory to avoid allocating memory" is not really
possible. (Note that a previously we would have falsely inserted
shared zeropages into processes using UFFDIO_ZEROPAGE where
mm_forbids_zeropage() would have actually forbidden it)

PV is currently incompatible with memory ballooning, and in the common
case, KVM guests don't make use of storage keys. Instead of zapping
zeropages when enabling storage keys / PV, that turned out to be
problematic in the past, let's do exactly the same we do with KSM pages:
trigger unsharing faults to replace the shared zeropages by proper
anonymous folios.

What about added latency when enabling storage kes? Having a lot of
zeropages in applicable environments (PV, legacy guests, unittests) is
unexpected. Further, KSM could today already unshare the zeropages
and unmerging KSM pages when enabling storage kets would unshare the
KSM-placed zeropages in the same way, resulting in the same latency.

[ agordeev: Fixed sparse and checkpatch complaints and error handling ]

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-18 15:02:53 +02:00
Peter Xu
0a845e0f63 mm/treewide: replace pud_large() with pud_leaf()
pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Peter Xu
2f709f7bfd mm/treewide: replace pmd_large() with pmd_leaf()
pmd_large() is always defined as pmd_leaf().  Merge their usages.  Chose
pmd_leaf() because pmd_leaf() is a global API, while pmd_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-8-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:19 -08:00
Qi Zheng
57b77b75ca s390: supplement for ptdesc conversion
After commit 6326c26c15 ("s390: convert various pgalloc functions to use
ptdescs"), there are still some positions that use page->{lru, index}
instead of ptdesc->{pt_list, pt_index}.  In order to make the use of
ptdesc->{pt_list, pt_index} clearer, it would be better to convert them as
well.

[zhengqi.arch@bytedance.com: fix build failure]
  Link: https://lkml.kernel.org/r/20240305072154.26168-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/04beaf3255056ffe131a5ea595736066c1e84756.1709541697.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-06 13:04:18 -08:00
Christian Borntraeger
fe752331d4 KVM: s390: vsie: fix race during shadow creation
Right now it is possible to see gmap->private being zero in
kvm_s390_vsie_gmap_notifier resulting in a crash.  This is due to the
fact that we add gmap->private == kvm after creation:

static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
                               struct vsie_page *vsie_page)
{
[...]
        gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
        if (IS_ERR(gmap))
                return PTR_ERR(gmap);
        gmap->private = vcpu->kvm;

Let children inherit the private field of the parent.

Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: a3508fbe9d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
2023-12-21 11:40:18 +01:00
Heiko Carstens
65d37f163a s390/cmma: move arch_set_page_dat() to header file
In order to be usable for early boot code move the simple
arch_set_page_dat() function to header file, and add its counter-part
arch_set_page_nodat(). Also change the parameters, and the function name
slightly.

This is required since there aren't any struct pages available in early
boot code, and renaming of functions is done to make sure that all users
are converted to the new API.

Instead of a pointer to a struct page a virtual address is passed, and
instead of an order the number of pages for which the page state needs be
set.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-11-05 22:34:58 +01:00
Heiko Carstens
1954da4a2b s390/mm: add missing arch_set_page_dat() call to gmap allocations
If the cmma no-dat feature is available all pages that are not used for
dynamic address translation are marked as "no-dat" with the ESSA
instruction. This information is visible to the hypervisor, so that the
hypervisor can optimize purging of guest TLB entries. This also means that
pages which are used for dynamic address translation must not be marked as
"no-dat", since the hypervisor may then incorrectly not purge guest TLB
entries.

Region, segment, and page tables allocated within the gmap code are
incorrectly marked as "no-dat", since an explicit call to
arch_set_page_dat() is missing, which would remove the "no-dat" mark.

In order to fix this add a new gmap_alloc_crst() function which should
be used to allocate region and segment tables, and which also calls
arch_set_page_dat().

Also add the arch_set_page_dat() call to page_table_alloc_pgste().

Cc: <stable@vger.kernel.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-10-25 15:08:11 +02:00
Suren Baghdasaryan
49b0638502 mm: enable page walking API to lock vmas during the walk
walk_page_range() and friends often operate under write-locked mmap_lock. 
With introduction of vma locks, the vmas have to be locked as well during
such walks to prevent concurrent page faults in these areas.  Add an
additional member to mm_walk_ops to indicate locking requirements for the
walk.

The change ensures that page walks which prevent concurrent page faults
by write-locking mmap_lock, operate correctly after introduction of
per-vma locks.  With per-vma locks page faults can be handled under vma
lock without taking mmap_lock at all, so write locking mmap_lock would
not stop them.  The change ensures vmas are properly locked during such
walks.

A sample issue this solves is do_mbind() performing queue_pages_range()
to queue pages for migration.  Without this change a concurrent page
can be faulted into the area and be left out of migration.

Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:07:20 -07:00
Claudio Imbrenda
c2fceb59bb KVM: s390: pv: fix index value of replaced ASCE
The index field of the struct page corresponding to a guest ASCE should
be 0. When replacing the ASCE in s390_replace_asce(), the index of the
new ASCE should also be set to 0.

Having the wrong index might lead to the wrong addresses being passed
around when notifying pte invalidations, and eventually to validity
intercepts (VM crash) if the prefix gets unmapped and the notifier gets
called with the wrong address.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Fixes: faa2f72cb3 ("KVM: s390: pv: leak the topmost page table when destroy fails")
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20230705111937.33472-3-imbrenda@linux.ibm.com>
2023-07-18 11:21:51 +02:00
Linus Torvalds
a452483508 s390 updates for 6.5 merge window part 2
- Fix virtual vs physical address confusion in vmem_add_range()
   and vmem_remove_range() functions.
 
 - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
   throughout s390 code.
 
 - Make all PSW related defines also available for assembler files.
   Remove PSW_DEFAULT_KEY define from uapi for that.
 
 - When adding an undefined symbol the build still succeeds, but
   userspace crashes trying to execute VDSO, because the symbol
   is not resolved. Add undefined symbols check to prevent that.
 
 - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
   memory when executing s390 crypto adapter IOCTL.
 
 - Add -fPIE flag to prevent decompressor misaligned symbol build
   error with clang.
 
 - Use .balign instead of .align everywhere. This is a no-op for s390,
   but with this there no mix in using .align and .balign anymore.
 
 - Filter out -mno-pic-data-is-text-relative flag when compiling
   kernel to prevent VDSO build error.
 
 - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
   mask directly.
 
 - Do not retry administrative requests to some s390 crypto cards,
   since the firmware assumes replay attacks.
 
 - Remove most of the debug code, which is build in when kernel config
   option CONFIG_ZCRYPT_DEBUG is enabled.
 
 - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
   off the multiple devices support for the s390 zcrypt device driver.
 
 - With the conversion to generic entry machine checks are accounted
   to the current context instead of irq time. As result, the STCKF
   instruction at the beginning of the machine check handler and the
   lowcore member are no longer required, therefore remove it.
 
 - Fix various typos found with codespell.
 
 - Minor cleanups to CPU-measurement Counter and Sampling Facilities code.
 
 - Revert patch that removes VMEM_MAX_PHYS macro, since it causes
   a regression.
 -----BEGIN PGP SIGNATURE-----
 
 iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZKakSBccYWdvcmRlZXZA
 bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8HeIAQCg9RX3/olsZhCqRNLZ/O+6FXAF
 29ohi2JmVqxJBKkmwgEA/QXCjoTOp41pQJ1FD39HnI8DeYpJFRnYYE5D3acibAw=
 =2Ykk
 -----END PGP SIGNATURE-----

Merge tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull more s390 updates from Alexander Gordeev:

 - Fix virtual vs physical address confusion in vmem_add_range() and
   vmem_remove_range() functions

 - Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
   throughout s390 code

 - Make all PSW related defines also available for assembler files.
   Remove PSW_DEFAULT_KEY define from uapi for that

 - When adding an undefined symbol the build still succeeds, but
   userspace crashes trying to execute VDSO, because the symbol is not
   resolved. Add undefined symbols check to prevent that

 - Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
   memory when executing s390 crypto adapter IOCTL

 - Add -fPIE flag to prevent decompressor misaligned symbol build error
   with clang

 - Use .balign instead of .align everywhere. This is a no-op for s390,
   but with this there no mix in using .align and .balign anymore

 - Filter out -mno-pic-data-is-text-relative flag when compiling kernel
   to prevent VDSO build error

 - Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
   mask directly

 - Do not retry administrative requests to some s390 crypto cards, since
   the firmware assumes replay attacks

 - Remove most of the debug code, which is build in when kernel config
   option CONFIG_ZCRYPT_DEBUG is enabled

 - Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
   off the multiple devices support for the s390 zcrypt device driver

 - With the conversion to generic entry machine checks are accounted to
   the current context instead of irq time. As result, the STCKF
   instruction at the beginning of the machine check handler and the
   lowcore member are no longer required, therefore remove it

 - Fix various typos found with codespell

 - Minor cleanups to CPU-measurement Counter and Sampling Facilities
   code

 - Revert patch that removes VMEM_MAX_PHYS macro, since it causes a
   regression

* tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits)
  Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
  s390/cpum_sf: remove check on CPU being online
  s390/cpum_sf: handle casts consistently
  s390/cpum_sf: remove unnecessary debug statement
  s390/cpum_sf: remove parameter in call to pr_err
  s390/cpum_sf: simplify function setup_pmu_cpu
  s390/cpum_cf: remove unneeded debug statements
  s390/entry: remove mcck clock
  s390: fix various typos
  s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option
  s390/zcrypt: do not retry administrative requests
  s390/zcrypt: cleanup some debug code
  s390/entry: rework entering DAT-on mode on CPU restart
  s390/mm: fence off VM macros from asm and linker
  s390: include linux/io.h instead of asm/io.h
  s390/ptrace: make all psw related defines also available for asm
  s390/ptrace: remove PSW_DEFAULT_KEY from uapi
  s390/vdso: filter out mno-pic-data-is-text-relative cflag
  s390: consistently use .balign instead of .align
  s390/decompressor: fix misaligned symbol build error
  ...
2023-07-06 13:18:30 -07:00
Heiko Carstens
cada938a01 s390: fix various typos
Fix various typos found with codespell.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-07-03 11:19:42 +02:00
Hugh Dickins
b2f58941ad s390: gmap use pte_unmap_unlock() not spin_unlock()
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).

Link: https://lkml.kernel.org/r/78873af-e1ec-4f9-47ac-483940ac6daa@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:09 -07:00
Hugh Dickins
5c7f3bf04a s390: allow pte_offset_map_lock() to fail
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.

Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:09 -07:00
Linus Torvalds
493804a689 RISC-V:
- ONE_REG interface to enable/disable SBI extensions
 - Zbb extension for Guest/VM
 - AIA CSR virtualization
 
 x86:
 - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU can
   result in the root being freed even though the root is completely valid
   and can be reused as-is (with a TLB flush).
 
 s390:
 - A couple bugfixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRU198UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOeiwf+OdIcGxOlt9D/qNys/U+2kkhZK6WU
 1Jrwn+S5onea2LGyBucHk3V6O1rWck8aOqfRAmFBQc5s/gu8EySW5Ic17kqgqLwc
 MSr2c70qw3/6U2LoTeVHx21V4Jgf9iPOZu1LAXf26F0jSCAoWlACO9OQd0rJoDmA
 cpUtHSAFKxuPA/8Ix+SH4ddzKhCYEs4QejMLmQIgBWyaGShdi8dvaMoKyoceHN2w
 ntrhSaz+BbCCZKaFwtNCHVaigyioqmwTHjsmMFjzEN3N1ly1XZfN2E7vvT4j3v7l
 o55eMpJnu7+TVs5LEMvj25nuxgts6UzJnoALJAO2cDsHW5F3FvjO9HGjTg==
 =NHEm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more kvm updates from Paolo Bonzini:
 "This includes the 6.4 changes for RISC-V, and a few bugfix patches for
  other architectures. For x86, this closes a longstanding performance
  issue in the newer and (usually) more scalable page table management
  code.

  RISC-V:
   - ONE_REG interface to enable/disable SBI extensions
   - Zbb extension for Guest/VM
   - AIA CSR virtualization

  x86:
   - Fix a long-standing TDP MMU flaw, where unloading roots on a vCPU
     can result in the root being freed even though the root is
     completely valid and can be reused as-is (with a TLB flush).

  s390:
   - A couple of bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: fix race in gmap_make_secure()
  KVM: s390: pv: fix asynchronous teardown for small VMs
  KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated
  RISC-V: KVM: Virtualize per-HART AIA CSRs
  RISC-V: KVM: Use bitmap for irqs_pending and irqs_pending_mask
  RISC-V: KVM: Add ONE_REG interface for AIA CSRs
  RISC-V: KVM: Implement subtype for CSR ONE_REG interface
  RISC-V: KVM: Initial skeletal support for AIA
  RISC-V: KVM: Drop the _MASK suffix from hgatp.VMID mask defines
  RISC-V: Detect AIA CSRs from ISA string
  RISC-V: Add AIA related CSR defines
  RISC-V: KVM: Allow Zbb extension for Guest/VM
  RISC-V: KVM: Add ONE_REG interface to enable/disable SBI extensions
  RISC-V: KVM: Alphabetize selects
  KVM: RISC-V: Retry fault if vma_lookup() results become invalid
2023-05-05 12:17:01 -07:00
Paolo Bonzini
7a8016d956 For 6.4
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEoWuZBM6M3lCBSfTnuARItAMU6BMFAmRT4cIACgkQuARItAMU
 6BPZ6xAA0T0AhPUuc4JyKH4p2ZFaH61+1YpGHlxxCyIF1gm4JMHS0i1CRii+o2Zh
 k3w76MafGau9ADXwrjJAh+C+WihwwgTtSyN2/amOn70K75iK7GGbT125c6VX6A1I
 6L6lFEO/3JURrhQE4TthgH6HUVyXBC4XobQqajWtxGgcq/QRkjtgdbLPPv9HuH23
 NgXZ85+lpPrak9U0Zzu2ez1O9VIlABWbTE6B6DgcyQshwqBZDOSFRTxxJ883XtqH
 7sRMmVJBkiF3HRqvDAmxI0eUOrR8YNtl1PN441iadmvfqG4IDbPtuJ0UtiyNagXT
 g7UYGvv7Qj2z4y2QhQSgLiIC0Zrl/T6Vw/7oXt0CVybhEMuAAeshWzHIl5aR5aKt
 7sY2ijUX290zwuACft9ZjDspUWyOPkvqym5UrldTj1ZaYV9w7iHNezFbXTnXOBh9
 vEiet5p3SCeQNymFKXy2R+8YJ6IIsH3Mxf6SYo2CQFNNjOzfUHwKMix/uArFdXF5
 CQD6jM9NXTiPIjwpXIJjyutJ6USWZhsMCneHtItzfWTa1YnBSZTwIWU1fB05Snk/
 BcDVfJ9Pq142N983SNXaBYIGzxpYPUHYZQmKvAzbeLHcaxrgLJM9/x8zcXIxc8Wa
 sEFoo5W8DCIJRqjk7dsLyLEvPJdoBZidg6sd0MwHRGX0kDRzHEo=
 =PPnV
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-6.4-2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

For 6.4
2023-05-05 06:15:09 -04:00
Claudio Imbrenda
292a7d6fca KVM: s390: pv: fix asynchronous teardown for small VMs
On machines without the Destroy Secure Configuration Fast UVC, the
topmost level of page tables is set aside and freed asynchronously
as last step of the asynchronous teardown.

Each gmap has a host_to_guest radix tree mapping host (userspace)
addresses (with 1M granularity) to gmap segment table entries (pmds).

If a guest is smaller than 2GB, the topmost level of page tables is the
segment table (i.e. there are only 2 levels). Replacing it means that
the pointers in the host_to_guest mapping would become stale and cause
all kinds of nasty issues.

This patch fixes the issue by disallowing asynchronous teardown for
guests with only 2 levels of page tables. Userspace should (and already
does) try using the normal destroy if the asynchronous one fails.

Update s390_replace_asce so it refuses to replace segment type ASCEs.
This is still needed in case the normal destroy VM fails.

Fixes: fb491d5500 ("KVM: s390: pv: asynchronous destroy for reboot")
Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230421085036.52511-2-imbrenda@linux.ibm.com>
2023-05-04 18:24:41 +02:00
David Hildenbrand
2c281f54f5 mm/ksm: move disabling KSM from s390/gmap code to KSM code
Let's factor out actual disabling of KSM.  The existing "mm->def_flags &=
~VM_MERGEABLE;" was essentially a NOP and can be dropped, because
def_flags should never include VM_MERGEABLE.  Note that we don't currently
prevent re-enabling KSM.

This should now be faster in case KSM was never enabled, because we only
conditionally iterate all VMAs.  Further, it certainly looks cleaner.

Link: https://lkml.kernel.org/r/20230422210156.33630-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Stefan Roesch <shr@devkernel.io>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-02 17:21:50 -07:00
Stefan Roesch
d7597f59d1 mm: add new api to enable ksm per process
Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language.  An
  example for this are programs with forked workloads using a garbage
  collected language without pointers.  In such a language madvise cannot
  be made available.

  In addition the addresses of objects get moved around as they are
  garbage collected.  KSM sharing needs to be enabled "from the outside"
  for these type of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings
  no benefit or even has overhead.  We'd like to be able to enable KSM on
  a workload by workload basis.

Use case 3:
  With the madvise call sharing opportunities are only enabled for the
  current process: it is a workload-local decision.  A considerable number
  of sharing opportunities may exist across multiple workloads or jobs (if
  they are part of the same security domain).  Only a higler level entity
  like a job scheduler or container can know for certain if its running
  one or more instances of a job.  That job scheduler however doesn't have
  the necessary internal workload knowledge to make targeted madvise
  calls.

Security concerns:

  In previous discussions security concerns have been brought up.  The
  problem is that an individual workload does not have the knowledge about
  what else is running on a machine.  Therefore it has to be very
  conservative in what memory areas can be shared or not.  However, if the
  system is dedicated to running multiple jobs within the same security
  domain, its the job scheduler that has the knowledge that sharing can be
  safely enabled and is even desirable.

Performance:

  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an instagram workload (taken from a machine
  with 64GB main memory):

   full_scans: 445
   general_profit: 20158298048
   max_page_sharing: 256
   merge_across_nodes: 1
   pages_shared: 129547
   pages_sharing: 5119146
   pages_to_scan: 4000
   pages_unshared: 1760924
   pages_volatile: 10761341
   run: 1
   sleep_millisecs: 20
   stable_node_chains: 167
   stable_node_chains_prune_millisecs: 2000
   stable_node_dups: 2751
   use_zero_pages: 0
   zero_pages_sharing: 0

After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.


Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call. 
   The first one allows to enable KSM at the process level and the second
   one to query the setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation,
   but not calculated.  This adds the general profit metric to
   /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests. 
   This allows to test madvise and prctl KSM.

   It also adds a two new tests to ksm_functional_tests: one to test
   the new prctl options and the other one is a fork test to verify that
   the KSM process setting is inherited by client processes.


This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

1. New options for prctl system command

   This patch series adds two new options to the prctl system call.
   The first one allows to enable KSM at the process level and the second
   one to query the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a
   cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing

   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

  1) Introduce new MMF_VM_MERGE_ANY flag

     This introduces the new flag MMF_VM_MERGE_ANY flag.  When this flag
     is set, kernel samepage merging (ksm) gets enabled for all vma's of a
     process.

  2) Setting VM_MERGEABLE on VMA creation

     When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
     VM_MERGEABLE flag will be set for this VMA.

  3) support disabling of ksm for a process

     This adds the ability to disable ksm for a process if ksm has been
     enabled for the process with prctl.

  4) add new prctl option to get and set ksm for a process

     This adds two new options to the prctl system call
     - enable ksm for all vmas of a process (if the vmas support it).
     - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

   In the s390 architecture when storage keys are used, the
   MMF_VM_MERGE_ANY will be disabled.

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21 14:52:03 -07:00
Suren Baghdasaryan
ff126c0ece mm: replace vma->vm_flags indirect modification in ksm_madvise
Replace indirect modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

Link: https://lkml.kernel.org/r/20230126193752.297968-6-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Reichel <sebastian.reichel@collabora.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00
Suren Baghdasaryan
1c71222e5f mm: replace vma->vm_flags direct modifications with modifier calls
Replace direct modifications to vma->vm_flags with calls to modifier
functions to be able to track flag changes and to keep vma locking
correctness.

[akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo]
Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Oskolkov <posk@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-09 16:51:39 -08:00