- Add vfio-ap support to pass-through crypto devices to secure execution
guests
- Add API ordinal 6 support to zcrypt_ep11misc device drive, which is
required to handle key generate and key derive (e.g. secure key to
protected key) correctly
- Add missing secure/has_secure sysfs files for the case where it is not
possible to figure where a system has been booted from. Existing user
space relies on that these files are always present
- Fix DCSS block device driver list corruption, caused by incorrect
error handling
- Convert virt_to_pfn() and pfn_to_virt() from defines to static inline
functions to enforce type checking
- Cleanups, improvements, and minor fixes to the kernel mapping setup
- Fix various virtual vs physical address confusions
- Move pfault code to separate file, since it has nothing to do with
regular fault handling
- Move s390 documentation to Documentation/arch/ like it has been done
for other architectures already
- Add HAVE_FUNCTION_GRAPH_RETVAL support
- Factor out the s390_hypfs filesystem and add a new config option for
it. The filesystem is deprecated and as soon as all users are gone it
can be removed some time in the not so near future
- Remove support for old CEX2 and CEX3 crypto cards from zcrypt device
driver
- Add support for user-defined certificates: receive user-defined
certificates with a diagnose call and provide them via 'cert_store'
keyring to user space
- Couple of other small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmTrqNYACgkQIg7DeRsp
bsKkUBAApWXr3WCJA2tige34AnFwmskx4sBxl/fgwcwJrC55fED1jKWaiXOM6isv
P+hqavZnks3gXZdYcD3kxXkNMh+fPNWw7BAL35J5Gu1VShA/jlbTC6ZrvUO3t+Fy
NsdLvBDbNDdyUzQF7w0Xb0jyIxqhJTRyhLfR5oXES63FHomv2F/vofu4jWR/q+cc
F9mcnoDeN4zLdssdvl6WtPX4nEY9RpG0QOh67drnxuq+8v7sL8gKN4ti94Rp6vhs
g4NhNs9xgRIPoOcX2KlSIdFqO9P12jSXZq0G4HcOp8UGQvgU/mS+UG3pQwV3ZJLS
3/kUJZ4/CwQa1xUFtPGP1/4AngGNOnhT9FCD4KrqjDkRZmLsd5RvURe6L1zQ3vbZ
KnX7q0Otx4xRVYPlbHb9aP+tC7f3Q10ytBAps616qZoA/2SMss2BLZiiPBpCCvDp
L+9dRhBGYCP2PSe6H/qGQFfMW+uY7QF+NDcDAT5mX1lS8OVrGJxqM7Q+sY2pMLGo
5nR16LvM9g6W/ZnsVn0+BWg4CgaPMi+PMfMPxs/o9RG+/0d1AJx1aLSiHdP1pXog
8/Wg4GaaJ27S4Ers0JUmH7VDO+QkkLvAArstjk8l59r1XslWiBP5USebkxtgu6EQ
ehAh0+oa432ALq8Rn1FK/X+pWFumbTVf8OPwR8YEjDbeTPIBCqg=
=ewd9
-----END PGP SIGNATURE-----
Merge tag 's390-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Add vfio-ap support to pass-through crypto devices to secure
execution guests
- Add API ordinal 6 support to zcrypt_ep11misc device drive, which is
required to handle key generate and key derive (e.g. secure key to
protected key) correctly
- Add missing secure/has_secure sysfs files for the case where it is
not possible to figure where a system has been booted from. Existing
user space relies on that these files are always present
- Fix DCSS block device driver list corruption, caused by incorrect
error handling
- Convert virt_to_pfn() and pfn_to_virt() from defines to static inline
functions to enforce type checking
- Cleanups, improvements, and minor fixes to the kernel mapping setup
- Fix various virtual vs physical address confusions
- Move pfault code to separate file, since it has nothing to do with
regular fault handling
- Move s390 documentation to Documentation/arch/ like it has been done
for other architectures already
- Add HAVE_FUNCTION_GRAPH_RETVAL support
- Factor out the s390_hypfs filesystem and add a new config option for
it. The filesystem is deprecated and as soon as all users are gone it
can be removed some time in the not so near future
- Remove support for old CEX2 and CEX3 crypto cards from zcrypt device
driver
- Add support for user-defined certificates: receive user-defined
certificates with a diagnose call and provide them via 'cert_store'
keyring to user space
- Couple of other small fixes and improvements all over the place
* tag 's390-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (66 commits)
s390/pci: use builtin_misc_device macro to simplify the code
s390/vfio-ap: make sure nib is shared
KVM: s390: export kvm_s390_pv*_is_protected functions
s390/uv: export uv_pin_shared for direct usage
s390/vfio-ap: check for TAPQ response codes 0x35 and 0x36
s390/vfio-ap: handle queue state change in progress on reset
s390/vfio-ap: use work struct to verify queue reset
s390/vfio-ap: store entire AP queue status word with the queue object
s390/vfio-ap: remove upper limit on wait for queue reset to complete
s390/vfio-ap: allow deconfigured queue to be passed through to a guest
s390/vfio-ap: wait for response code 05 to clear on queue reset
s390/vfio-ap: clean up irq resources if possible
s390/vfio-ap: no need to check the 'E' and 'I' bits in APQSW after TAPQ
s390/ipl: refactor deprecated strncpy
s390/ipl: fix virtual vs physical address confusion
s390/zcrypt_ep11misc: support API ordinal 6 with empty pin-blob
s390/paes: fix PKEY_TYPE_EP11_AES handling for secure keyblobs
s390/pkey: fix PKEY_TYPE_EP11_AES handling for sysfs attributes
s390/pkey: fix PKEY_TYPE_EP11_AES handling in PKEY_VERIFYKEY2 IOCTL
s390/pkey: fix PKEY_TYPE_EP11_AES handling in PKEY_KBLOB2PROTK[23]
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc
ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z
dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA=
=24Ha
-----END PGP SIGNATURE-----
Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fchmodat2 system call from Christian Brauner:
"This adds the fchmodat2() system call. It is a revised version of the
fchmodat() system call, adding a missing flag argument. Support for
both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included.
Adding this system call revision has been a longstanding request but
so far has always fallen through the cracks. While the kernel
implementation of fchmodat() does not have a flag argument the libc
provided POSIX-compliant fchmodat(3) version does. Both glibc and musl
have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW
(see [1] and [2]).
The workaround is brittle because it relies not just on O_PATH and
O_NOFOLLOW semantics and procfs magic links but also on our rather
inconsistent symlink semantics.
This gives userspace a proper fchmodat2() system call that libcs can
use to properly implement fchmodat(3) and allows them to get rid of
their hacks. In this case it will immediately benefit them as the
current workaround is already defunct because of aformentioned
inconsistencies.
In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use
AT_EMPTY_PATH with fchmodat2(). This is already possible with
fchownat() so there's no reason to not also support it for
fchmodat2().
The implementation is simple and comes with selftests. Implementation
of the system call and wiring up the system call are done as separate
patches even though they could arguably be one patch. But in case
there are merge conflicts from other system call additions it can be
beneficial to have separate patches"
Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1]
Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2]
* tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
selftests: fchmodat2: remove duplicate unneeded defines
fchmodat2: add support for AT_EMPTY_PATH
selftests: Add fchmodat2 selftest
arch: Register fchmodat2, usually as syscall 452
fs: Add fchmodat2()
Non-functional cleanup of a "__user * filename"
Introduces a function to check the existence of an UV feature.
Refactor feature bit checks to use the new function.
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Michael Mueller <mimu@linux.ibm.com>
Link: https://lore.kernel.org/r/20230815151415.379760-3-seiden@linux.ibm.com
Message-Id: <20230815151415.379760-3-seiden@linux.ibm.com>
Tony Krowiak says:
===================
This patch series is for the changes required in the vfio_ap device
driver to facilitate pass-through of crypto devices to a secure
execution guest. In particular, it is critical that no data from the
queues passed through to the SE guest is leaked when the guest is
destroyed. There are also some new response codes returned from the
PQAP(ZAPQ) and PQAP(TAPQ) commands that have been added to the
architecture in support of pass-through of crypto devices to SE guests;
these need to be accounted for when handling the reset of queues.
===================
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Export the uv_pin_shared function so that it can be called from other
modules that carry a GPL-compatible license.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Tested-by: Viktor Mihajlovski <mihajlov@linux.ibm.com>
Link: https://lore.kernel.org/r/20230815184333.6554-11-akrowiak@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
`strncpy` is deprecated for use on NUL-terminated destination strings [1].
Use `strscpy` which has the same behavior as `strncpy` here with the
extra safeguard of guaranteeing NUL-termination of destination
strings. In it's current form, this may result in silent truncation
if the src string has the same size as the destination string.
[hca@linux.ibm.com: use strscpy() instead of strscpy_pad()]
Link: www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings[1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20230811-arch-s390-kernel-v1-1-7edbeeab3809@google.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The value of ipl_cert_list_addr boot variable contains
a physical address, which is used directly. That works
because virtual and physical address spaces are currently
the same, but otherwise it is wrong.
While at it, fix also a comment for the platform keyring.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20230816132942.2540411-1-agordeev@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
All ipl types have 'secure','has_secure' and type parameters. Move
these to a common ipl parameter group so that they don't need to be
present in each ipl parameter group.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
OS installers are relying on /sys/firmware/ipl/has_secure to be
present on machines supporting secure boot. This file is present
for all IPL types, but not the unknown type, which prevents a secure
installation when an LPAR is booted in HMC via FTP(s), because
this is an unknown IPL type in linux. While at it, also add the secure
file.
Fixes: c9896acc78 ("s390/ipl: Provide has_secure sysfs attribute")
Cc: stable@vger.kernel.org
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit ddb5cdbafa ("kbuild: generate KSYMTAB entries by modpost")
deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>.
Replace #include <asm/export.h> with #include <linux/export.h>.
After all the <asm/export.h> lines are converted, <asm/export.h> and
<asm-generic/export.h> will be removed.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20230806151641.394720-2-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Globally setting a bit in control registers is done with
smp_ctl_set_clear_bit(). This is using on_each_cpu() to execute a function
which actually sets the control register bit on each online CPU. This can
be problematic since on_each_cpu() does not prevent that new CPUs come
online while it is executed, which in turn means that control register
updates could be missing on new CPUs.
In order to prevent this problem make sure that global control register
contents cannot change until new CPUs have initialized their control
registers, and marked themselves online, so they are included in subsequent
on_each_cpu() calls.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The 'rc' will be re-assigned to 0 after calling get_vcssb(), it
needs be set to error code if create_cs_keyring() fails.
[hca@linux.ibm.com: slightly changed coding style]
Fixes: 8cf57d7217 ("s390: add support for user-defined certificates")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Link: https://lore.kernel.org/r/20230728084228.3186083-1-yangyingliang@huawei.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The pfault code has nothing to do with regular fault handling.
Therefore move it to an own C file. Also add an own pfault header
file. This way changes to setup.h don't cause a recompile of the
pfault code and vice versa.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 9fb6c9b3fe ("s390/sthyi: add cache to store hypervisor info")
added cache handling for store hypervisor info. This also changed the
possible return code for sthyi_fill().
Instead of only returning a condition code like the sthyi instruction would
do, it can now also return a negative error value (-ENOMEM). handle_styhi()
was not changed accordingly. In case of an error, the negative error value
would incorrectly injected into the guest PSW.
Add proper error handling to prevent this, and update the comment which
describes the possible return values of sthyi_fill().
Fixes: 9fb6c9b3fe ("s390/sthyi: add cache to store hypervisor info")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20230727182939.2050744-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Nathan Chancellor reported the following build error when compiling the
kernel with CONFIG_MARCH_Z10=y:
arch/s390/kernel/mcount.S: Assembler messages:
arch/s390/kernel/mcount.S:140: Error: Unrecognized opcode: `aghik'
The aghik instruction is only available since z196. Use the la instruction
instead which is available for all machines.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/all/20230725211105.GA224840@dev-arch.thelio-3990X
Fixes: 1256e70a08 ("s390/ftrace: enable HAVE_FUNCTION_GRAPH_RETVAL")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org> # build
Link: https://lore.kernel.org/r/20230726061834.1300984-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The comment above diag8c() describes diagnose 210, not diagnose 8c.
Add a proper short description.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This registers the new fchmodat2 syscall in most places as nuber 452,
with alpha being the exception where it's 562. I found all these sites
by grepping for fspick, which I assume has found me everything.
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
ftrace_trace_function expects a struct ftrace_regs, but the s390
architecure code passes struct pt_regs. This isn't a problem with the
current code because struct ftrace_regs contains only one member:
struct pt_regs. To avoid issues in the future this should be fixed.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As per description in mm/memory_hotplug.c platforms should define
arch_get_mappable_range() that provides maximum possible addressable
physical memory range for which the linear mapping could be created.
The current implementation uses VMEM_MAX_PHYS macro as the maximum
mappable physical address and it is simply a cast to vmemmap. Since
the address is in physical address space the natural upper limit of
MAX_PHYSMEM_BITS is honoured:
vmemmap_start = min(vmemmap_start, 1UL << MAX_PHYSMEM_BITS);
Further, to make sure the identity mapping would not overlay with
vmemmap, the size of identity mapping could be stripped like this:
ident_map_size = min(ident_map_size, vmemmap_start);
Similarily, any other memory that could be added (e.g DCSS segment)
should not overlay with vmemmap as well and that is prevented by
using vmemmap (VMEM_MAX_PHYS macro) as the upper limit.
However, while the use of VMEM_MAX_PHYS brings the desired result
it actually poses two issues:
1. As described, vmemmap is handled as a physical address, although
it is actually a pointer to struct page in virtual address space.
2. As vmemmap is a virtual address it could have been located
anywhere in the virtual address space. However, the desired
necessity to honour MAX_PHYSMEM_BITS limit prevents that.
Rework arch_get_mappable_range() callback in a way it does not
use VMEM_MAX_PHYS macro and does not confuse the notion of virtual
vs physical address spacees as result. That paves the way for moving
vmemmap elsewhere and optimizing the virtual address space layout.
Introduce max_mappable preserved boot variable and let function
setup_kernel_memory_layout() set it up. As result, the rest of the
code is does not need to know the virtual memory layout specifics.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make machine_kexec.o and relocate_kernel.o depend on
CONFIG_KEXEC_CORE option as other architectures do.
Still generate machine_kexec_reloc.o unconditionally,
since arch_kexec_do_relocs() function is neded by the
decompressor.
Suggested-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add support for tracing return values in the function graph tracer.
This requires return_to_handler() to record gpr2 and the frame pointer
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Simplify memory allocation for diagnose 204 memory buffer:
- allocate with __vmalloc_node() to enure page alignment
- allocate real / physical memory area also within vmalloc area and handle
vmalloc to real / physical address translation within diag204().
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
vmalloc() does not guarantee any alignment, unless it is explicitly
requested with e.g. __vmalloc_node(). Using diag204() with subcode 7
requires a 4k aligned virtual buffer. Therefore switch to __vmalloc_node().
Note: with the current vmalloc() implementation callers would still get a
4k aligned area, even though this is quite non-obvious looking at the
code. So changing this in sthyi doesn't fix a real bug. It is just to make
sure the code will not suffer from some obscure options, like it happened
in the past with kmalloc() where debug options changed the assumed
alignment of allocated memory areas.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Diagnose 204 subcode 4 requires a real (physical) address, but a
virtual address is passed to the inline assembly.
Convert the address to a physical address for only this specific case.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Enable receiving the user-defined certificates from the s390x
hypervisor via new diagnose 0x320 calls, and make them available to the
Linux root user as 'cert_store_key' type keys in a so-called
'cert_store' keyring.
New user-space interfaces:
/sys/firmware/cert_store/refresh
Writing to this attribute re-fetches certificates via DIAG 0x320
/sys/firmware/cert_store/cs_status
Reading from this attribute returns either of:
"uninitialized"
If no certificate has been retrieved yet
"ok"
If certificates have been successfully retrieved
"failed (<number>)"
If certificate retrieval failed with reason code <number>
New debug trace areas:
/sys/kernel/debug/s390dbf/cert_store_msg
/sys/kernel/debug/s390dbf/cert_store_hexdump
Usage example:
To initiate request for certificates available to the system as root:
$ echo 1 > /sys/firmware/cert_store/refresh
Upon success the '/sys/firmware/cert_store/cs_status' contains
the value 'ok'.
$ cat /sys/firmware/cert_store/cs_status
ok
Get the ID of the keyring 'cert_store':
$ keyctl search @us keyring cert_store
OR
$ keyctl link @us @s; keyctl request keyring cert_store
Obtain list of IDs of certificates:
$ keyctl rlist <cert_store keyring ID>
Display certificate content as hex-dump:
$ keyctl read <certificate ID>
Read certificate contents as binary data:
$ keyctl pipe <certificate ID> >cert_data
Display certificate description:
$ keyctl describe <certificate ID>
The certificate description has the following format:
<64 bytes certificate name in EBCDIC> ':'
<certificate index as obtained from hypervisor> ':'
<certificate store token obtained from hypervisor>
The certificate description in /proc/keys has certificate name
represented in ASCII.
Users can read but cannot update the content of the certificate.
Signed-off-by: Anastasia Eskova <anastasia.eskova@ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Fix virtual vs physical address confusion in vmem_add_range()
and vmem_remove_range() functions.
- Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
throughout s390 code.
- Make all PSW related defines also available for assembler files.
Remove PSW_DEFAULT_KEY define from uapi for that.
- When adding an undefined symbol the build still succeeds, but
userspace crashes trying to execute VDSO, because the symbol
is not resolved. Add undefined symbols check to prevent that.
- Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
memory when executing s390 crypto adapter IOCTL.
- Add -fPIE flag to prevent decompressor misaligned symbol build
error with clang.
- Use .balign instead of .align everywhere. This is a no-op for s390,
but with this there no mix in using .align and .balign anymore.
- Filter out -mno-pic-data-is-text-relative flag when compiling
kernel to prevent VDSO build error.
- Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
mask directly.
- Do not retry administrative requests to some s390 crypto cards,
since the firmware assumes replay attacks.
- Remove most of the debug code, which is build in when kernel config
option CONFIG_ZCRYPT_DEBUG is enabled.
- Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
off the multiple devices support for the s390 zcrypt device driver.
- With the conversion to generic entry machine checks are accounted
to the current context instead of irq time. As result, the STCKF
instruction at the beginning of the machine check handler and the
lowcore member are no longer required, therefore remove it.
- Fix various typos found with codespell.
- Minor cleanups to CPU-measurement Counter and Sampling Facilities code.
- Revert patch that removes VMEM_MAX_PHYS macro, since it causes
a regression.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZKakSBccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8HeIAQCg9RX3/olsZhCqRNLZ/O+6FXAF
29ohi2JmVqxJBKkmwgEA/QXCjoTOp41pQJ1FD39HnI8DeYpJFRnYYE5D3acibAw=
=2Ykk
-----END PGP SIGNATURE-----
Merge tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:
- Fix virtual vs physical address confusion in vmem_add_range() and
vmem_remove_range() functions
- Include <linux/io.h> instead of <asm/io.h> and <asm-generic/io.h>
throughout s390 code
- Make all PSW related defines also available for assembler files.
Remove PSW_DEFAULT_KEY define from uapi for that
- When adding an undefined symbol the build still succeeds, but
userspace crashes trying to execute VDSO, because the symbol is not
resolved. Add undefined symbols check to prevent that
- Use kvmalloc_array() instead of kzalloc() for allocaton of 256k
memory when executing s390 crypto adapter IOCTL
- Add -fPIE flag to prevent decompressor misaligned symbol build error
with clang
- Use .balign instead of .align everywhere. This is a no-op for s390,
but with this there no mix in using .align and .balign anymore
- Filter out -mno-pic-data-is-text-relative flag when compiling kernel
to prevent VDSO build error
- Rework entering of DAT-on mode on CPU restart to use PSW_KERNEL_BITS
mask directly
- Do not retry administrative requests to some s390 crypto cards, since
the firmware assumes replay attacks
- Remove most of the debug code, which is build in when kernel config
option CONFIG_ZCRYPT_DEBUG is enabled
- Remove CONFIG_ZCRYPT_MULTIDEVNODES kernel config option and switch
off the multiple devices support for the s390 zcrypt device driver
- With the conversion to generic entry machine checks are accounted to
the current context instead of irq time. As result, the STCKF
instruction at the beginning of the machine check handler and the
lowcore member are no longer required, therefore remove it
- Fix various typos found with codespell
- Minor cleanups to CPU-measurement Counter and Sampling Facilities
code
- Revert patch that removes VMEM_MAX_PHYS macro, since it causes a
regression
* tag 's390-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (25 commits)
Revert "s390/mm: get rid of VMEM_MAX_PHYS macro"
s390/cpum_sf: remove check on CPU being online
s390/cpum_sf: handle casts consistently
s390/cpum_sf: remove unnecessary debug statement
s390/cpum_sf: remove parameter in call to pr_err
s390/cpum_sf: simplify function setup_pmu_cpu
s390/cpum_cf: remove unneeded debug statements
s390/entry: remove mcck clock
s390: fix various typos
s390/zcrypt: remove ZCRYPT_MULTIDEVNODES kernel config option
s390/zcrypt: do not retry administrative requests
s390/zcrypt: cleanup some debug code
s390/entry: rework entering DAT-on mode on CPU restart
s390/mm: fence off VM macros from asm and linker
s390: include linux/io.h instead of asm/io.h
s390/ptrace: make all psw related defines also available for asm
s390/ptrace: remove PSW_DEFAULT_KEY from uapi
s390/vdso: filter out mno-pic-data-is-text-relative cflag
s390: consistently use .balign instead of .align
s390/decompressor: fix misaligned symbol build error
...
During sampling event initialization, a check is done if that
particular CPU the event is to be installed on is actually online.
This check is not necessary, as it is also performed in the
system call entry point. Therefore remove this check.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The casts are written in two different notations:
(cast) expression
and
(cast)expression
Convert statements with the first notation to the second notation.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Remove debug_sprint_event() statement right after an pr_err()
statement. No additional debug information is generated.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The op argument is hardcoded in the parameter list of function pr_err.
Make the op code part of the text printed by pr_err.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Print the error message when the FAILURE flag is set.
This saves on pr_err statement as the text of the error message
is identical in both failures.
Also observe reverse Xmas tree variable declarations in this function.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Remove most debug statements which are not needed anymore from
the CPU Measurement counter facility device driver.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
* Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the stage-2
fault path.
* Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with
services that live in the Secure world. pKVM intervenes on FF-A calls
to guarantee the host doesn't misuse memory donated to the hyp or a
pKVM guest.
* Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
* Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set configuration
from userspace, but the intent is to relax this limitation and allow
userspace to select a feature set consistent with the CPU.
* Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
* Use a separate set of pointer authentication keys for the hypervisor
when running in protected mode, as the host is untrusted at runtime.
* Ensure timer IRQs are consistently released in the init failure
paths.
* Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps
(FEAT_EVT), as it is a register commonly read from userspace.
* Erratum workaround for the upcoming AmpereOne part, which has broken
hardware A/D state management.
RISC-V:
* Redirect AMO load/store misaligned traps to KVM guest
* Trap-n-emulate AIA in-kernel irqchip for KVM guest
* Svnapot support for KVM Guest
s390:
* New uvdevice secret API
* CMM selftest and fixes
* fix racy access to target CPU for diag 9c
x86:
* Fix missing/incorrect #GP checks on ENCLS
* Use standard mmu_notifier hooks for handling APIC access page
* Drop now unnecessary TR/TSS load after VM-Exit on AMD
* Print more descriptive information about the status of SEV and SEV-ES during
module load
* Add a test for splitting and reconstituting hugepages during and after
dirty logging
* Add support for CPU pinning in demand paging test
* Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes
included along the way
* Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage
recovery threads (because nx_huge_pages=off can be toggled at runtime)
* Move handling of PAT out of MTRR code and dedup SVM+VMX code
* Fix output of PIC poll command emulation when there's an interrupt
* Add a maintainer's handbook to document KVM x86 processes, preferred coding
style, testing expectations, etc.
* Misc cleanups, fixes and comments
Generic:
* Miscellaneous bugfixes and cleanups
Selftests:
* Generate dependency files so that partial rebuilds work as expected
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmSgHrIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroORcAf+KkBlXwQMf+Q0Hy6Mfe0OtkKmh0Ae
6HJ6dsuMfOHhWv5kgukh+qvuGUGzHq+gpVKmZg2yP3h3cLHOLUAYMCDm+rjXyjsk
F4DbnJLfxq43Pe9PHRKFxxSecRcRYCNox0GD5UYL4PLKcH0FyfQrV+HVBK+GI8L3
FDzUcyJkR12Lcj1qf++7fsbzfOshL0AJPmidQCoc6wkLJpUEr/nYUqlI1Kx3YNuQ
LKmxFHS4l4/O/px3GKNDrLWDbrVlwciGIa3GZLS52PZdW3mAqT+cqcPcYK6SW71P
m1vE80VbNELX5q3YSRoOXtedoZ3Pk97LEmz/xQAsJ/jri0Z5Syk0Ok0m/Q==
=AMXp
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Eager page splitting optimization for dirty logging, optionally
allowing for a VM to avoid the cost of hugepage splitting in the
stage-2 fault path.
- Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact
with services that live in the Secure world. pKVM intervenes on
FF-A calls to guarantee the host doesn't misuse memory donated to
the hyp or a pKVM guest.
- Support for running the split hypervisor with VHE enabled, known as
'hVHE' mode. This is extremely useful for testing the split
hypervisor on VHE-only systems, and paves the way for new use cases
that depend on having two TTBRs available at EL2.
- Generalized framework for configurable ID registers from userspace.
KVM/arm64 currently prevents arbitrary CPU feature set
configuration from userspace, but the intent is to relax this
limitation and allow userspace to select a feature set consistent
with the CPU.
- Enable the use of Branch Target Identification (FEAT_BTI) in the
hypervisor.
- Use a separate set of pointer authentication keys for the
hypervisor when running in protected mode, as the host is untrusted
at runtime.
- Ensure timer IRQs are consistently released in the init failure
paths.
- Avoid trapping CTR_EL0 on systems with Enhanced Virtualization
Traps (FEAT_EVT), as it is a register commonly read from userspace.
- Erratum workaround for the upcoming AmpereOne part, which has
broken hardware A/D state management.
RISC-V:
- Redirect AMO load/store misaligned traps to KVM guest
- Trap-n-emulate AIA in-kernel irqchip for KVM guest
- Svnapot support for KVM Guest
s390:
- New uvdevice secret API
- CMM selftest and fixes
- fix racy access to target CPU for diag 9c
x86:
- Fix missing/incorrect #GP checks on ENCLS
- Use standard mmu_notifier hooks for handling APIC access page
- Drop now unnecessary TR/TSS load after VM-Exit on AMD
- Print more descriptive information about the status of SEV and
SEV-ES during module load
- Add a test for splitting and reconstituting hugepages during and
after dirty logging
- Add support for CPU pinning in demand paging test
- Add support for AMD PerfMonV2, with a variety of cleanups and minor
fixes included along the way
- Add a "nx_huge_pages=never" option to effectively avoid creating NX
hugepage recovery threads (because nx_huge_pages=off can be toggled
at runtime)
- Move handling of PAT out of MTRR code and dedup SVM+VMX code
- Fix output of PIC poll command emulation when there's an interrupt
- Add a maintainer's handbook to document KVM x86 processes,
preferred coding style, testing expectations, etc.
- Misc cleanups, fixes and comments
Generic:
- Miscellaneous bugfixes and cleanups
Selftests:
- Generate dependency files so that partial rebuilds work as
expected"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits)
Documentation/process: Add a maintainer handbook for KVM x86
Documentation/process: Add a label for the tip tree handbook's coding style
KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index
RISC-V: KVM: Remove unneeded semicolon
RISC-V: KVM: Allow Svnapot extension for Guest/VM
riscv: kvm: define vcpu_sbi_ext_pmu in header
RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC
RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip
RISC-V: KVM: Add in-kernel emulation of AIA APLIC
RISC-V: KVM: Implement device interface for AIA irqchip
RISC-V: KVM: Skeletal in-kernel AIA irqchip support
RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero
RISC-V: KVM: Add APLIC related defines
RISC-V: KVM: Add IMSIC related defines
RISC-V: KVM: Implement guest external interrupt line management
KVM: x86: Remove PRIx* definitions as they are solely for user space
s390/uv: Update query for secret-UVCs
s390/uv: replace scnprintf with sysfs_emit
s390/uvdevice: Add 'Lock Secret Store' UVC
...
In the past machine checks where accounted as irq time. With the conversion
to generic entry, it was decided to account machine checks to the current
context. The stckf at the beginning of the machine check handler and the
lowcore member is no longer required, therefore remove it.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Instead of enforcing PSW_MASK_DAT bit on previously stored
in lowcore restart_psw.mask use the PSW_KERNEL_BITS mask
(which contains PSW_MASK_DAT) directly.
As result, the PSW mask stored in lowcore is only used to
enter the CPU restart routine, while PSW_KERNEL_BITS is
used to enter the kernel code - similarily to commit
64ea2977add2 ("s390/mm: start kernel with DAT enabled").
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Include linux/io.h instead of asm/io.h everywhere. linux/io.h includes
asm/io.h, so this shouldn't cause any problems. Instead this might help for
some randconfig build errors which were reported due to some undefined io
related functions.
Also move the changed include so it stays grouped together with other
includes from the same directory.
For ctcm_mpc.c also remove not needed comments (actually questions).
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
top-level directories.
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs.
- Zhen Lei has enhanced kexec's ability to support two crash regions.
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries.
- And the usual bunch of singleton patches in various places.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJelTAAKCRDdBJ7gKXxA
juDkAP0VXWynzkXoojdS/8e/hhi+htedmQ3v2dLZD+vBrctLhAEA7rcH58zAVoWa
2ejqO6wDrRGUC7JQcO9VEjT0nv73UwU=
=F293
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-mm updates from Andrew Morton:
- Arnd Bergmann has fixed a bunch of -Wmissing-prototypes in top-level
directories
- Douglas Anderson has added a new "buddy" mode to the hardlockup
detector. It permits the detector to work on architectures which
cannot provide the required interrupts, by having CPUs periodically
perform checks on other CPUs
- Zhen Lei has enhanced kexec's ability to support two crash regions
- Petr Mladek has done a lot of cleanup on the hard lockup detector's
Kconfig entries
- And the usual bunch of singleton patches in various places
* tag 'mm-nonmm-stable-2023-06-24-19-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits)
kernel/time/posix-stubs.c: remove duplicated include
ocfs2: remove redundant assignment to variable bit_off
watchdog/hardlockup: fix typo in config HARDLOCKUP_DETECTOR_PREFER_BUDDY
powerpc: move arch_trigger_cpumask_backtrace from nmi.h to irq.h
devres: show which resource was invalid in __devm_ioremap_resource()
watchdog/hardlockup: define HARDLOCKUP_DETECTOR_ARCH
watchdog/sparc64: define HARDLOCKUP_DETECTOR_SPARC64
watchdog/hardlockup: make HAVE_NMI_WATCHDOG sparc64-specific
watchdog/hardlockup: declare arch_touch_nmi_watchdog() only in linux/nmi.h
watchdog/hardlockup: make the config checks more straightforward
watchdog/hardlockup: sort hardlockup detector related config values a logical way
watchdog/hardlockup: move SMP barriers from common code to buddy code
watchdog/buddy: simplify the dependency for HARDLOCKUP_DETECTOR_PREFER_BUDDY
watchdog/buddy: don't copy the cpumask in watchdog_next_cpu()
watchdog/buddy: cleanup how watchdog_buddy_check_hardlockup() is called
watchdog/hardlockup: remove softlockup comment in touch_nmi_watchdog()
watchdog/hardlockup: in watchdog_hardlockup_check() use cpumask_copy()
watchdog/hardlockup: don't use raw_cpu_ptr() in watchdog_hardlockup_kick()
watchdog/hardlockup: HAVE_NMI_WATCHDOG must implement watchdog_hardlockup_probe()
watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe fails
...
- Yosry has also eliminated cgroup's atomic rstat flushing.
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability.
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning.
- Lorenzo Stoakes has done some maintanance work on the get_user_pages()
interface.
- Liam Howlett continues with cleanups and maintenance work to the maple
tree code. Peng Zhang also does some work on maple tree.
- Johannes Weiner has done some cleanup work on the compaction code.
- David Hildenbrand has contributed additional selftests for
get_user_pages().
- Thomas Gleixner has contributed some maintenance and optimization work
for the vmalloc code.
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code.
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting.
- Christoph Hellwig has some cleanups for the filemap/directio code.
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the provided
APIs rather than open-coding accesses.
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings.
- John Hubbard has a series of fixes to the MM selftesting code.
- ZhangPeng continues the folio conversion campaign.
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock.
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
128 to 8.
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management.
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code.
- Vishal Moola also has done some folio conversion work.
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
=B7yQ
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
cmd_vdso_check checks if there are any dynamic relocations in
vdso64.so.dbg. When kernel is compiled with
-mno-pic-data-is-text-relative, R_390_RELATIVE relocs are generated and
this results in kernel build error.
kpatch uses -mno-pic-data-is-text-relative option when building the
kernel to prevent relative addressing between code and data. The flag
avoids relocation error when klp text and data are too far apart
kpatch does not patch vdso code and hence the
mno-pic-data-is-text-relative flag is not essential.
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The .align directive has inconsistent behavior across architectures. Use
.balign instead everywhere. This is a no-op for s390, but with this there
is no mix in using .align and .balign anymore.
Future code is supposed to use only .balign.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
When adding an undefined symbol the build still succeeds, but
userspace is crashing trying to execute vdso because the
undefined symbol is not resolved. Add the check for undefined
symbols to prevent this.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
- Fix the style of protected key API driver source: use
x-mas tree for all local variable declarations.
- Rework protected key API driver to not use the struct
pkey_protkey and pkey_clrkey anymore. Both structures
have a fixed size buffer, but with the support of ECC
protected key these buffers are not big enough. Use
dynamic buffers internally and transparently for
userspace.
- Add support for a new 'non CCA clear key token' with
ECC clear keys supported: ECC P256, ECC P384, ECC P521,
ECC ED25519 and ECC ED448. This makes it possible to
derive a protected key from the ECC clear key input via
PKEY_KBLOB2PROTK3 ioctl, while currently the only way
to derive is via PCKMO instruction.
- The s390 PMU of PAI crypto and extension 1 NNPA counters
use atomic_t for reference counting. Replace this with
the proper data type refcount_t.
- Select ARCH_SUPPORTS_INT128, but limit this to clang for
now, since gcc generates inefficient code, which may lead
to stack overflows.
- Replace one-element array with flexible-array member in
struct vfio_ccw_parent and refactor the rest of the code
accordingly. Also, prefer struct_size() over sizeof() open-
coded versions.
- Introduce OS_INFO_FLAGS_ENTRY pointing to a flags field and
OS_INFO_FLAG_REIPL_CLEAR flag that informs a dumper whether
the system memory should be cleared or not once dumped.
- Fix a hang when a user attempts to remove a VFIO-AP mediated
device attached to a guest: add VFIO_DEVICE_GET_IRQ_INFO and
VFIO_DEVICE_SET_IRQS IOCTLs and wire up the VFIO bus driver
callback to request a release of the device.
- Fix calculation for R_390_GOTENT relocations for modules.
- Allow any user space process with CAP_PERFMON capability
read and display the CPU Measurement facility counter sets.
- Rework large statically-defined per-CPU cpu_cf_events data
structure and replace it with dynamically allocated structures
created when a perf_event_open() system call is invoked or
/dev/hwctr device is accessed.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZJsnCRccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8D+RAQCN5CYdql/X8kOMcs4jJvDHEZf8
5CHYfmT2SLNs+zWtLgD/f/C9XXv3El/0wMBvuWSZ+T1nw+imt2cz+FJ+Nor+UQg=
=R7Dm
-----END PGP SIGNATURE-----
Merge tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Fix the style of protected key API driver source: use x-mas tree for
all local variable declarations
- Rework protected key API driver to not use the struct pkey_protkey
and pkey_clrkey anymore. Both structures have a fixed size buffer,
but with the support of ECC protected key these buffers are not big
enough. Use dynamic buffers internally and transparently for
userspace
- Add support for a new 'non CCA clear key token' with ECC clear keys
supported: ECC P256, ECC P384, ECC P521, ECC ED25519 and ECC ED448.
This makes it possible to derive a protected key from the ECC clear
key input via PKEY_KBLOB2PROTK3 ioctl, while currently the only way
to derive is via PCKMO instruction
- The s390 PMU of PAI crypto and extension 1 NNPA counters use atomic_t
for reference counting. Replace this with the proper data type
refcount_t
- Select ARCH_SUPPORTS_INT128, but limit this to clang for now, since
gcc generates inefficient code, which may lead to stack overflows
- Replace one-element array with flexible-array member in struct
vfio_ccw_parent and refactor the rest of the code accordingly. Also,
prefer struct_size() over sizeof() open- coded versions
- Introduce OS_INFO_FLAGS_ENTRY pointing to a flags field and
OS_INFO_FLAG_REIPL_CLEAR flag that informs a dumper whether the
system memory should be cleared or not once dumped
- Fix a hang when a user attempts to remove a VFIO-AP mediated device
attached to a guest: add VFIO_DEVICE_GET_IRQ_INFO and
VFIO_DEVICE_SET_IRQS IOCTLs and wire up the VFIO bus driver callback
to request a release of the device
- Fix calculation for R_390_GOTENT relocations for modules
- Allow any user space process with CAP_PERFMON capability read and
display the CPU Measurement facility counter sets
- Rework large statically-defined per-CPU cpu_cf_events data structure
and replace it with dynamically allocated structures created when a
perf_event_open() system call is invoked or /dev/hwctr device is
accessed
* tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpum_cf: rework PER_CPU_DEFINE of struct cpu_cf_events
s390/cpum_cf: open access to hwctr device for CAP_PERFMON privileged process
s390/module: fix rela calculation for R_390_GOTENT
s390/vfio-ap: wire in the vfio_device_ops request callback
s390/vfio-ap: realize the VFIO_DEVICE_SET_IRQS ioctl
s390/vfio-ap: realize the VFIO_DEVICE_GET_IRQ_INFO ioctl
s390/pkey: add support for ecc clear key
s390/pkey: do not use struct pkey_protkey
s390/pkey: introduce reverse x-mas trees
s390/zcore: conditionally clear memory on reipl
s390/ipl: add REIPL_CLEAR flag to os_info
vfio/ccw: use struct_size() helper
vfio/ccw: replace one-element array with flexible-array member
s390: select ARCH_SUPPORTS_INT128
s390/pai_ext: replace atomic_t with refcount_t
s390/pai_crypto: replace atomic_t with refcount_t
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double().
The cmpxchg128() family of functions is basically & functionally
the same as cmpxchg_double(), but with a saner interface: instead
of a 6-parameter horror that forced u128 - u64/u64-halves layout
details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128 types.
- Restructure the generated atomic headers, and add
kerneldoc comments for all of the generic atomic{,64,_long}_t
operations. Generated definitions are much cleaner now,
and come with documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering
when taking multiple locks of the same type. This gets rid of
one use of lockdep_set_novalidate_class() in the bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended
variable shadowing generating garbage code on Clang on certain
ARM builds.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSav3wRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gDyxAAjCHQjpolrre7fRpyiTDwqzIKT27H04vQ
zrQVlVc42WBnn9pe8LthGy43/RvYvqlZvLoLONA4fMkuYriM6nSMsoZjeUmE+6Rs
QAElQC74P5YvEBOa67VNY3/M7sj22ftDe7ODtVV8OrnPjMk1sQNRvaK025Cs3yig
8MAI//hHGNmyVAp1dPYZMJNqxGCvluReLZ4SaUJFCMrg7YgUXgCBj/5Gi07TlKxn
sT8BFCssoEW/B9FXkh59B1t6FBCZoSy4XSZfsZe0uVAUJ4XDEOO+zBgaWFCedNQT
wP323ryBgMrkzUKA8j2/o5d3QnMA1GcBfHNNlvAl/fOfrxWXzDZnOEY26YcaLMa0
YIuRF/JNbPZlt6DCUVBUEvMPpfNYi18dFN0rat1a6xL2L4w+tm55y3mFtSsg76Ka
r7L2nWlRrAGXnuA+VEPqkqbSWRUSWOv5hT2Mcyb5BqqZRsxBETn6G8GVAzIO6j6v
giyfUdA8Z9wmMZ7NtB6usxe3p1lXtnZ/shCE7ZHXm6xstyZrSXaHgOSgAnB9DcuJ
7KpGIhhSODQSwC/h/J0KEpb9Pr/5jCWmXAQ2DWnZK6ndt1jUfFi8pfK58wm0AuAM
o9t8Mx3o8wZjbMdt6up9OIM1HyFiMx2BSaZK+8f/bWemHQ0xwez5g4k5O5AwVOaC
x9Nt+Tp0Ze4=
=DsYj
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
- Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of higher-frequency
SMT cores and lower-frequency non-SMT cores), under the old code
lower-priority CPUs pulled tasks from the higher-priority cores if
more than one SMT sibling was busy - resulting in many unnecessary
task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores with more
than one busy sibling and allows lower-priority CPUs to pull tasks, which
avoids superfluous migrations and lets lower-priority cores inspect all SMT
siblings for the busiest queue.
- Implement the 'runnable boosting' feature in the EAS balancer: consider CPU
contention in frequency, EAS max util & load-balance busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves other key
workloads unchanged.
- Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it
into the build_sched_topology() helper function and building
it dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
- Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations.
- Fix task_struct::saved_state handling.
- Fix various rq clock update bugs, unearthed by turning on the rq clock
debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger by
creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
- Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation
to (maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSatWQRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j62xAAuGOx1LcDfRGC6WGQzp1zOdlsVQtnDvlS
qL58zYSHgizprpVQ3j87SBaG4CHCdvd2Bo36yW0lNZS4nd203qdq7fkrMb3hPP/w
egUQUzMegf5fF6BWldKeMjuHSt+twFQz/ZAKK8iSbAir6CHNAqbNst1oL0i/+Tyk
o33hBs1hT5tnbFb1NSVZkX4k+qT3LzTW4K2QgjjGtkScr6yHh2BdEVefyigWOjdo
9s02d00ll9a2r+F5txlN7Dnw6TN7rmTXGMOJU5bZvBE90/anNiAorMXHJdEKCyUR
u9+JtBdJWiCplGa/tSRcxT16ZW1VdtTnd9q66TDhXREd2UNDFqBEyg5Wl77K4Tlf
vKFajmj/to+cTbuv6m6TVR+zyXpdEpdL6F04P44U3qiJvDobBqeDNKHHIqpmbHXl
AXUXcPWTVAzXX1Ce5M+BeAgTBQ1T7C5tELILrTNQHJvO1s9VVBRFZ/l65Ps4vu7T
wIZ781IFuopk0zWqHovNvgKrJ7oFmOQQZFttQEe8n6nafkjI7u+IZ8FayiGaUMRr
4GawFGUCEdYh8z9qyslGKe8Q/Rphfk6hxMFRYUJpDmubQ0PkMeDjDGq77jDGl1PF
VqwSDEyOaBJs7Gqf/mem00JtzBmXhkhm1SEjggHMI2IQbr/eeBXoLQOn3CDapO/N
PiDbtX760ic=
=EWQA
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Scheduler SMP load-balancer improvements:
- Avoid unnecessary migrations within SMT domains on hybrid systems.
Problem:
On hybrid CPU systems, (processors with a mixture of
higher-frequency SMT cores and lower-frequency non-SMT cores),
under the old code lower-priority CPUs pulled tasks from the
higher-priority cores if more than one SMT sibling was busy -
resulting in many unnecessary task migrations.
Solution:
The new code improves the load balancer to recognize SMT cores
with more than one busy sibling and allows lower-priority CPUs
to pull tasks, which avoids superfluous migrations and lets
lower-priority cores inspect all SMT siblings for the busiest
queue.
- Implement the 'runnable boosting' feature in the EAS balancer:
consider CPU contention in frequency, EAS max util & load-balance
busiest CPU selection.
This improves CPU utilization for certain workloads, while leaves
other key workloads unchanged.
Scheduler infrastructure improvements:
- Rewrite the scheduler topology setup code by consolidating it into
the build_sched_topology() helper function and building it
dynamically on the fly.
- Resolve the local_clock() vs. noinstr complications by rewriting
the code: provide separate sched_clock_noinstr() and
local_clock_noinstr() functions to be used in instrumentation code,
and make sure it is all instrumentation-safe.
Fixes:
- Fix a kthread_park() race with wait_woken()
- Fix misc wait_task_inactive() bugs unearthed by the -rt merge:
- Fix UP PREEMPT bug by unifying the SMP and UP implementations
- Fix task_struct::saved_state handling
- Fix various rq clock update bugs, unearthed by turning on the rq
clock debugging code.
- Fix the PSI WINDOW_MIN_US trigger limit, which was easy to trigger
by creating enough cgroups, by removing the warnign and restricting
window size triggers to PSI file write-permission or
CAP_SYS_RESOURCE.
- Propagate SMT flags in the topology when removing degenerate domain
- Fix grub_reclaim() calculation bug in the deadline scheduler code
- Avoid resetting the min update period when it is unnecessary, in
psi_trigger_destroy().
- Don't balance a task to its current running CPU in load_balance(),
which was possible on certain NUMA topologies with overlapping
groups.
- Fix the sched-debug printing of rq->nr_uninterruptible
Cleanups:
- Address various -Wmissing-prototype warnings, as a preparation to
(maybe) enable this warning in the future.
- Remove unused code
- Mark more functions __init
- Fix shadow-variable warnings"
* tag 'sched-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
sched/core: Avoid double calling update_rq_clock() in __balance_push_cpu_stop()
sched/core: Fixed missing rq clock update before calling set_rq_offline()
sched/deadline: Update GRUB description in the documentation
sched/deadline: Fix bandwidth reclaim equation in GRUB
sched/wait: Fix a kthread_park race with wait_woken()
sched/topology: Mark set_sched_topology() __init
sched/fair: Rename variable cpu_util eff_util
arm64/arch_timer: Fix MMIO byteswap
sched/fair, cpufreq: Introduce 'runnable boosting'
sched/fair: Refactor CPU utilization functions
cpuidle: Use local_clock_noinstr()
sched/clock: Provide local_clock_noinstr()
x86/tsc: Provide sched_clock_noinstr()
clocksource: hyper-v: Provide noinstr sched_clock()
clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX
x86/vdso: Fix gettimeofday masking
math64: Always inline u128 version of mul_u64_u64_shr()
s390/time: Provide sched_clock_noinstr()
loongarch: Provide noinstr sched_clock_read()
...
- Use correct type for size of memory allocated for ELF core
header on kernel crash.
- Fix insecure W+X mapping warning when KASAN shadow memory
range is not aligned on page boundary.
- Avoid allocation of short by one page KASAN shadow memory
when the original memory range is less than (PAGE_SIZE << 3).
- Fix virtual vs physical address confusion in physical memory
enumerator. It is not a real issue, since virtual and physical
addresses are currently the same.
- Set CONFIG_NET_TC_SKB_EXT=y in s390 config files as it is
required for offloading TC as well as bridges on switchdev
capable ConnectX devices.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZJWgRRccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8LuPAQDaL4mjPxKvleuiTeH1hf4id48X
i5UZSf6mAirKwyo4WAEA0sAIvhdJlC8+MBEC1QtYkIJyoxmSg6AsMH5MJL+61wA=
=mJ2Q
-----END PGP SIGNATURE-----
Merge tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Use correct type for size of memory allocated for ELF core header on
kernel crash.
- Fix insecure W+X mapping warning when KASAN shadow memory range is
not aligned on page boundary.
- Avoid allocation of short by one page KASAN shadow memory when the
original memory range is less than (PAGE_SIZE << 3).
- Fix virtual vs physical address confusion in physical memory
enumerator. It is not a real issue, since virtual and physical
addresses are currently the same.
- Set CONFIG_NET_TC_SKB_EXT=y in s390 config files as it is required
for offloading TC as well as bridges on switchdev capable ConnectX
devices.
* tag 's390-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/defconfigs: set CONFIG_NET_TC_SKB_EXT=y
s390/boot: fix physmem_info virtual vs physical address confusion
s390/kasan: avoid short by one page shadow memory
s390/kasan: fix insecure W+X mapping warning
s390/crash: use the correct type for memory allocation
Struct cpu_cf_events is a large data structure and is statically defined
for each possible CPU. Rework this and replace it by dynamically
allocated data structures created when a perf_event_open() system call
is invoked or an access via character device /dev/hwctr takes place.
It is replaced by an array of pointers to all possible CPUs and
reference counting. The array of pointers is allocated when the first
event is created. For each online CPU an event is installed on, a struct
cpu_cf_events is allocated and a pointer to struct cpu_cf_events is
stored in the array:
CPU 0 1 2 3 ... N
+---+---+---+---+---+---+
cpu_cf_root::cpucf--> | * | | | |...| |
+-|-+---+---+---+---+---+
|
|
\|/
+-------------+
|cpu_cf_events|
| |
+-------------+
With this approach the large data structure is only allocated when
an event is actually installed and used.
Also implement proper reference counting for allocation and removal.
During interrupt processing make sure the pointer to cpu_cf_events
is valid. The interrupt handler is shared and might be called when
no event is active.
This requires checking for a valid pointer to struct cpu_cf_events.
When the pointer to the per-cpu cpu_cf_events is NULL, simply return.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The device /dev/hwctr was introduced to access complete
CPU Measurement facility counter sets via an ioctl system call.
The access the to device is limited to privileged processes
running as root or superuser. The capability CAP_SYS_ADMIN
is required. The device permissions are read/write for the
device owner root. There is no need for this restriction.
Make the device access permission read/write for all and
reduce the capabilities to CAP_PERFMON.
Any user space program with the CAP_PERFMON capability assigned to it
can now read and display the CPU Measurement facility counter sets.
For more details on perf tool usage and security, see linux
documentation in Documentation/admin-guide/perf-security.rst.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
During module load, module layout allocation occurs by initially
allowing the architecture to frob the sections. This is performed via
module_frob_arch_sections().
However, the size of each module memory types like text,data,rodata etc
are updated correctly only after layout_sections().
After calculation of required module memory sizes for each types,
move_module() is responsible for allocating the module memory for each
type from modules vaddr range.
Considering the sequence above, module_frob_arch_sections() updates the
module mod_arch_specific got_offset before module memory text type size
is fully updated in layout_sections(). Hence mod_arch_specific
got_offset points to currently zero.
As per s390 ABI,
R_390_GOTENT : (G + O + A - P) >> 1
where
G=me->mem[MOD_TEXT].base+me->arch.got_offset
O=info->got_offset
A=rela->r_addend
P=loc
fix R_390_GOTENT calculation in apply_rela().
Note: currently this doesn't break anything because me->arch.got_offset
is zero. However, reordering of functions in the future could break it.
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.
Link: https://lkml.kernel.org/r/3ff29363-336a-9733-12a1-5c31a45c8aeb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update the query struct such that secret-UVC related
information can be parsed.
Add sysfs files for these new values.
'supp_add_secret_req_ver' notes the supported versions for the
Add Secret UVC. Bit 0 indicates that version 0x100 is supported,
bit 1 indicates 0x200, and so on.
'supp_add_secret_pcf' notes the supported plaintext flags for
the Add Secret UVC.
'supp_secret_types' notes the supported types of secrets.
Bit 0 indicates secret type 1, bit 1 indicates type 2, and so on.
'max_secrets' notes the maximum amount of secrets the secret store can
store per pv guest.
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230615100533.3996107-8-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20230615100533.3996107-8-seiden@linux.ibm.com>
KVM needs the struct's values to be able to provide PV support.
The uvdevice is currently guest only and will need the struct's values
for call support checking and potential future expansions.
As uv.c is only compiled with CONFIG_PGSTE or
CONFIG_PROTECTED_VIRTUALIZATION_GUEST we don't need a second check in
the code. Users of uv_info will need to fence for these two config
options for the time being.
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20230615100533.3996107-2-seiden@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20230615100533.3996107-2-seiden@linux.ibm.com>
The init/main.c file contains some extern declarations for functions
defined in architecture code, and it defines some other functions that are
called from architecture code with a custom prototype. Both of those
result in warnings with 'make W=1':
init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes]
init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes]
arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes]
arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes]
arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes]
init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes]
kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes]
Add prototypes for all of these in include/linux/init.h or another
appropriate header, and remove the duplicate declarations from
architecture specific code.
[sfr@canb.auug.org.au: declare time_init_early()]
Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au
Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cachestat is previously only wired in for x86 (and architectures using
the generic unistd.h table):
https://lore.kernel.org/lkml/20230503013608.2431726-1-nphamcs@gmail.com/
This patch wires cachestat in for all the other architectures.
[nphamcs@gmail.com: wire up cachestat for arm64]
Link: https://lkml.kernel.org/r/20230511092843.3896327-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230510195806.2902878-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With the intent to provide local_clock_noinstr(), a variant of
local_clock() that's safe to be called from noinstr code (with the
assumption that any such code will already be non-preemptible),
prepare for things by providing a noinstr sched_clock_noinstr()
function.
Specifically, preempt_enable_*() calls out to schedule(), which upsets
noinstr validation efforts.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V
Link: https://lore.kernel.org/r/20230519102715.570170436@infradead.org
Now that there is a cross arch u128 and cmpxchg128(), use those
instead of the custom CDSG helper.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230531132324.058821078@infradead.org
Introduce new OS_INFO_FLAGS_ENTRY to os_info pointing to the field
with bit flags.
Add OS_INFO_FLAGS_ENTRY upon dump_reipl shutdown action processing and
set OS_INFO_FLAG_REIPL_CLEAR flag indicating 'clear' sysfs attribute has
been set on the panicked system for specified ipl type. This flag can be
used to inform the dumper whether LOAD_CLEAR or LOAD_NORMAL diag308
subcode to be used for ipl after dumping the memory.
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
- Add check whether the required facilities are installed
before using the s390-specific ChaCha20 implementation.
- Key blobs for s390 protected key interface IOCTLs commands
PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key
material. Zeroize copies of these keys in kernel memory
after creating protected keys.
- Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra
overhead of initializing all stack variables by default.
- Make sure that when a new channel-path is enabled all
subchannels are evaluated: with and without any devices
connected on it.
- When SMT thread CPUs are added to CPU topology masks the
nr_cpu_ids limit is not checked and could be exceeded.
Respect the nr_cpu_ids limit and avoid a warning when
CONFIG_DEBUG_PER_CPU_MAPS is set.
- The pointer to IPL Parameter Information Block is stored
in the absolute lowcore as a virtual address. Save it as
the physical address for later use by dump tools.
- Fix a Queued Direct I/O (QDIO) problem on z/VM guests using
QIOASSIST with dedicated (pass through) QDIO-based devices
such as FCP, real OSA or HiperSockets.
- s390's struct statfs and struct statfs64 contain padding,
which field-by-field copying does not set. Initialize the
respective structures with zeros before filling them and
copying to userspace.
- Grow s390 compat_statfs64, statfs and statfs64 structures
f_spare array member to cover padding and simplify things.
- Remove obsolete SCHED_BOOK and SCHED_DRAWER configs.
- Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCZGd5BRccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8OqMAQCsdBG7eR3dp3mY8ao34dqlWt98
rDQD8oiMgCkFyn77jQEAoo3HhqWY8oTu88fl82dkF0OpGW+7zgoNHUYhH8Z0gAY=
=wtTO
-----END PGP SIGNATURE-----
Merge tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- Add check whether the required facilities are installed before using
the s390-specific ChaCha20 implementation
- Key blobs for s390 protected key interface IOCTLs commands
PKEY_VERIFYKEY2 and PKEY_VERIFYKEY3 may contain clear key material.
Zeroize copies of these keys in kernel memory after creating
protected keys
- Set CONFIG_INIT_STACK_NONE=y in defconfigs to avoid extra overhead of
initializing all stack variables by default
- Make sure that when a new channel-path is enabled all subchannels are
evaluated: with and without any devices connected on it
- When SMT thread CPUs are added to CPU topology masks the nr_cpu_ids
limit is not checked and could be exceeded. Respect the nr_cpu_ids
limit and avoid a warning when CONFIG_DEBUG_PER_CPU_MAPS is set
- The pointer to IPL Parameter Information Block is stored in the
absolute lowcore as a virtual address. Save it as the physical
address for later use by dump tools
- Fix a Queued Direct I/O (QDIO) problem on z/VM guests using QIOASSIST
with dedicated (pass through) QDIO-based devices such as FCP, real
OSA or HiperSockets
- s390's struct statfs and struct statfs64 contain padding, which
field-by-field copying does not set. Initialize the respective
structures with zeros before filling them and copying to userspace
- Grow s390 compat_statfs64, statfs and statfs64 structures f_spare
array member to cover padding and simplify things
- Remove obsolete SCHED_BOOK and SCHED_DRAWER configs
- Remove unneeded S390_CCW_IOMMU and S390_AP_IOM configs
* tag 's390-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/iommu: get rid of S390_CCW_IOMMU and S390_AP_IOMMU
s390/Kconfig: remove obsolete configs SCHED_{BOOK,DRAWER}
s390/uapi: cover statfs padding by growing f_spare
statfs: enforce statfs[64] structure initialization
s390/qdio: fix do_sqbs() inline assembly constraint
s390/ipl: fix IPIB virtual vs physical address confusion
s390/topology: honour nr_cpu_ids when adding CPUs
s390/cio: include subchannels without devices also for evaluation
s390/defconfigs: set CONFIG_INIT_STACK_NONE=y
s390/pkey: zeroize key blobs
s390/crypto: use vector instructions only if available for ChaCha20
These functions are already marked as NOKPROBE to prevent recursion and
we have the same reason to blacklist them if rethook is used with fprobe,
since they are beyond the recursion-free region ftrace can guard.
Link: https://lore.kernel.org/all/20230517034510.15639-5-zegao@tencent.com/
Fixes: f3a112c0c4 ("x86,rethook,kprobes: Replace kretprobe with rethook on x86")
Signed-off-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The pointer to IPL Parameter Information Block is stored
in the absolute lowcore for later use by dump tools. That
pointer is a virtual address, though it should be physical
instead.
Note, this does not fix a real issue, since virtual and
physical addresses are currently the same.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
When SMT thread CPUs are added to CPU masks the nr_cpu_ids
limit is not checked and could be exceeded. This leads to
a warning for example if CONFIG_DEBUG_PER_CPU_MAPS is set
and the command line parameter nr_cpus is set to 1.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The s390 PMU of PAI extension 1 NNPA counters uses atomic_t for
reference counting. Replace this with the proper data type
refcount_t.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The s390 PMU of PAI crypto counters uses atomic_t for reference
counting. Replace this with the proper data type refcount_t.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Fix a potential race in gmap_make_secure() and remove the last user of
follow_page() without FOLL_GET.
The old code is locking something it doesn't have a reference to, and
as explained by Jason and David in this discussion:
https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
it can lead to all kind of bad things, including the page getting
unmapped (MADV_DONTNEED), freed, reallocated as a larger folio and the
unlock_page() would target the wrong bit.
There is also another race with the FOLL_WRITE, which could race
between the follow_page() and the get_locked_pte().
The main point is to remove the last use of follow_page() without
FOLL_GET or FOLL_PIN, removing the races can be considered a nice
bonus.
Link: https://lore.kernel.org/linux-mm/Y9J4P%2FRNvY1Ztn0Q@nvidia.com/
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Fixes: 214d9bbcd3 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-Id: <20230428092753.27913-2-imbrenda@linux.ibm.com>
- Add support for stackleak feature. Also allow specifying
architecture-specific stackleak poison function to enable faster
implementation. On s390, the mvc-based implementation helps decrease
typical overhead from a factor of 3 to just 25%
- Convert all assembler files to use SYM* style macros, deprecating the
ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS
- Improve KASLR to also randomize module and special amode31 code
base load addresses
- Rework decompressor memory tracking to support memory holes and improve
error handling
- Add support for protected virtualization AP binding
- Add support for set_direct_map() calls
- Implement set_memory_rox() and noexec module_alloc()
- Remove obsolete overriding of mem*() functions for KASAN
- Rework kexec/kdump to avoid using nodat_stack to call purgatory
- Convert the rest of the s390 code to use flexible-array member instead
of a zero-length array
- Clean up uaccess inline asm
- Enable ARCH_HAS_MEMBARRIER_SYNC_CORE
- Convert to using CONFIG_FUNCTION_ALIGNMENT and enable
DEBUG_FORCE_FUNCTION_ALIGN_64B
- Resolve last_break in userspace fault reports
- Simplify one-level sysctl registration
- Clean up branch prediction handling
- Rework CPU counter facility to retrieve available counter sets just
once
- Other various small fixes and improvements all over the code
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmRM8pwACgkQjYWKoQLX
FBjV1AgAlvAhu1XkwOdwqdT4GqE8pcN4XXzydog1MYihrSO2PdgWAxpEW7o2QURN
W+3xa6RIqt7nX2YBiwTanMZ12TYaFY7noGl3eUpD/NhueprweVirVl7VZUEuRoW/
j0mbx77xsVzLfuDFxkpVwE6/j+tTO78kLyjUHwcN9rFVUaL7/orJneDJf+V8fZG0
sHLOv0aljF7Jr2IIkw82lCmW/vdk7k0dACWMXK2kj1H3dIK34B9X4AdKDDf/WKXk
/OSElBeZ93tSGEfNDRIda6iR52xocROaRnQAaDtargKFl9VO0/dN9ADxO+SLNHjN
pFE/9VD6xT/xo4IuZZh/Z3TcYfiLvA==
=Geqx
-----END PGP SIGNATURE-----
Merge tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Add support for stackleak feature. Also allow specifying
architecture-specific stackleak poison function to enable faster
implementation. On s390, the mvc-based implementation helps decrease
typical overhead from a factor of 3 to just 25%
- Convert all assembler files to use SYM* style macros, deprecating the
ENTRY() macro and other annotations. Select ARCH_USE_SYM_ANNOTATIONS
- Improve KASLR to also randomize module and special amode31 code base
load addresses
- Rework decompressor memory tracking to support memory holes and
improve error handling
- Add support for protected virtualization AP binding
- Add support for set_direct_map() calls
- Implement set_memory_rox() and noexec module_alloc()
- Remove obsolete overriding of mem*() functions for KASAN
- Rework kexec/kdump to avoid using nodat_stack to call purgatory
- Convert the rest of the s390 code to use flexible-array member
instead of a zero-length array
- Clean up uaccess inline asm
- Enable ARCH_HAS_MEMBARRIER_SYNC_CORE
- Convert to using CONFIG_FUNCTION_ALIGNMENT and enable
DEBUG_FORCE_FUNCTION_ALIGN_64B
- Resolve last_break in userspace fault reports
- Simplify one-level sysctl registration
- Clean up branch prediction handling
- Rework CPU counter facility to retrieve available counter sets just
once
- Other various small fixes and improvements all over the code
* tag 's390-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (118 commits)
s390/stackleak: provide fast __stackleak_poison() implementation
stackleak: allow to specify arch specific stackleak poison function
s390: select ARCH_USE_SYM_ANNOTATIONS
s390/mm: use VM_FLUSH_RESET_PERMS in module_alloc()
s390: wire up memfd_secret system call
s390/mm: enable ARCH_HAS_SET_DIRECT_MAP
s390/mm: use BIT macro to generate SET_MEMORY bit masks
s390/relocate_kernel: adjust indentation
s390/relocate_kernel: use SYM* macros instead of ENTRY(), etc.
s390/entry: use SYM* macros instead of ENTRY(), etc.
s390/purgatory: use SYM* macros instead of ENTRY(), etc.
s390/kprobes: use SYM* macros instead of ENTRY(), etc.
s390/reipl: use SYM* macros instead of ENTRY(), etc.
s390/head64: use SYM* macros instead of ENTRY(), etc.
s390/earlypgm: use SYM* macros instead of ENTRY(), etc.
s390/mcount: use SYM* macros instead of ENTRY(), etc.
s390/crc32le: use SYM* macros instead of ENTRY(), etc.
s390/crc32be: use SYM* macros instead of ENTRY(), etc.
s390/crypto,chacha: use SYM* macros instead of ENTRY(), etc.
s390/amode31: use SYM* macros instead of ENTRY(), etc.
...
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some
major architectures it's not even consistently available.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK438RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jJ5Q/5AZ0HGpyqwdFK8GmGznyu5qjP5HwV9pPq
gZQScqSy4tZEeza4TFMi83CoXSg9uJ7GlYJqqQMKm78LGEPomnZtXXC7oWvTA9M5
M/jAvzytmvZloSCXV6kK7jzSejMHhag97J/BjTYhZYQpJ9T+hNC87XO6J6COsKr9
lPIYqkFrIkQNr6B0U11AQfFejRYP1ics2fnbnZL86G/zZAc6x8EveM3KgSer2iHl
KbrO+xcYyGY8Ef9P2F72HhEGFfM3WslpT1yzqR3sm4Y+fuMG0oW3qOQuMJx0ZhxT
AloterY0uo6gJwI0P9k/K4klWgz81Tf/zLb0eBAtY2uJV9Fo3YhPHuZC7jGPGAy3
JusW2yNYqc8erHVEMAKDUsl/1KN4TE2uKlkZy98wno+KOoMufK5MA2e2kPPqXvUi
Jk9RvFolnWUsexaPmCftti0OCv3YFiviVAJ/t0pchfmvvJA2da0VC9hzmEXpLJVF
25nBTV/1uAOrWvOpCyo3ElrC2CkQVkFmK5rXMDdvf6ib0Nid4vFcCkCSLVfu+ePB
11mi7QYro+CcnOug1K+yKogUDmsZgV/u1kUwgQzTIpZ05Kkb49gUiXw9L2RGcBJh
yoDoiI66KPR7PWQ2qBdQoXug4zfEEtWG0O9HNLB0FFRC3hu7I+HHyiUkBWs9jasK
PA5+V7HcQRk=
=Wp7f
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP cross-CPU function-call updates from Ingo Molnar:
- Remove diagnostics and adjust config for CSD lock diagnostics
- Add a generic IPI-sending tracepoint, as currently there's no easy
way to instrument IPI origins: it's arch dependent and for some major
architectures it's not even consistently available.
* tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
trace,smp: Trace all smp_function_call*() invocations
trace: Add trace_ipi_send_cpu()
sched, smp: Trace smp callback causing an IPI
smp: reword smp call IPI comment
treewide: Trace IPIs sent via smp_send_reschedule()
irq_work: Trace self-IPIs sent via arch_irq_work_raise()
smp: Trace IPIs sent via arch_send_call_function_ipi_mask()
sched, smp: Trace IPIs sent via send_call_function_single_ipi()
trace: Add trace_ipi_send_cpumask()
kernel/smp: Make csdlock_debug= resettable
locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging
locking/csd_lock: Remove added data from CSD lock debugging
locking/csd_lock: Add Kconfig option for csd_debug default
- Mark arch_cpu_idle_dead() __noreturn, make all architectures & drivers that did
this inconsistently follow this new, common convention, and fix all the fallout
that objtool can now detect statically.
- Fix/improve the ORC unwinder becoming unreliable due to UNWIND_HINT_EMPTY ambiguity,
split it into UNWIND_HINT_END_OF_STACK and UNWIND_HINT_UNDEFINED to resolve it.
- Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code.
- Generate ORC data for __pfx code
- Add more __noreturn annotations to various kernel startup/shutdown/panic functions.
- Misc improvements & fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmRK1x0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ghxQ/+IkCynMYtdF5OG9YwbcGJqsPSfOPMEcEM
pUSFYg+gGPBDT/fJfcVSqvUtdnWbLC2kXt9yiswXz3X3J2nmNkBk5YKQftsNDcul
TmKeqIIAK51XTncpegKH0EGnOX63oZ9Vxa8CTPdDlb+YF23Km2FoudGRI9F5qbUd
LoraXqGYeiaeySkGyWmZVl6Uc8dIxnMkTN3H/oI9aB6TOrsi059hAtFcSaFfyemP
c4LqXXCH7k2baiQt+qaLZ8cuZVG/+K5r2N2cmjO5kmJc6ynIaFnfMe4XxZLjp5LT
/PulYI15bXkvSARKx5CRh/CDHMOx5Blw+ASO0RhWbdy0WH4ZhhcaVF5AeIpPW86a
1LBcz97rMp72WmvKgrJeVO1r9+ll4SI6/YKGJRsxsCMdP3hgFpqntXyVjTFNdTM1
0gH6H5v55x06vJHvhtTk8SR3PfMTEM2fRU5jXEOrGowoGifx+wNUwORiwj6LE3KQ
SKUdT19RNzoW3VkFxhgk65ThK1S7YsJUKRoac3YdhttpqqqtFV//erenrZoR4k/p
vzvKy68EQ7RCNyD5wNWNFe0YjeJl5G8gQ8bUm4Xmab7djjgz+pn4WpQB8yYKJLAo
x9dqQ+6eUbw3Hcgk6qQ9E+r/svbulnAL0AeALAWK/91DwnZ2mCzKroFkLN7napKi
fRho4CqzrtM=
=NwEV
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- Mark arch_cpu_idle_dead() __noreturn, make all architectures &
drivers that did this inconsistently follow this new, common
convention, and fix all the fallout that objtool can now detect
statically
- Fix/improve the ORC unwinder becoming unreliable due to
UNWIND_HINT_EMPTY ambiguity, split it into UNWIND_HINT_END_OF_STACK
and UNWIND_HINT_UNDEFINED to resolve it
- Fix noinstr violations in the KCSAN code and the lkdtm/stackleak code
- Generate ORC data for __pfx code
- Add more __noreturn annotations to various kernel startup/shutdown
and panic functions
- Misc improvements & fixes
* tag 'objtool-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/hyperv: Mark hv_ghcb_terminate() as noreturn
scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
btrfs: Mark btrfs_assertfail() __noreturn
objtool: Include weak functions in global_noreturns check
cpu: Mark nmi_panic_self_stop() __noreturn
cpu: Mark panic_smp_self_stop() __noreturn
arm64/cpu: Mark cpu_park_loop() and friends __noreturn
x86/head: Mark *_start_kernel() __noreturn
init: Mark start_kernel() __noreturn
init: Mark [arch_call_]rest_init() __noreturn
objtool: Generate ORC data for __pfx code
x86/linkage: Fix padding for typed functions
objtool: Separate prefix code from stack validation code
objtool: Remove superfluous dead_end_function() check
objtool: Add symbol iteration helpers
objtool: Add WARN_INSN()
scripts/objdump-func: Support multiple functions
context_tracking: Fix KCSAN noinstr violation
objtool: Add stackleak instrumentation to uaccess safe list
...
The summary of the changes for this pull requests is:
* Song Liu's new struct module_memory replacement
* Nick Alcock's MODULE_LICENSE() removal for non-modules
* My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded
prior to allocating the final module memory with vmalloc and the
respective debug code it introduces to help clarify the issue. Although
the functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to have
been picked up. Folks on larger CPU systems with modules will want to
just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details
on this pull request.
The functional change change in this pull request is the very first
patch from Song Liu which replaces the struct module_layout with a new
struct module memory. The old data structure tried to put together all
types of supported module memory types in one data structure, the new
one abstracts the differences in memory types in a module to allow each
one to provide their own set of details. This paves the way in the
future so we can deal with them in a cleaner way. If you look at changes
they also provide a nice cleanup of how we handle these different memory
areas in a module. This change has been in linux-next since before the
merge window opened for v6.3 so to provide more than a full kernel cycle
of testing. It's a good thing as quite a bit of fixes have been found
for it.
Jason Baron then made dynamic debug a first class citizen module user by
using module notifier callbacks to allocate / remove module specific
dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area
is active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without Makefile.modbuiltin
or tristate.conf"). Nick has been working on this *for years* and
AFAICT I was the only one to suggest two alternatives to this approach
for tooling. The complexity in one of my suggested approaches lies in
that we'd need a possible-obj-m and a could-be-module which would check
if the object being built is part of any kconfig build which could ever
lead to it being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0]. A more obvious yet theoretical approach I've
suggested would be to have a tristate in kconfig imply the same new
-DPOSSIBLE_MODULE as well but that means getting kconfig symbol names
mapping to modules always, and I don't think that's the case today. I am
not aware of Nick or anyone exploring either of these options. Quite
recently Josh Poimboeuf has pointed out that live patching, kprobes and
BPF would benefit from resolving some part of the disambiguation as
well but for other reasons. The function granularity KASLR (fgkaslr)
patches were mentioned but Joe Lawrence has clarified this effort has
been dropped with no clear solution in sight [1].
In the meantime removing module license tags from code which could never
be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up,
and so you'll see quite a bit of Nick's patches in other pull
requests for this merge window. I just picked up the stragglers after
rc3. LWN has good coverage on the motivation behind this work [2] and
the typical cross-tree issues he ran into along the way. The only
concrete blocker issue he ran into was that we should not remove the
MODULE_LICENSE() tags from files which have no SPDX tags yet, even if
they can never be modules. Nick ended up giving up on his efforts due
to having to do this vetting and backlash he ran into from folks who
really did *not understand* the core of the issue nor were providing
any alternative / guidance. I've gone through his changes and dropped
the patches which dropped the module license tags where an SPDX
license tag was missing, it only consisted of 11 drivers. To see
if a pull request deals with a file which lacks SPDX tags you
can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above,
but that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but
it demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees,
and I just picked up the slack after rc3 for the last kernel was out.
Those changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on
a systems with over 400 CPUs when KASAN was enabled due to running
out of virtual memory space. Although the functional change only
consists of 3 lines in the patch "module: avoid allocation if module is
already present and ready", proving that this was the best we can
do on the modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been
in linux-next since around rc3 of the last kernel, the actual final
fix for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported
with larger number of CPUs. Userspace is not yet fixed as it is taking
a bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge them,
but I'm currently inclined to just see if userspace can fix this
instead.
[0] https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/
[1] https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com
[2] https://lwn.net/Articles/927569/
[3] https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmRG4m0SHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinQ2oP/0xlvKwJg6Ey8fHZF0qv8VOskE80zoLF
hMazU3xfqLA+1TQvouW1YBxt3jwS3t1Ehs+NrV+nY9Yzcm0MzRX/n3fASJVe7nRr
oqWWQU+voYl5Pw1xsfdp6C8IXpBQorpYby3Vp0MAMoZyl2W2YrNo36NV488wM9KC
jD4HF5Z6xpnPSZTRR7AgW9mo7FdAtxPeKJ76Bch7lH8U6omT7n36WqTw+5B1eAYU
YTOvrjRs294oqmWE+LeebyiOOXhH/yEYx4JNQgCwPdxwnRiGJWKsk5va0hRApqF/
WW8dIqdEnjsa84lCuxnmWgbcPK8cgmlO0rT0DyneACCldNlldCW1LJ0HOwLk9pea
p3JFAsBL7TKue4Tos6I7/4rx1ufyBGGIigqw9/VX5g0Iif+3BhWnqKRfz+p9wiMa
Fl7cU6u7yC68CHu1HBSisK16cYMCPeOnTSd89upHj8JU/t74O6k/ARvjrQ9qmNUt
c5U+OY+WpNJ1nXQydhY/yIDhFdYg8SSpNuIO90r4L8/8jRQYXNG80FDd1UtvVDuy
eq0r2yZ8C0XHSlOT9QHaua/tWV/aaKtyC/c0hDRrigfUrq8UOlGujMXbUnrmrWJI
tLJLAc7ePWAAoZXGSHrt0U27l029GzLwRdKqJ6kkDANVnTeOdV+mmBg9zGh3/Mp6
agiwdHUMVN7X
=56WK
-----END PGP SIGNATURE-----
Merge tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"The summary of the changes for this pull requests is:
- Song Liu's new struct module_memory replacement
- Nick Alcock's MODULE_LICENSE() removal for non-modules
- My cleanups and enhancements to reduce the areas where we vmalloc
module memory for duplicates, and the respective debug code which
proves the remaining vmalloc pressure comes from userspace.
Most of the changes have been in linux-next for quite some time except
the minor fixes I made to check if a module was already loaded prior
to allocating the final module memory with vmalloc and the respective
debug code it introduces to help clarify the issue. Although the
functional change is small it is rather safe as it can only *help*
reduce vmalloc space for duplicates and is confirmed to fix a bootup
issue with over 400 CPUs with KASAN enabled. I don't expect stable
kernels to pick up that fix as the cleanups would have also had to
have been picked up. Folks on larger CPU systems with modules will
want to just upgrade if vmalloc space has been an issue on bootup.
Given the size of this request, here's some more elaborate details:
The functional change change in this pull request is the very first
patch from Song Liu which replaces the 'struct module_layout' with a
new 'struct module_memory'. The old data structure tried to put
together all types of supported module memory types in one data
structure, the new one abstracts the differences in memory types in a
module to allow each one to provide their own set of details. This
paves the way in the future so we can deal with them in a cleaner way.
If you look at changes they also provide a nice cleanup of how we
handle these different memory areas in a module. This change has been
in linux-next since before the merge window opened for v6.3 so to
provide more than a full kernel cycle of testing. It's a good thing as
quite a bit of fixes have been found for it.
Jason Baron then made dynamic debug a first class citizen module user
by using module notifier callbacks to allocate / remove module
specific dynamic debug information.
Nick Alcock has done quite a bit of work cross-tree to remove module
license tags from things which cannot possibly be module at my request
so to:
a) help him with his longer term tooling goals which require a
deterministic evaluation if a piece a symbol code could ever be
part of a module or not. But quite recently it is has been made
clear that tooling is not the only one that would benefit.
Disambiguating symbols also helps efforts such as live patching,
kprobes and BPF, but for other reasons and R&D on this area is
active with no clear solution in sight.
b) help us inch closer to the now generally accepted long term goal
of automating all the MODULE_LICENSE() tags from SPDX license tags
In so far as a) is concerned, although module license tags are a no-op
for non-modules, tools which would want create a mapping of possible
modules can only rely on the module license tag after the commit
8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf").
Nick has been working on this *for years* and AFAICT I was the only
one to suggest two alternatives to this approach for tooling. The
complexity in one of my suggested approaches lies in that we'd need a
possible-obj-m and a could-be-module which would check if the object
being built is part of any kconfig build which could ever lead to it
being part of a module, and if so define a new define
-DPOSSIBLE_MODULE [0].
A more obvious yet theoretical approach I've suggested would be to
have a tristate in kconfig imply the same new -DPOSSIBLE_MODULE as
well but that means getting kconfig symbol names mapping to modules
always, and I don't think that's the case today. I am not aware of
Nick or anyone exploring either of these options. Quite recently Josh
Poimboeuf has pointed out that live patching, kprobes and BPF would
benefit from resolving some part of the disambiguation as well but for
other reasons. The function granularity KASLR (fgkaslr) patches were
mentioned but Joe Lawrence has clarified this effort has been dropped
with no clear solution in sight [1].
In the meantime removing module license tags from code which could
never be modules is welcomed for both objectives mentioned above. Some
developers have also welcomed these changes as it has helped clarify
when a module was never possible and they forgot to clean this up, and
so you'll see quite a bit of Nick's patches in other pull requests for
this merge window. I just picked up the stragglers after rc3. LWN has
good coverage on the motivation behind this work [2] and the typical
cross-tree issues he ran into along the way. The only concrete blocker
issue he ran into was that we should not remove the MODULE_LICENSE()
tags from files which have no SPDX tags yet, even if they can never be
modules. Nick ended up giving up on his efforts due to having to do
this vetting and backlash he ran into from folks who really did *not
understand* the core of the issue nor were providing any alternative /
guidance. I've gone through his changes and dropped the patches which
dropped the module license tags where an SPDX license tag was missing,
it only consisted of 11 drivers. To see if a pull request deals with a
file which lacks SPDX tags you can just use:
./scripts/spdxcheck.py -f \
$(git diff --name-only commid-id | xargs echo)
You'll see a core module file in this pull request for the above, but
that's not related to his changes. WE just need to add the SPDX
license tag for the kernel/module/kmod.c file in the future but it
demonstrates the effectiveness of the script.
Most of Nick's changes were spread out through different trees, and I
just picked up the slack after rc3 for the last kernel was out. Those
changes have been in linux-next for over two weeks.
The cleanups, debug code I added and final fix I added for modules
were motivated by David Hildenbrand's report of boot failing on a
systems with over 400 CPUs when KASAN was enabled due to running out
of virtual memory space. Although the functional change only consists
of 3 lines in the patch "module: avoid allocation if module is already
present and ready", proving that this was the best we can do on the
modules side took quite a bit of effort and new debug code.
The initial cleanups I did on the modules side of things has been in
linux-next since around rc3 of the last kernel, the actual final fix
for and debug code however have only been in linux-next for about a
week or so but I think it is worth getting that code in for this merge
window as it does help fix / prove / evaluate the issues reported with
larger number of CPUs. Userspace is not yet fixed as it is taking a
bit of time for folks to understand the crux of the issue and find a
proper resolution. Worst come to worst, I have a kludge-of-concept [3]
of how to make kernel_read*() calls for modules unique / converge
them, but I'm currently inclined to just see if userspace can fix this
instead"
Link: https://lore.kernel.org/all/Y/kXDqW+7d71C4wz@bombadil.infradead.org/ [0]
Link: https://lkml.kernel.org/r/025f2151-ce7c-5630-9b90-98742c97ac65@redhat.com [1]
Link: https://lwn.net/Articles/927569/ [2]
Link: https://lkml.kernel.org/r/20230414052840.1994456-3-mcgrof@kernel.org [3]
* tag 'modules-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (121 commits)
module: add debugging auto-load duplicate module support
module: stats: fix invalid_mod_bytes typo
module: remove use of uninitialized variable len
module: fix building stats for 32-bit targets
module: stats: include uapi/linux/module.h
module: avoid allocation if module is already present and ready
module: add debug stats to help identify memory pressure
module: extract patient module check into helper
modules/kmod: replace implementation with a semaphore
Change DEFINE_SEMAPHORE() to take a number argument
module: fix kmemleak annotations for non init ELF sections
module: Ignore L0 and rename is_arm_mapping_symbol()
module: Move is_arm_mapping_symbol() to module_symbol.h
module: Sync code of is_arm_mapping_symbol()
scripts/gdb: use mem instead of core_layout to get the module address
interconnect: remove module-related code
interconnect: remove MODULE_LICENSE in non-modules
zswap: remove MODULE_LICENSE in non-modules
zpool: remove MODULE_LICENSE in non-modules
x86/mm/dump_pagetables: remove MODULE_LICENSE in non-modules
...
Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening in
the driver core in the quest to be able to move "struct bus" and "struct
class" into read-only memory, a task now complete with these changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules for
all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most of
them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZEp7Sw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykitQCfamUHpxGcKOAGuLXMotXNakTEsxgAoIquENm5
LEGadNS38k5fs+73UaxV
=7K4B
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the large set of driver core changes for 6.4-rc1.
Once again, a busy development cycle, with lots of changes happening
in the driver core in the quest to be able to move "struct bus" and
"struct class" into read-only memory, a task now complete with these
changes.
This will make the future rust interactions with the driver core more
"provably correct" as well as providing more obvious lifetime rules
for all busses and classes in the kernel.
The changes required for this did touch many individual classes and
busses as many callbacks were changed to take const * parameters
instead. All of these changes have been submitted to the various
subsystem maintainers, giving them plenty of time to review, and most
of them actually did so.
Other than those changes, included in here are a small set of other
things:
- kobject logging improvements
- cacheinfo improvements and updates
- obligatory fw_devlink updates and fixes
- documentation updates
- device property cleanups and const * changes
- firwmare loader dependency fixes.
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
device property: make device_property functions take const device *
driver core: update comments in device_rename()
driver core: Don't require dynamic_debug for initcall_debug probe timing
firmware_loader: rework crypto dependencies
firmware_loader: Strip off \n from customized path
zram: fix up permission for the hot_add sysfs file
cacheinfo: Add use_arch[|_cache]_info field/function
arch_topology: Remove early cacheinfo error message if -ENOENT
cacheinfo: Check cache properties are present in DT
cacheinfo: Check sib_leaf in cache_leaves_are_shared()
cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
cacheinfo: Add arm64 early level initializer implementation
cacheinfo: Add arch specific early level initializer
tty: make tty_class a static const structure
driver core: class: remove struct class_interface * from callbacks
driver core: class: mark the struct class in struct class_interface constant
driver core: class: make class_register() take a const *
driver core: class: mark class_release() as taking a const *
driver core: remove incorrect comment for device_create*
MIPS: vpe-cmp: remove module owner pointer from struct class usage.
...
ACPI:
* Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
* Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
* Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
* Cleanups to the way in which CPU features are identified from the
ID register fields
* Extend system register definition generation to handle Enum types
when defining shared register fields
* Generate definitions for new _EL2 registers and add new fields
for ID_AA64PFR1_EL1
* Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
* Support for "direct calls" in ftrace, which enables BPF tracing
for arm64
Kdump:
* Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce
TLB pressure when a crashkernel is loaded.
Memory management:
* Try again to remove data cache invalidation from the coherent DMA
allocation path
* Simplify the fixmap code by mapping at page granularity
* Allow the kfence pool to be allocated early, preventing the rest
of the linear mapping from being forced to page granularity
Perf and PMU:
* Move CPU PMU code out to drivers/perf/ where it can be reused
by the 32-bit ARM architecture when running on ARMv8 CPUs
* Fix race between CPU PMU probing and pKVM host de-privilege
* Add support for Apple M2 CPU PMU
* Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
* Minor fixes and cleanups to system PMU drivers
Stack tracing:
* Use the XPACLRI instruction to strip PAC from pointers, rather
than rolling our own function in C
* Remove redundant PAC removal for toolchains that handle this in
their builtins
* Make backtracing more resilient in the face of instrumentation
Miscellaneous:
* Fix single-step with KGDB
* Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
* Minor fixes and cleanups across the board
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmRChcwQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNCgBCADFvkYY9ESztSnd3EpiMbbAzgRCQBiA5H7U
F2Wc+hIWgeAeUEttSH22+F16r6Jb0gbaDvsuhtN2W/rwQhKNbCU0MaUME05MPmg2
AOp+RZb2vdT5i5S5dC6ZM6G3T6u9O78LBWv2JWBdd6RIybamEn+RL00ep2WAduH7
n1FgTbsKgnbScD2qd4K1ejZ1W/BQMwYulkNpyTsmCIijXM12lkzFlxWnMtky3uhR
POpawcIZzXvWI02QAX+SIdynGChQV3VP+dh9GuFbt7ASigDEhgunvfUYhZNSaqf4
+/q0O8toCtmQJBUhF0DEDSB5T8SOz5v9CKxKuwfaX6Trq0ixFQpZ
=78L9
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"ACPI:
- Improve error reporting when failing to manage SDEI on AGDI device
removal
Assembly routines:
- Improve register constraints so that the compiler can make use of
the zero register instead of moving an immediate #0 into a GPR
- Allow the compiler to allocate the registers used for CAS
instructions
CPU features and system registers:
- Cleanups to the way in which CPU features are identified from the
ID register fields
- Extend system register definition generation to handle Enum types
when defining shared register fields
- Generate definitions for new _EL2 registers and add new fields for
ID_AA64PFR1_EL1
- Allow SVE to be disabled separately from SME on the kernel
command-line
Tracing:
- Support for "direct calls" in ftrace, which enables BPF tracing for
arm64
Kdump:
- Don't bother unmapping the crashkernel from the linear mapping,
which then allows us to use huge (block) mappings and reduce TLB
pressure when a crashkernel is loaded.
Memory management:
- Try again to remove data cache invalidation from the coherent DMA
allocation path
- Simplify the fixmap code by mapping at page granularity
- Allow the kfence pool to be allocated early, preventing the rest of
the linear mapping from being forced to page granularity
Perf and PMU:
- Move CPU PMU code out to drivers/perf/ where it can be reused by
the 32-bit ARM architecture when running on ARMv8 CPUs
- Fix race between CPU PMU probing and pKVM host de-privilege
- Add support for Apple M2 CPU PMU
- Adjust the generic PERF_COUNT_HW_BRANCH_INSTRUCTIONS event
dynamically, depending on what the CPU actually supports
- Minor fixes and cleanups to system PMU drivers
Stack tracing:
- Use the XPACLRI instruction to strip PAC from pointers, rather than
rolling our own function in C
- Remove redundant PAC removal for toolchains that handle this in
their builtins
- Make backtracing more resilient in the face of instrumentation
Miscellaneous:
- Fix single-step with KGDB
- Remove harmless warning when 'nokaslr' is passed on the kernel
command-line
- Minor fixes and cleanups across the board"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (72 commits)
KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege
arm64: kexec: include reboot.h
arm64: delete dead code in this_cpu_set_vectors()
arm64/cpufeature: Use helper macro to specify ID register for capabilites
drivers/perf: hisi: add NULL check for name
drivers/perf: hisi: Remove redundant initialized of pmu->name
arm64/cpufeature: Consistently use symbolic constants for min_field_value
arm64/cpufeature: Pull out helper for CPUID register definitions
arm64/sysreg: Convert HFGITR_EL2 to automatic generation
ACPI: AGDI: Improve error reporting for problems during .remove()
arm64: kernel: Fix kernel warning when nokaslr is passed to commandline
perf/arm-cmn: Fix port detection for CMN-700
arm64: kgdb: Set PSTATE.SS to 1 to re-enable single-step
arm64: move PAC masks to <asm/pointer_auth.h>
arm64: use XPACLRI to strip PAC
arm64: avoid redundant PAC stripping in __builtin_return_address()
arm64/sme: Fix some comments of ARM SME
arm64/signal: Alloc tpidr2 sigframe after checking system_supports_tpidr2()
arm64/signal: Use system_supports_tpidr2() to check TPIDR2
arm64/idreg: Don't disable SME when disabling SVE
...
- Improve the VDSO build time checks to cover all dynamic relocations
VDSO does not allow dynamic relcations, but the build time check is
incomplete and fragile.
It's based on architectures specifying the relocation types to search
for and does not handle R_*_NONE relocation entries correctly.
R_*_NONE relocations are injected by some GNU ld variants if they fail
to determine the exact .rel[a]/dyn_size to cover trailing zeros.
R_*_NONE relocations must be ignored by dynamic loaders, so they
should be ignored in the build time check too.
Remove the architecture specific relocation types to check for and
validate strictly that no other relocations than R_*_NONE end up
in the VSDO .so file.
- Prefer signal delivery to the current thread for
CLOCK_PROCESS_CPUTIME_ID based posix-timers
Such timers prefer to deliver the signal to the main thread of a
process even if the context in which the timer expires is the current
task. This has the downside that it might wake up an idle thread.
As there is no requirement or guarantee that the signal has to be
delivered to the main thread, avoid this by preferring the current
task if it is part of the thread group which shares sighand.
This not only avoids waking idle threads, it also distributes the
signal delivery in case of multiple timers firing in the context
of different threads close to each other better.
- Align the tick period properly (again)
For a long time the tick was starting at CLOCK_MONOTONIC zero, which
allowed users space applications to either align with the tick or to
place a periodic computation so that it does not interfere with the
tick. The alignement of the tick period was more by chance than by
intention as the tick is set up before a high resolution clocksource is
installed, i.e. timekeeping is still tick based and the tick period
advances from there.
The early enablement of sched_clock() broke this alignement as the time
accumulated by sched_clock() is taken into account when timekeeping is
initialized. So the base value now(CLOCK_MONOTONIC) is not longer a
multiple of tick periods, which breaks applications which relied on
that behaviour.
Cure this by aligning the tick starting point to the next multiple of
tick periods, i.e 1000ms/CONFIG_HZ.
- A set of NOHZ fixes and enhancements
- Cure the concurrent writer race for idle and IO sleeptime statistics
The statitic values which are exposed via /proc/stat are updated from
the CPU local idle exit and remotely by cpufreq, but that happens
without any form of serialization. As a consequence sleeptimes can be
accounted twice or worse.
Prevent this by restricting the accumulation writeback to the CPU
local idle exit and let the remote access compute the accumulated
value.
- Protect idle/iowait sleep time with a sequence count
Reading idle/iowait sleep time, e.g. from /proc/stat, can race with
idle exit updates. As a consequence the readout may result in random
and potentially going backwards values.
Protect this by a sequence count, which fixes the idle time
statistics issue, but cannot fix the iowait time problem because
iowait time accounting races with remote wake ups decrementing the
remote runqueues nr_iowait counter. The latter is impossible to fix,
so the only way to deal with that is to document it properly and to
remove the assertion in the selftest which triggers occasionally due
to that.
- Restructure struct tick_sched for better cache layout
- Some small cleanups and a better cache layout for struct tick_sched
- Implement the missing timer_wait_running() callback for POSIX CPU timers
For unknown reason the introduction of the timer_wait_running() callback
missed to fixup posix CPU timers, which went unnoticed for almost four
years.
While initially only targeted to prevent livelocks between a timer
deletion and the timer expiry function on PREEMPT_RT enabled kernels, it
turned out that fixing this for mainline is not as trivial as just
implementing a stub similar to the hrtimer/timer callbacks.
The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems
there is a livelock issue independent of RT.
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers
out from hard interrupt context to task work, which is handled before
returning to user space or to a VM. The expiry mechanism moves the
expired timers to a stack local list head with sighand lock held. Once
sighand is dropped the task can be preempted and a task which wants to
delete a timer will spin-wait until the expiry task is scheduled back
in. In the worst case this will end up in a livelock when the preempting
task and the expiry task are pinned on the same CPU.
The timer wheel has a timer_wait_running() mechanism for RT, which uses
a per CPU timer-base expiry lock which is held by the expiry code and the
task waiting for the timer function to complete blocks on that lock.
This does not work in the same way for posix CPU timers as there is no
timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry lock
can be used too in a slightly different way.
Add a per task mutex to struct posix_cputimers_work, let the expiry task
hold it accross the expiry function and let the deleting task which
waits for the expiry to complete block on the mutex.
In the non-contended case this results in an extra mutex_lock()/unlock()
pair on both sides.
This avoids spin-waiting on a task which is scheduled out, prevents the
livelock and cures the problem for RT and !RT systems.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGrj4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZhdEAC/lwfDWCnTXHC8ExQQRDIVNyXmDlLb
EHB8ZY7Wc4gNZ8UEXEOLOXJHMG9bsbtPGctVewJwRGnXZWKVhpPwQba6kCRycyX0
0J6l5DlvUaGGrpoOzOZwgETRmtIZE9tEArZR8xlfRScYd93a7yLhwIjO8JaV9vKs
IQpAQMeJ/ysp6gHrS59qakYfoHU/ERUAu3Tk4GqHUtPtcyz3nX3eTlLWV8LySqs+
00qr2yc0bQFUFoKzTCxtM8lcEi9ja9SOj1rw28348O+BXE4d0HC12Ie7eU/CDN2Y
OAlWYxVjy4LMh24LDrRQKTzoVqx9MXDx2g+09B3t8NK5LgeS+EJIjujDhZF147/H
5y906nplZUKa8BiZW5Rpm/HKH8tFI80T9XWSQCRBeMgTEJyRyRU1yASAwO4xw+dY
Dn3tGmFGymcV/72o4ic9JFKQd8cTSxPjEJS3qqzMkEAtyI/zPBmKxj/Tce50OH40
6FSZq1uU21ZQzszwSHISwgFtNr75laUSK4Z1te5OhPOOz+C7O9YqHvqS/1jwhPj2
tMd8X17fRW3UTUBlBj+zqxqiEGBl/Yk2AvKrJIXGUtfWYCtjMJ7ieCf0kZ7NSVJx
9ewubA0gqseMD783YomZsy8LLtMKnhclJeslUOVb1oKs1q/WF1R/k6qjy9vUwYaB
nIJuHl8mxSetag==
=SVnj
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timers and timekeeping updates from Thomas Gleixner:
- Improve the VDSO build time checks to cover all dynamic relocations
VDSO does not allow dynamic relocations, but the build time check is
incomplete and fragile.
It's based on architectures specifying the relocation types to search
for and does not handle R_*_NONE relocation entries correctly.
R_*_NONE relocations are injected by some GNU ld variants if they
fail to determine the exact .rel[a]/dyn_size to cover trailing zeros.
R_*_NONE relocations must be ignored by dynamic loaders, so they
should be ignored in the build time check too.
Remove the architecture specific relocation types to check for and
validate strictly that no other relocations than R_*_NONE end up in
the VSDO .so file.
- Prefer signal delivery to the current thread for
CLOCK_PROCESS_CPUTIME_ID based posix-timers
Such timers prefer to deliver the signal to the main thread of a
process even if the context in which the timer expires is the current
task. This has the downside that it might wake up an idle thread.
As there is no requirement or guarantee that the signal has to be
delivered to the main thread, avoid this by preferring the current
task if it is part of the thread group which shares sighand.
This not only avoids waking idle threads, it also distributes the
signal delivery in case of multiple timers firing in the context of
different threads close to each other better.
- Align the tick period properly (again)
For a long time the tick was starting at CLOCK_MONOTONIC zero, which
allowed users space applications to either align with the tick or to
place a periodic computation so that it does not interfere with the
tick. The alignement of the tick period was more by chance than by
intention as the tick is set up before a high resolution clocksource
is installed, i.e. timekeeping is still tick based and the tick
period advances from there.
The early enablement of sched_clock() broke this alignement as the
time accumulated by sched_clock() is taken into account when
timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is
not longer a multiple of tick periods, which breaks applications
which relied on that behaviour.
Cure this by aligning the tick starting point to the next multiple of
tick periods, i.e 1000ms/CONFIG_HZ.
- A set of NOHZ fixes and enhancements:
* Cure the concurrent writer race for idle and IO sleeptime
statistics
The statitic values which are exposed via /proc/stat are updated
from the CPU local idle exit and remotely by cpufreq, but that
happens without any form of serialization. As a consequence
sleeptimes can be accounted twice or worse.
Prevent this by restricting the accumulation writeback to the CPU
local idle exit and let the remote access compute the accumulated
value.
* Protect idle/iowait sleep time with a sequence count
Reading idle/iowait sleep time, e.g. from /proc/stat, can race
with idle exit updates. As a consequence the readout may result
in random and potentially going backwards values.
Protect this by a sequence count, which fixes the idle time
statistics issue, but cannot fix the iowait time problem because
iowait time accounting races with remote wake ups decrementing
the remote runqueues nr_iowait counter. The latter is impossible
to fix, so the only way to deal with that is to document it
properly and to remove the assertion in the selftest which
triggers occasionally due to that.
* Restructure struct tick_sched for better cache layout
* Some small cleanups and a better cache layout for struct
tick_sched
- Implement the missing timer_wait_running() callback for POSIX CPU
timers
For unknown reason the introduction of the timer_wait_running()
callback missed to fixup posix CPU timers, which went unnoticed for
almost four years.
While initially only targeted to prevent livelocks between a timer
deletion and the timer expiry function on PREEMPT_RT enabled kernels,
it turned out that fixing this for mainline is not as trivial as just
implementing a stub similar to the hrtimer/timer callbacks.
The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled
systems there is a livelock issue independent of RT.
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU
timers out from hard interrupt context to task work, which is handled
before returning to user space or to a VM. The expiry mechanism moves
the expired timers to a stack local list head with sighand lock held.
Once sighand is dropped the task can be preempted and a task which
wants to delete a timer will spin-wait until the expiry task is
scheduled back in. In the worst case this will end up in a livelock
when the preempting task and the expiry task are pinned on the same
CPU.
The timer wheel has a timer_wait_running() mechanism for RT, which
uses a per CPU timer-base expiry lock which is held by the expiry
code and the task waiting for the timer function to complete blocks
on that lock.
This does not work in the same way for posix CPU timers as there is
no timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry
lock can be used too in a slightly different way.
Add a per task mutex to struct posix_cputimers_work, let the expiry
task hold it accross the expiry function and let the deleting task
which waits for the expiry to complete block on the mutex.
In the non-contended case this results in an extra
mutex_lock()/unlock() pair on both sides.
This avoids spin-waiting on a task which is scheduled out, prevents
the livelock and cures the problem for RT and !RT systems
* tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Implement the missing timer_wait_running callback
selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
selftests/proc: Remove idle time monotonicity assertions
MAINTAINERS: Remove stale email address
timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
timers/nohz: Add a comment about broken iowait counter update race
timers/nohz: Protect idle/iowait sleep time under seqcount
timers/nohz: Only ever update sleeptime from idle exit
timers/nohz: Restructure and reshuffle struct tick_sched
tick/common: Align tick period with the HZ tick.
selftests/timers/posix_timers: Test delivery of signals across threads
posix-timers: Prefer delivery of signals to the current thread
vdso: Improve cmd_vdso_check to check all dynamic relocations
Make use of the set_direct_map() calls for module allocations.
In particular:
- All changes to read-only permissions in kernel VA mappings are also
applied to the direct mapping. Note that execute permissions are
intentionally not applied to the direct mapping in order to make
sure that all allocated pages within the direct mapping stay
non-executable
- module_alloc() passes the VM_FLUSH_RESET_PERMS to __vmalloc_node_range()
to make sure that all implicit permission changes made to the direct
mapping are reset when the allocated vm area is freed again
Side effects: the direct mapping will be fragmented depending on how many
vm areas with VM_FLUSH_RESET_PERMS and/or explicit page permission changes
are allocated and freed again.
For example, just after boot of a system the direct mapping statistics look
like:
$cat /proc/meminfo
...
DirectMap4k: 111628 kB
DirectMap1M: 16665600 kB
DirectMap2G: 0 kB
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390 supports ARCH_HAS_SET_DIRECT_MAP, therefore wire up the
memfd_secret system call, which depends on it.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
relocate_kernel.S seems to be the only assembler file which doesn't
follow the standard way of indentation. Adjust this for the sake of
consistency.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consistently use the SYM* family of macros instead of the
deprecated ENTRY(), ENDPROC(), etc. family of macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
To allow calling of DAT-off code from kernel the stack needs
to be switched to nodat_stack (or other stack mapped as 1:1).
Before call_nodat() macro was introduced that was necessary
to provide the very same memory address for STNSM and STOSM
instructions. If the kernel would stay on a random stack
(e.g. a virtually mapped one) then a virtual address provided
for STNSM instruction could differ from the physical address
needed for the corresponding STOSM instruction.
After call_nodat() macro is introduced the kernel stack does
not need to be mapped 1:1 anymore, since the macro stores the
physical memory address of return PSW in a register before
entering DAT-off mode. This way the return LPSWE instruction
is able to pick the correct memory location and restore the
DAT-on mode. That however might fail in case the 16-byte return
PSW happened to cross page boundary: PSW mask and PSW address
could end up in two separate non-contiguous physical pages.
Align the return PSW on 16-byte boundary so it always fits
into a single physical page. As result any stack (including
the virtually mapped one) could be used for calling DAT-off
code and prior switching to nodat_stack becomes unnecessary.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Calling kdump kernel is a two-step process that involves
invocation of the purgatory code: first time - to verify
the new kernel checksum and second time - to call the new
kernel itself.
The purgatory code operates on real addresses and does not
expect any memory protection. Therefore, before the purgatory
code is entered the DAT mode is always turned off. However,
it is only restored upon return from the new kernel checksum
verification. In case the purgatory was called to start the
new kernel and failed the control is returned to the old
kernel, but the DAT mode continues staying off.
The new kernel start failure is unlikely and leads to the
disabled wait state anyway. Still that poses a risk, since
the kernel code in general is not DAT-off safe and even
calling the disabled_wait() function might crash.
Introduce call_nodat() macro that allows entering DAT-off
mode, calling an arbitrary function and restoring DAT mode
back on. Switch all invocations of DAT-off code to that
macro and avoid the above described scenario altogether.
Name the call_nodat() macro in small letters after the
already existing call_on_stack() and put it to the same
header file.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
[hca@linux.ibm.com: some small modifications to call_nodat() macro]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Avoid unnecessary run-time and compile-time type
conversions of do_start_kdump() function return
value and parameter.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The kernel code is not guaranteed DAT-off mode safe.
Turn the DAT mode off immediately before entering the
purgatory.
Further, to avoid subtle side effects reset the system
immediately before turning DAT mode off while making
all necessary preparations in advance.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove function validate_ctr_auth() and replace this very small
function by its body.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Function validate_ctr_version() first parameter is a pointer to
a large structure, but only member hw_perf_event::config is used.
Supply this structure member value in the function invocation.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The CPU measurement facility counter information instruction qctri()
retrieves information about the available counter sets.
The information varies between machine generations, but is constant
when running on a particular machine.
For example the CPU measurement facility counter first and second
version numbers determine the amount of counters in a counter
set. This information never changes.
The counter sets are identical for all CPUs in the system. It does
not matter which CPU performs the instruction.
Authorization control of the CPU Measurement facility can only
be changed in the activation profile while the LPAR is not running.
Retrieve the CPU measurement counter information at device driver
initialization time and use its constant values.
Function validate_ctr_version() verifies if a user provided
CPU Measurement counter facility counter is valid and defined.
It now uses the newly introduced static CPU counter facility
information.
To avoid repeated recalculation of the counter set sizes (numbers of
counters per set), which never changes on a running machine,
calculate the counter set size once at device driver initialization
and store the result in an array. Functions cpum_cf_make_setsize()
and cpum_cf_read_setsize() are introduced.
Finally remove cpu_cf_events::info member and use the static CPU
counter facility information instead.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation for improving objtool's handling of weak noreturn
functions, mark start_kernel(), arch_call_rest_init(), and rest_init()
__noreturn.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/7194ed8a989a85b98d92e62df660f4a90435a723.1681342859.git.jpoimboe@kernel.org
Simplify pr_err() statement into one line and omit return statement.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Struct s390_ctrset_read userdata is filled by ioctl_read operation
using put_user/copy_to_user. However, the ctrset->data value access
is not performed anywhere during the ioctl_read operation.
Remove unnecessary copy_from_user() call.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When function cfset_all_copy() fails, also log the bad return code
in the debug statement (when turned on).
No functional change
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This is the s390 variant of commit 7dfac3c5f4 ("arm64: module: create
module allocations without exec permissions"):
"The core code manages the executable permissions of code regions of
modules explicitly. It is no longer necessary to create the module vmalloc
regions with RWX permissions. So create them with RW- permissions instead,
which is preferred from a security perspective."
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The ftrace code assumes at two places that module_alloc() returns
executable memory. While this is currently true, this will be changed
with a subsequent patch to follow other architectures which implement
ARCH_HAS_STRICT_MODULE_RWX.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Given that set_memory_rox() and set_memory_rwnx() exist, it is possible
to get rid of all open coded __set_memory() usages and replace them with
proper helper calls everywhere.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Diag 308 subcodes expect a physical address as their parameter.
This currently is not a bug, but in the future physical and virtual
addresses might differ.
Fix the confusion by doing a virtual-to-physical conversion in the
exported diag308() and leave the assembly wrapper __diag308() alone.
Note that several callers pass NULL as addr, so check for the case when
NULL is passed and pass 0 to hardware since virt_to_phys(0) might be
nonzero.
Suggested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Randomize the load address of modules in the kernel to make KASLR effective
for modules.
This is the s390 variant of commit e2b32e6785 ("x86, kaslr: randomize
module base load address").
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Just like other architectures provide a kaslr_enabled() function, instead
of directly accessing a global variable.
Also pass the renamed __kaslr_enabled variable from the decompressor to the
kernel, so that kalsr_enabled() is available there too. This will be used
by a subsequent patch which randomizes the module base load address.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add support for the stackleak feature. Whenever the kernel returns to user
space the kernel stack is filled with a poison value.
Enabling this feature is quite expensive: e.g. after instrumenting the
getpid() system call function to have a 4kb stack the result is an
increased runtime of the system call by a factor of 3.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Allocate early async stack like other early stacks and get rid of
arch_early_irq_init(). This way the async stack is allocated earlier,
and handled like all other stacks.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390 is the only architecture which switches from the initial stack to a
later on allocated different stack for the first process.
This is (at least) problematic for the stackleak feature, which instruments
functions to save the current stackpointer within the task structure of the
running process.
The stackleak code compares stack pointers of the current process - and
doesn't expect that the kernel stack of a task can change. Even though the
stackleak feature itself will not cause any harm, the assumption about
kernel stacks being consistent is there, and only s390 doesn't follow that.
Therefore switch back to use init_thread_union, just like all other
architectures.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make sure the lowcore kernel stack pointer reflects the kernel stack of the
current task as early as possible, instead of having a NULL pointer there.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make STACK_INIT_OFFSET also available for assembler code, and
use it everywhere instead of open-coding it at several places.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The pattern for all in_<type>_stack() functions is the same; especially
also the size of all stacks is the same. Simplify the code by passing only
the stack address to the generic in_stack() helper, which then can assume a
THREAD_SIZE sized stack.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently, exception tables are marked as ro_after_init. However,
since they are sorted during compile time using scripts/sorttable,
they can be moved to RO_DATA using the RO_EXCEPTION_TABLE_ALIGN macro,
which is specifically designed for this purpose.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since commit 4efd417f29 ("s390: raise minimum supported machine
generation to z10"), the long-displacement facility is assumed and
required for the kernel. Clean up a couple of places in the entry code,
where long-displacement could be used directly instead of using a base
register.
However, there are still a few other places where a base register has
to be used to extend short-displacement for the second lowcore page
access. Notably, boot/head.S still has to be built for z900, and in
mcck_int_handler, spt and lbear, which don't have long-displacements,
but need to access save areas at the second lowcore page.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
We need the fixes in here for testing, as well as the driver core
changes for documentation updates to build on.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Return -EFAULT if put_user() for the PTRACE_GET_LAST_BREAK
request fails, instead of silently ignoring it.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This flag is used to process only fully populated sampling buffers
when an sampling event is stopped on a CPU. By default the last sampling
buffer is also scanned for samples even if the sampling block full
indicator is not set in the trailer entry of a sampling buffer page.
This flag can be set via perf_event_attr::config1 field. It was never
used and never documented. It is useless now.
With PERF_CPUM_SF_FULL_BLOCKS:
When a process is scheduled off the CPU, the sampling is stopped and
the samples are copied to the perf ring buffer and marked invalid.
When stopped at the last full sample buffer page (which is
achieved with the PERF_CPUM_SF_FULL_BLOCKS options), the hardware
sampling will resume at the first free sample entry in the current,
partially filled sample buffer.
Without PERF_CPUM_SF_FULL_BLOCKS (default behavior):
The partially filled last sample buffer is scanned and valid samples
are saved to the perf ring buffer. The valid samples are marked invalid.
The sampling is resumed when the process is scheduled on this CPU.
Again the hardware sampling will resume at the first free sample entry in
the current, partially filled sample buffer.
Now the next interrupt handler invocation scans the
full sample block and saves the valid samples to the ring buffer.
It omits the invalid samples at the top of the buffer.
The default behavior is fully sufficient, therefore remove this feature.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.
Changes to include the declaration of the tracepoint were driven by the
following coccinelle script:
@func_use@
@@
smp_send_reschedule(...);
@include@
@@
#include <trace/events/ipi.h>
@no_include depends on func_use && !include@
@@
#include <...>
+
+ #include <trace/events/ipi.h>
[csky bits]
[riscv bits]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
The actual intention is that no dynamic relocation exists in the VDSO. For
this the VDSO build validates that the resulting .so file does not have any
relocations which are specified via $(ARCH_REL_TYPE_ABS) per architecture,
which is fragile as e.g. ARM64 lacks an entry for R_AARCH64_RELATIVE. Aside
of that ARCH_REL_TYPE_ABS is a misnomer as it checks for relative
relocations too.
However, some GNU ld ports produce unneeded R_*_NONE relocation entries. If
a port fails to determine the exact .rel[a].dyn size, the trailing zeros
become R_*_NONE relocations. E.g. ld's powerpc port recently fixed
https://sourceware.org/bugzilla/show_bug.cgi?id=29540). R_*_NONE are
generally a no-op in the dynamic loaders. So just ignore them.
Remove the ARCH_REL_TYPE_ABS defines and just validate that the resulting
.so file does not contain any R_* relocation entries except R_*_NONE.
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for aarch64
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> # for vDSO, aarch64
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lore.kernel.org/r/20230310190750.3323802-1-maskray@google.com
The ftrace selftest code has a trace_direct_tramp() function which it
uses as a direct call trampoline. This happens to work on x86, since the
direct call's return address is in the usual place, and can be returned
to via a RET, but in general the calling convention for direct calls is
different from regular function calls, and requires a trampoline written
in assembly.
On s390, regular function calls place the return address in %r14, and an
ftrace patch-site in an instrumented function places the trampoline's
return address (which is within the instrumented function) in %r0,
preserving the original %r14 value in-place. As a regular C function
will return to the address in %r14, using a C function as the trampoline
results in the trampoline returning to the caller of the instrumented
function, skipping the body of the instrumented function.
Note that the s390 issue is not detcted by the ftrace selftest code, as
the instrumented function is trivial, and returning back into the caller
happens to be equivalent.
On arm64, regular function calls place the return address in x30, and
an ftrace patch-site in an instrumented function saves this into r9
and places the trampoline's return address (within the instrumented
function) in x30. A regular C function will return to the address in
x30, but will not restore x9 into x30. Consequently, using a C function
as the trampoline results in returning to the trampoline's return
address having corrupted x30, such that when the instrumented function
returns, it will return back into itself.
To avoid future issues in this area, remove the trace_direct_tramp()
function, and require that each architecture with direct calls provides
a stub trampoline, named ftrace_stub_direct_tramp. This can be written
to handle the architecture's trampoline calling convention, and in
future could be used elsewhere (e.g. in the ftrace ops sample, to
measure the overhead of direct calls), so we may as well always build it
in.
Link: https://lkml.kernel.org/r/20230321140424.345218-8-revest@chromium.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Florent Revest <revest@chromium.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Use __ALIGN instead of open coded .align statement to make sure that
vdso code follows global kernel function alignment rules.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move the ftrace hotpatch trampolines to mcount.S. This allows to make
use of the standard SYM_CODE macros which again makes sure that the
hotpatch trampolines follow the function alignment rules of the rest
of the kernel.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Vasily Gorbik says:
===================
Combine and generalize all methods for finding unused memory in
decompressor, while decreasing complexity, add memory holes support,
while improving error handling (especially in low-memory conditions)
and debug-ability.
===================
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Since regular paging structs are initialized in decompressor already
move KASAN shadow mapping to decompressor as well. This helps to avoid
allocating KASAN required memory in 1 large chunk, de-duplicate paging
structs creation code and start the uncompressed kernel with KASAN
instrumentation right away. This also allows to avoid all pitfalls
accidentally calling KASAN instrumented code during KASAN initialization.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently several approaches for finding unused memory in decompressor
are utilized. While "safe_addr" grows towards higher addresses, vmem
code allocates paging structures top down. The former requires careful
ordering. In addition to that ipl report handling code verifies potential
intersections with secure boot certificates on its own. Neither of two
approaches are memory holes aware and consistent with each other in low
memory conditions.
To solve that, existing approaches are generalized and combined
together, as well as online memory ranges are now taken into
consideration.
physmem_info has been extended to contain reserved memory ranges. New
set of functions allow to handle reserves and find unused memory.
All reserves and memory allocations are "typed". In case of out of
memory condition decompressor fails with detailed info on current
reserved ranges and usable online memory.
Linux version 6.2.0 ...
Kernel command line: ... mem=100M
Our of memory allocating 100000 bytes 100000 aligned in range 0:5800000
Reserved memory ranges:
0000000000000000 0000000003e33000 DECOMPRESSOR
0000000003f00000 00000000057648a3 INITRD
00000000063e0000 00000000063e8000 VMEM
00000000063eb000 00000000063f4000 VMEM
00000000063f7800 0000000006400000 VMEM
0000000005800000 0000000006300000 KASAN
Usable online memory ranges (info source: sclp read info [3]):
0000000000000000 0000000006400000
Usable online memory total: 6400000 Reserved: 61b10a3 Free: 24ef5d
Call Trace:
(sp:000000000002bd58 [<0000000000012a70>] physmem_alloc_top_down+0x60/0x14c)
sp:000000000002bdc8 [<0000000000013756>] _pa+0x56/0x6a
sp:000000000002bdf0 [<0000000000013bcc>] pgtable_populate+0x45c/0x65e
sp:000000000002be90 [<00000000000140aa>] setup_vmem+0x2da/0x424
sp:000000000002bec8 [<0000000000011c20>] startup_kernel+0x428/0x8b4
sp:000000000002bf60 [<00000000000100f4>] startup_normal+0xd4/0xd4
physmem_alloc_range allows to find free memory in specified range. It
should be used for one time allocations only like finding position for
amode31 and vmlinux.
physmem_alloc_top_down can be used just like physmem_alloc_range, but
it also allows multiple allocations per type and tries to merge sequential
allocations together. Which is useful for paging structures allocations.
If sequential allocations cannot be merged together they are "chained",
allowing easy per type reserved ranges enumeration and migration to
memblock later. Extra "struct reserved_range" allocated for chaining are
not tracked or reserved but rely on the fact that both
physmem_alloc_range and physmem_alloc_top_down search for free memory
only below current top down allocator position. All reserved ranges
should be transferred to memblock before memblock allocations are
enabled.
The startup code has been reordered to delay any memory allocations until
online memory ranges are detected and occupied memory ranges are marked as
reserved to be excluded from follow-up allocations.
Ipl report certificates are a special case, ipl report certificates list
is checked together with other memory reserves until certificates are
saved elsewhere.
KASAN required memory for shadow memory allocation and mapping is reserved
as 1 large chunk which is later passed to KASAN early initialization code.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In preparation to extending mem_detect with additional information like
reserved ranges rename it to more generic physmem_info. This new naming
also help to avoid confusion by using more exact terms like "physmem
online ranges", etc.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
check_image_bootable() has been introduced with commit 627c9b6205
("s390/boot: block uncompressed vmlinux booting attempts") to make sure
that users don't try to boot uncompressed vmlinux ELF image in qemu. It
used to be possible quite some time ago. That commit prevented confusion
with uncompressed vmlinux image starting to boot and even printing
kernel messages until it crashed. Users might have tried to report the
problem without realizing they are doing something which was not intended.
Since commit f1d3c53237 ("s390/boot: move sclp early buffer from fixed
address in asm to C") check_image_bootable() doesn't function properly
anymore, as well as booting uncompressed vmlinux image in qemu doesn't
really produce any output and crashes. Moving forward it doesn't make
sense to fix check_image_bootable() anymore, so simply remove it.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
report_user_fault() currently does not show which library last_break
points to. Call print_vma_addr() to find out; the output now looks
like this:
Last Breaking-Event-Address:
[<000003ffaa2a56e4>] libc.so.6[3ffaa180000+251000]
For kernel it's unchanged:
Last Breaking-Event-Address:
[<000000000030fd06>] trace_hardirqs_on+0x56/0xc8
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-3-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is no need to declare an extra tables to just create directory,
this can be easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230310234525.3986352-2-mcgrof@kernel.org
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20230313182918.1312597-19-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20230313182918.1312597-18-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Setting and ->psw.addr in childregs of kernel thread is a rudiment of
the old kernel_thread()/kernel_execve() implementation. Mainline hadn't
been using them since 2012.
And clarify the assignments to frame->sf.gprs - the array stores
grp6..gpr15 values to be set by __switch_to(), so frame->sf.gprs[5]
actually affects grp11, etc.
Better spell that as frame->sf.gprs[11 - 6]...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/ZAU6BYFisE8evmYf@ZenIV
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There is no point in changing branch prediction state of a cpu shortly
before it enters stop state. Therefore remove __bpon().
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
TIF_ISOLATE_BP is unused since it was introduced with commit 6b73044b2b
("s390: run user space and KVM guests with modified branch prediction").
Given that there is no use case remove it again.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When leaving interpretive execution because of a program check BPENTER
should be called like it is done on interrupt exit as well.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
module_layout manages different types of memory (text, data, rodata, etc.)
in one allocation, which is problematic for some reasons:
1. It is hard to enable CONFIG_STRICT_MODULE_RWX.
2. It is hard to use huge pages in modules (and not break strict rwx).
3. Many archs uses module_layout for arch-specific data, but it is not
obvious how these data are used (are they RO, RX, or RW?)
Improve the scenario by replacing 2 (or 3) module_layout per module with
up to 7 module_memory per module:
MOD_TEXT,
MOD_DATA,
MOD_RODATA,
MOD_RO_AFTER_INIT,
MOD_INIT_TEXT,
MOD_INIT_DATA,
MOD_INIT_RODATA,
and allocating them separately. This adds slightly more entries to
mod_tree (from up to 3 entries per module, to up to 7 entries per
module). However, this at most adds a small constant overhead to
__module_address(), which is expected to be fast.
Various archs use module_layout for different data. These data are put
into different module_memory based on their location in module_layout.
IOW, data that used to go with text is allocated with MOD_MEM_TYPE_TEXT;
data that used to go with data is allocated with MOD_MEM_TYPE_DATA, etc.
module_memory simplifies quite some of the module code. For example,
ARCH_WANTS_MODULES_DATA_IN_VMALLOC is a lot cleaner, as it just uses a
different allocator for the data. kernel/module/strict_rwx.c is also
much cleaner with module_memory.
Signed-off-by: Song Liu <song@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Before commit 076cbf5d2163 ("x86/xen: don't let xen_pv_play_dead()
return"), in Xen, when a previously offlined CPU was brought back
online, it unexpectedly resumed execution where it left off in the
middle of the idle loop.
There were some hacks to make that work, but the behavior was surprising
as do_idle() doesn't expect an offlined CPU to return from the dead (in
arch_cpu_idle_dead()).
Now that Xen has been fixed, and the arch-specific implementations of
arch_cpu_idle_dead() also don't return, give it a __noreturn attribute.
This will cause the compiler to complain if an arch-specific
implementation might return. It also improves code generation for both
caller and callee.
Also fixes the following warning:
vmlinux.o: warning: objtool: do_idle+0x25f: unreachable instruction
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/60d527353da8c99d4cf13b6473131d46719ed16d.1676358308.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
- Add empty command line parameter handling stubs to kernel for all command
line parameters which are handled in the decompressor. This avoids
invalid "Unknown kernel command line parameters" messages from the
kernel, and also avoids that these will be incorrectly passed to user
space. This caused already confusion, therefore add the empty stubs
- Add missing phys_to_virt() handling to machine check handler
- Introduce and use a union to be used for zcrypt inline assemblies. This
makes sure that only a register wide member of the union is passed as
input and output parameter to inline assemblies, while usual C code uses
other members of the union to access bit fields of it
- Add and use a READ_ONCE_ALIGNED_128() macro, which can be used to
atomically read a 128-bit value from memory. This replaces the (mis-)use
of the 128-bit cmpxchg operation to do the same in cpum_sf code.
Currently gcc does not generate the used lpq instruction if __READ_ONCE()
is used for aligned 128-bit accesses, therefore use this s390 specific
helper
- Simplify machine check handler code if a task needs to be killed because
of e.g. register corruption due to a machine malfunction
- Perform CPU reset to clear pending interrupts and TLB entries on an
already stopped target CPU before delegating work to it
- Generate arch/s390/boot/vmlinux.map link map for the decompressor, when
CONFIG_VMLINUX_MAP is enabled for debugging purposes
- Fix segment type handling for dcssblk devices. It incorrectly always
returned type "READ/WRITE" even for read-only segements, which can result
in a kernel panic if somebody tries to write to a read-only device
- Sort config S390 select list again
- Fix two kprobe reenter bugs revealed by a recently added kprobe kunit
test
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmQB5LgACgkQIg7DeRsp
bsLvPRAAojiU7uZTDKL58Ta67D7gyWWZa6ClI7MHrtEwEM70cOZHmOYToFuyapbT
Af+PBg8YcqDDHxGf+HNuxnYRI4y2hGJUyjURZmGuCzPPJ5IGvy+8/X0d0wsvclqC
fWeTtVvZIxdu5NdBQTs1iMCCxPXg9OfGDVCU+P/pnrlKskHorA6a2ZWGdkpae3fj
kd1pgLLcZq0Jf8+Q6bSLBZjm45R/Q+qEpd8eIOncP7xrQTpLPkLxRSrnZLPtYdLu
7uUDtzEC63x7/0Weri71ZQqbc22CzHNiF6FZj5sm78KsWmparmUUsns/U5CrEO4J
e85/kPXjgfvwMZdQCXSPHZJ8CGTRzN6Zl/Lym9W5+9X6cgb1WNVNCATGZIG3HueA
MYnXmLkmNmaEgsckcYbCVPR9SjIwVibIL+/32hH63oUnc8WnrbVlai/G57j0dRUg
+kxVvwbxaDyesRbF7XKe4PssYZJxpO+QFIhnvo6EEghUuc32o+WVNqjoU04/VBaa
i2FtbARGDWWs5+EwS9td/xHiFPQHpFCJHrMbxacEQmtDbOTOhoLbxNlP17YfQ2oD
ch0QCZ1w0VwE9ehMRLmZQCem51n7aPt11tKOEFKsck4Mb1dXVLW9LWwNpjIBXrIo
2g4U4Ytj2I9HbxWMJmKmLsP4NHn2oG3A+uGUkSWf/wdDTfylB30=
=zZId
-----END PGP SIGNATURE-----
Merge tag 's390-6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Add empty command line parameter handling stubs to kernel for all
command line parameters which are handled in the decompressor. This
avoids invalid "Unknown kernel command line parameters" messages from
the kernel, and also avoids that these will be incorrectly passed to
user space. This caused already confusion, therefore add the empty
stubs
- Add missing phys_to_virt() handling to machine check handler
- Introduce and use a union to be used for zcrypt inline assemblies.
This makes sure that only a register wide member of the union is
passed as input and output parameter to inline assemblies, while
usual C code uses other members of the union to access bit fields of
it
- Add and use a READ_ONCE_ALIGNED_128() macro, which can be used to
atomically read a 128-bit value from memory. This replaces the
(mis-)use of the 128-bit cmpxchg operation to do the same in cpum_sf
code. Currently gcc does not generate the used lpq instruction if
__READ_ONCE() is used for aligned 128-bit accesses, therefore use
this s390 specific helper
- Simplify machine check handler code if a task needs to be killed
because of e.g. register corruption due to a machine malfunction
- Perform CPU reset to clear pending interrupts and TLB entries on an
already stopped target CPU before delegating work to it
- Generate arch/s390/boot/vmlinux.map link map for the decompressor,
when CONFIG_VMLINUX_MAP is enabled for debugging purposes
- Fix segment type handling for dcssblk devices. It incorrectly always
returned type "READ/WRITE" even for read-only segements, which can
result in a kernel panic if somebody tries to write to a read-only
device
- Sort config S390 select list again
- Fix two kprobe reenter bugs revealed by a recently added kprobe kunit
test
* tag 's390-6.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/kprobes: fix current_kprobe never cleared after kprobes reenter
s390/kprobes: fix irq mask clobbering on kprobe reenter from post_handler
s390/Kconfig: sort config S390 select list again
s390/extmem: return correct segment type in __segment_load()
s390/decompressor: add link map saving
s390/smp: perform cpu reset before delegating work to target cpu
s390/mcck: cleanup user process termination path
s390/cpum_sf: use READ_ONCE_ALIGNED_128() instead of 128-bit cmpxchg
s390/rwonce: add READ_ONCE_ALIGNED_128() macro
s390/ap,zcrypt,vfio: introduce and use ap_queue_status_reg union
s390/nmi: fix virtual-physical address confusion
s390/setup: do not complain about parameters handled in decompressor
Recent test_kprobe_missed kprobes kunit test uncovers the following
problem. Once kprobe is triggered from another kprobe (kprobe reenter),
all future kprobes on this cpu are considered as kprobe reenter, thus
pre_handler and post_handler are not being called and kprobes are counted
as "missed".
Commit b9599798f9 ("[S390] kprobes: activation and deactivation")
introduced a simpler scheme for kprobes (de)activation and status
tracking by using push_kprobe/pop_kprobe, which supposed to work for
both initial kprobe entry as well as kprobe reentry and helps to avoid
handling those two cases differently. The problem is that a sequence of
calls in case of kprobes reenter:
push_kprobe() <- NULL (current_kprobe)
push_kprobe() <- kprobe1 (current_kprobe)
pop_kprobe() -> kprobe1 (current_kprobe)
pop_kprobe() -> kprobe1 (current_kprobe)
leaves "kprobe1" as "current_kprobe" on this cpu, instead of setting it
to NULL. In fact push_kprobe/pop_kprobe can only store a single state
(there is just one prev_kprobe in kprobe_ctlblk). Which is a hack but
sufficient, there is no need to have another prev_kprobe just to store
NULL. To make a simple and backportable fix simply reset "prev_kprobe"
when kprobe is poped from this "stack". No need to worry about
"kprobe_status" in this case, because its value is only checked when
current_kprobe != NULL.
Cc: stable@vger.kernel.org
Fixes: b9599798f9 ("[S390] kprobes: activation and deactivation")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Recent test_kprobe_missed kprobes kunit test uncovers the following error
(reported when CONFIG_DEBUG_ATOMIC_SLEEP is enabled):
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 662, name: kunit_try_catch
preempt_count: 0, expected: 0
RCU nest depth: 0, expected: 0
no locks held by kunit_try_catch/662.
irq event stamp: 280
hardirqs last enabled at (279): [<00000003e60a3d42>] __do_pgm_check+0x17a/0x1c0
hardirqs last disabled at (280): [<00000003e3bd774a>] kprobe_exceptions_notify+0x27a/0x318
softirqs last enabled at (0): [<00000003e3c5c890>] copy_process+0x14a8/0x4c80
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 46 PID: 662 Comm: kunit_try_catch Tainted: G N 6.2.0-173644-g44c18d77f0c0 #2
Hardware name: IBM 3931 A01 704 (LPAR)
Call Trace:
[<00000003e60a3a00>] dump_stack_lvl+0x120/0x198
[<00000003e3d02e82>] __might_resched+0x60a/0x668
[<00000003e60b9908>] __mutex_lock+0xc0/0x14e0
[<00000003e60bad5a>] mutex_lock_nested+0x32/0x40
[<00000003e3f7b460>] unregister_kprobe+0x30/0xd8
[<00000003e51b2602>] test_kprobe_missed+0xf2/0x268
[<00000003e51b5406>] kunit_try_run_case+0x10e/0x290
[<00000003e51b7dfa>] kunit_generic_run_threadfn_adapter+0x62/0xb8
[<00000003e3ce30f8>] kthread+0x2d0/0x398
[<00000003e3b96afa>] __ret_from_fork+0x8a/0xe8
[<00000003e60ccada>] ret_from_fork+0xa/0x40
The reason for this error report is that kprobes handling code failed
to restore irqs.
The problem is that when kprobe is triggered from another kprobe
post_handler current sequence of enable_singlestep / disable_singlestep
is the following:
enable_singlestep <- original kprobe (saves kprobe_saved_imask)
enable_singlestep <- kprobe triggered from post_handler (clobbers kprobe_saved_imask)
disable_singlestep <- kprobe triggered from post_handler (restores kprobe_saved_imask)
disable_singlestep <- original kprobe (restores wrong clobbered kprobe_saved_imask)
There is just one kprobe_ctlblk per cpu and both calls saves and
loads irq mask to kprobe_saved_imask. To fix the problem simply move
resume_execution (which calls disable_singlestep) before calling
post_handler. This also fixes the problem that post_handler is called
with pt_regs which were not yet adjusted after single-stepping.
Cc: stable@vger.kernel.org
Fixes: 4ba069b802 ("[S390] add kprobes support.")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Clear CPU state (e.g. all TLB entries, prefetched instructions, etc.)
of the target CPU, however without clearing register contents before
starting any work on it.
This puts the target CPU in a more defined state compared to the
current Stop + Restart sigp orders.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
If a machine check interrupt hits while user process is
running __s390_handle_mcck() helper function is called
directly from the interrupt handler and terminates the
current process by calling make_task_dead() routine.
The make_task_dead() is not allowed to be called from
interrupt context which forces the machine check handler
switch to the kernel stack and enable local interrupts
first.
The __s390_handle_mcck() could also be called to service
pending work, but this time from the external interrupts
handler. It is the machine check handler that establishes
the work and schedules the external interrupt, therefore
the machine check interrupt itself should be disabled
while reading out the corresponding variable:
local_mcck_disable();
mcck = *this_cpu_ptr(&cpu_mcck);
memset(this_cpu_ptr(&cpu_mcck), 0, sizeof(mcck));
local_mcck_enable();
However, local_mcck_disable() does not have effect when
__s390_handle_mcck() is called directly form the machine
check handler, since the machine check interrupt is still
disabled. Therefore, it is not the opening bracket to the
following local_mcck_enable() function.
Simplify the user process termination flow by scheduling
the external interrupt and killing the affected process
from the interrupt context.
Assume a kernel-generated signal is always delivered and
ignore a value returned by do_send_sig_info() funciton.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use READ_ONCE_ALIGNED_128() to read the previous value in front of a
128-bit cmpxchg loop, instead of (mis-)using a 128-bit cmpxchg operation to
do the same.
This makes the code more readable and is faster.
Link: https://lore.kernel.org/r/20230224100237.3247871-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When a machine check is received while in SIE, it is reinjected into the
guest in some cases. The respective code needs to access the sie_block,
which is taken from the backed up R14.
Since reinjection only occurs while we are in SIE (i.e. between the
labels sie_entry and sie_leave in entry.S and thus if CIF_MCCK_GUEST is
set), the backed up R14 will always contain a physical address in
s390_backup_mcck_info.
This currently works, because virtual and physical addresses are
the same.
Add phys_to_virt() to resolve the virtual-physical confusion.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20230216121208.4390-2-nrb@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently there are several kernel command line parameters which are
only parsed and handled in decompressor and not known to the kernel.
This leads to the following error message during kernel boot:
Unknown kernel command line parameters "mem=3G nokaslr", will be passed
to user space.
To avoid confusion, register those parameters with an empty stub so that
kernel does not complain about them.
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Change V=1 option to print both short log and full command log.
- Allow V=1 and V=2 to be combined as V=12.
- Make W=1 detect wrong .gitignore files.
- Tree-wide cleanups for unused command line arguments passed to Clang.
- Stop using -Qunused-arguments with Clang.
- Make scripts/setlocalversion handle only correct release tags instead
of any arbitrary annotated tag.
- Create Debian and RPM source packages without cleaning the source tree.
- Various cleanups for packaging.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmP7iHoVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGL/cQAK9q5rsNL5a2LgTbm89ORA+UV+ST
hrAoGo5DkJHUbVH53oPzyLynFBZPvUzLK8yjApjXkyAzy2hXYnj+vbTs0s+JVCFL
owS4NB0YP+tpHGuy8bGpWI0GMZSMwmspUteqxk86zuH8uQVAhnCaeV1/Cr6Aqj1h
2jk1FZid3/h7qEkEgu5U8soeyFnV6VhAT6Ie5yfZ2O2RdsSqPUh6vfKrgdyW4RWz
gito0SOUwvjIDfSmTnIIacUibisPRv2OW29OvmDp1aXj5rMhe3UfOznVE3NR86yl
ZbWDAIm6KYT8V1ASOoAUR80qent9IPKytThLK9BVEQCT6bsujCZMvhYhhEvO30TF
Lzsdr+FrES//xag3+hgc63FEied2xxWGQG1cRtzAhfRL9tJ03+mY1omoW6SyKqW/
Gc9PIcTgQbCIrkeL0HuAI1q3I1vkvHXInJKtGkoHh1J9aJ8v5gQpwGA+DDRUnA+A
LQSeEbT2Hf3MoF4CqZRnConvfhlMuLI+j5v54YPrhokxXmv7u807kjfwMFTiZ/+m
CJFlEMf9YRv3pi8g/AYyGAg5ZQigCwzOCRUC5kguFqzZdgnjiI907GEL804lm1Mg
lpx/HtYPyxwWEd2XyU6/C9AEIl3gm7MBd6b1tD54Tb/VmE+AvjS/O9jFYXZqnAnM
Llv4BfK/cQKwHb6o
=HpFZ
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Change V=1 option to print both short log and full command log
- Allow V=1 and V=2 to be combined as V=12
- Make W=1 detect wrong .gitignore files
- Tree-wide cleanups for unused command line arguments passed to Clang
- Stop using -Qunused-arguments with Clang
- Make scripts/setlocalversion handle only correct release tags instead
of any arbitrary annotated tag
- Create Debian and RPM source packages without cleaning the source
tree
- Various cleanups for packaging
* tag 'kbuild-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (74 commits)
kbuild: rpm-pkg: remove unneeded KERNELRELEASE from modules/headers_install
docs: kbuild: remove description of KBUILD_LDS_MODULE
.gitattributes: use 'dts' diff driver for *.dtso files
kbuild: deb-pkg: improve the usability of source package
kbuild: deb-pkg: fix binary-arch and clean in debian/rules
kbuild: tar-pkg: use tar rules in scripts/Makefile.package
kbuild: make perf-tar*-src-pkg work without relying on git
kbuild: deb-pkg: switch over to source format 3.0 (quilt)
kbuild: deb-pkg: make .orig tarball a hard link if possible
kbuild: deb-pkg: hide KDEB_SOURCENAME from Makefile
kbuild: srcrpm-pkg: create source package without cleaning
kbuild: rpm-pkg: build binary packages from source rpm
kbuild: deb-pkg: create source package without cleaning
kbuild: add a tool to list files ignored by git
Documentation/llvm: add Chimera Linux, Google and Meta datacenters
setlocalversion: use only the correct release tag for git-describe
setlocalversion: clean up the construction of version output
.gitignore: ignore *.cover and *.mbx
kbuild: remove --include-dir MAKEFLAG from top Makefile
kbuild: fix trivial typo in comment
...
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()") which
does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter". These filters provide users
with finer-grained control over DAMOS's actions. SeongJae has also done
some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series "mm:
support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap
PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with his
series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings. The previous BPF-based approach had
shortcomings. See "mm: In-kernel support for memory-deny-write-execute
(MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a per-node
basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage during
compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in ths
series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's series
"mm, arch: add generic implementation of pfn_valid() for FLATMEM" and
"fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest of
the kernel in the series "Simplify the external interface for GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the series
"mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY/PoPQAKCRDdBJ7gKXxA
jlvpAPsFECUBBl20qSue2zCYWnHC7Yk4q9ytTkPB/MMDrFEN9wD/SNKEm2UoK6/K
DmxHkn0LAitGgJRS/W9w81yrgig9tAQ=
=MlGs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
- Large cleanup of the con3270/tty3270 driver. Among others this fixes:
* Background Color Support
* ASCII Line Character Support
* VT100 Support
* Geometries other than 80x24
- Cleanup and improve cmpxchg() code. Also add cmpxchg_user_key() to
uaccess functions, which will be used by KVM to access KVM guest memory
with a specific storage key.
- Add support for user space events counting to CPUMF.
- Cleanup the vfio/ccw code, which also allows now to properly support 2K
Format-2 IDALs.
- Move kernel page table allocation and initialization to decompressor,
which finally allows to enter the kernel with dynamic address translation
enabled. This in turn allows to get rid of code with special handling in
the kernel, which has to distinguish if DAT is on or off.
- Replace kretprobe with rethook.
- Various improvements to vfio/ap queue resets:
* Use TAPQ to verify completion of a reset in progress rather than
multiple invocations of ZAPQ.
* Check TAPQ response codes when verifying successful completion of ZAPQ.
* Fix erroneous handling of some error response codes.
* Increase the maximum amount of time to wait for successful completion
of ZAPQ.
- Rework system call wrappers to get rid of alias functions, which were
only left on s390.
- Cleanup diag288_wdt watchdog driver. It has been agreed on with Guenter
Roeck that this goes upstream via the s390 tree.
- Add missing loadparm parameter handling for list-directed ECKD ipl/reipl.
- Various improvements to memory detection code.
- Remove arch_cpu_idle_time() since the current implementation is broken,
and allows user space observable accounted idle times which can
temporarily decrease.
- Add Reset DAT-Protection support: (only) allow to change PTEs from RO to
RW with a new RDP instruction. Unlike the currently used IPTE
instruction, this does not necessarily guarantee that TLBs of all CPUs
are synchronously flushed; and that remote CPUs can see spurious
protection faults. The overall improvement for not requiring an all CPU
synchronization, like it is required with IPTE, should be beneficial.
- Fix KFENCE page fault reporting.
- Smaller cleanups and improvement all over the place.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmPzWhQACgkQIg7DeRsp
bsKkdQ//QbCwDMt6T3bmi6gGgs9HRSkLOTHAlOIRuetEzRiBzm/O4Gm8NycvPspl
BIcuXmQKt+gBS44tWikKpwuhmWrAtiFUxs/M1uPfRXqjUf+ZFJinPJgtPCBa/3rv
tQkh541QxpX4K5Ks71WKv2Kh0RjaTqw5Kj+rlDBYHsxZvb28mDigINRYoVSxNUKi
dTVlR0UgdGLecXfezpvWeEAbJu6Q2pbIkOT3tNOumNqRAoUN4cbH3P0agHJdq8oj
L/++d4tfVbwL8N/VCwIVBeW/AQzA0B2UCDVz75Pd55+FFrIGVp1hn7QC9QQieomL
fzGOTrL4D9U8JkAIJqhioA1NlcN1+QW2svoMVo0N3vBJpIbzX4bZKTDxZZ26dG9H
ox7YvhsZtJA7p34X5hetoObzZcmiYJStT+BDao7q1x3oLf4G31HaP+YUDIKBPmNW
ieZa+ujYbKor1pD6ysaMVX+c1qhfX6S/V0uBAikoqMWUVUvH/ZeuSxCSfMuvWUrQ
KFuc0HnPiiIO1Ux3wN5oN33+pWCSdUcJOeg4aj0jkkFT9Ct3TOBupIGBGyhOAh6r
OTp1iqJuQjwOmkWPLyRuGMzRmDDp+hWz9qNF/DFwIV1IMi9AzJjEOg31cMVxx3gG
iM25560uqvhhUwQFbyjVgJmj00dmqMX07q8QSVOgg9oUrnRhQrc=
=MW9x
-----END PGP SIGNATURE-----
Merge tag 's390-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Large cleanup of the con3270/tty3270 driver. Among others this fixes:
- Background Color Support
- ASCII Line Character Support
- VT100 Support
- Geometries other than 80x24
- Cleanup and improve cmpxchg() code. Also add cmpxchg_user_key() to
uaccess functions, which will be used by KVM to access KVM guest
memory with a specific storage key
- Add support for user space events counting to CPUMF
- Cleanup the vfio/ccw code, which also allows now to properly support
2K Format-2 IDALs
- Move kernel page table allocation and initialization to decompressor,
which finally allows to enter the kernel with dynamic address
translation enabled. This in turn allows to get rid of code with
special handling in the kernel, which has to distinguish if DAT is on
or off
- Replace kretprobe with rethook
- Various improvements to vfio/ap queue resets:
- Use TAPQ to verify completion of a reset in progress rather than
multiple invocations of ZAPQ.
- Check TAPQ response codes when verifying successful completion of
ZAPQ.
- Fix erroneous handling of some error response codes.
- Increase the maximum amount of time to wait for successful
completion of ZAPQ
- Rework system call wrappers to get rid of alias functions, which were
only left on s390
- Cleanup diag288_wdt watchdog driver. It has been agreed on with
Guenter Roeck that this goes upstream via the s390 tree
- Add missing loadparm parameter handling for list-directed ECKD
ipl/reipl
- Various improvements to memory detection code
- Remove arch_cpu_idle_time() since the current implementation is
broken, and allows user space observable accounted idle times which
can temporarily decrease
- Add Reset DAT-Protection support: (only) allow to change PTEs from RO
to RW with a new RDP instruction. Unlike the currently used IPTE
instruction, this does not necessarily guarantee that TLBs of all
CPUs are synchronously flushed; and that remote CPUs can see spurious
protection faults. The overall improvement for not requiring an all
CPU synchronization, like it is required with IPTE, should be
beneficial
- Fix KFENCE page fault reporting
- Smaller cleanups and improvement all over the place
* tag 's390-6.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (182 commits)
s390/irq,idle: simplify idle check
s390/processor: add test_and_set_cpu_flag() and test_and_clear_cpu_flag()
s390/processor: let cpu helper functions return boolean values
s390/kfence: fix page fault reporting
s390/zcrypt: introduce ctfm field in struct CPRBX
s390: remove confusing comment from uapi types header file
vfio/ccw: remove WARN_ON during shutdown
s390/entry: remove toolchain dependent micro-optimization
s390/mem_detect: do not truncate online memory ranges info
s390/vx: remove __uint128_t type from __vector128 struct again
s390/mm: add support for RDP (Reset DAT-Protection)
s390/mm: define private VM_FAULT_* reasons from top bits
Documentation: s390: correct spelling
s390/ap: fix status returned by ap_qact()
s390/ap: fix status returned by ap_aqic()
s390: vfio-ap: tighten the NIB validity check
Revert "s390/mem_detect: do not update output parameters on failure"
s390/idle: remove arch_cpu_idle_time() and corresponding code
s390/vx: use simple assignments to access __vector128 members
s390/vx: add 64 and 128 bit members to __vector128 struct
...
- Improve the scalability of the CFS bandwidth unthrottling logic
with large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with
the generic scheduler code. Add __cpuidle methods as noinstr to
objtool's noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS,
to query previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period,
to improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- ... Misc other cleanups, fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzbJwRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIvA//ZcEaB8Z6ChLRQjM+bsaudKJu3pdLQbPK
iYbP8Da+LsAfxbEfYuGV3m+jIp0LlBOtsI/EezxQrXV+V7FvNyAX9Y00eEu/zlj8
7Jn3LMy/DBYTwH7LwVdcU0MyIVI8ZPc6WNnkx0LOtGZn8n+qfHPSDzcP3CW+a5AV
UvllPYpYyEmsX0Eby7CF4Ue8mSmbViw/xR3rNr8ZSve0c25XzKabw8O9kE3jiHxP
d/zERJoAYeDyYUEuZqhfn5dTlB4an4IjNEkAfRE5SQ09RA8Gkxsa5Ar8gob9e9M1
eQsdd4/bdhnrkM8L5qDZczqmgCTZ2bukQrxkBXhRDhLgoFxwAn77b+2ZjmIW3Lae
AyGqRcDSg1q2oxaYm5ZiuO/t26aDOZu9vPHyHRDGt95EGbZlrp+GgeePyfCigJYz
UmPdZAAcHdSymnnnlcvdG37WVvaVkpgWZzd8LbtBi23QR+Zc4WQ2IlgnUS5WKNNf
VOBcAcP6E1IslDotZDQCc2dPFFQoQQEssVooyUc5oMytm7BsvxXLOeHG+Ncu/8uc
H+U8Qn8jnqTxJbC5hkWQIJlhVKCq2FJrHxxySYTKROfUNcDgCmxboFeAcXTCIU1K
T0S+sdoTS/CvtLklRkG0j6B8N4N98mOd9cFwUV3tX+/gMLMep3hCQs5L76JagvC5
skkQXoONNaM=
=l1nN
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Improve the scalability of the CFS bandwidth unthrottling logic with
large number of CPUs.
- Fix & rework various cpuidle routines, simplify interaction with the
generic scheduler code. Add __cpuidle methods as noinstr to objtool's
noinstr detection and fix boatloads of cpuidle bugs & quirks.
- Add new ABI: introduce MEMBARRIER_CMD_GET_REGISTRATIONS, to query
previously issued registrations.
- Limit scheduler slice duration to the sysctl_sched_latency period, to
improve scheduling granularity with a large number of SCHED_IDLE
tasks.
- Debuggability enhancement on sys_exit(): warn about disabled IRQs,
but also enable them to prevent a cascade of followup problems and
repeat warnings.
- Fix the rescheduling logic in prio_changed_dl().
- Micro-optimize cpufreq and sched-util methods.
- Micro-optimize ttwu_runnable()
- Micro-optimize the idle-scanning in update_numa_stats(),
select_idle_capacity() and steal_cookie_task().
- Update the RSEQ code & self-tests
- Constify various scheduler methods
- Remove unused methods
- Refine __init tags
- Documentation updates
- Misc other cleanups, fixes
* tag 'sched-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
sched/rt: pick_next_rt_entity(): check list_entry
sched/deadline: Add more reschedule cases to prio_changed_dl()
sched/fair: sanitize vruntime of entity being placed
sched/fair: Remove capacity inversion detection
sched/fair: unlink misfit task from cpu overutilized
objtool: mem*() are not uaccess safe
cpuidle: Fix poll_idle() noinstr annotation
sched/clock: Make local_clock() noinstr
sched/clock/x86: Mark sched_clock() noinstr
x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
x86/atomics: Always inline arch_atomic64*()
cpuidle: tracing, preempt: Squash _rcuidle tracing
cpuidle: tracing: Warn about !rcu_is_watching()
cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUG
cpuidle: drivers: firmware: psci: Dont instrument suspend code
KVM: selftests: Fix build of rseq test
exit: Detect and fix irq disabled state in oops
cpuidle, arm64: Fix the ARM64 cpuidle logic
cpuidle: mvebu: Fix duplicate flags assignment
sched/fair: Limit sched slice duration
...
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmPzaHgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jYQg/+KRfobCevMQlZVnz09T3SsJ4ahJ587BL6
g2C6kobyUNfeChpFVroBkTR+yCb6Mq4xGr2nda9+2E978BYu9eanpx/u/bXNQ6NU
6YhLwgRrlFXonYn07kFfUJeELZ0W+zpPvymEN1KhTQWcrgXDfXRt2VfMwNsVxGRF
ZRyCWK+UOzSMU22FtW3I/xVLBB0vio9Y6wRC5QOpDVW5YtGwQGust7GJ53JPK43J
m2soJvWORauT+v0aqc7ggOtKd6pahVoXrDrbktxtq9N0ZGI+PubVCGevex++cXm/
B3QSf6VcMMuU6pfzxiEwRa8Whrc3XFeSDEfvMjC5v3becGNkdNBnGOJzYprwgRZJ
irb6/dSrv5P2lj6WphsO1Wzcm7EoWh8M7DVOMh/13Y/oODRdOrv48112Don9UURC
EPyvzAzizqdwdDopUmfiqUwuAXqb8uPZqCgmlz/NJkVz1/ijlfrmLgeDuf0vI7Aq
HznzzRwjFHzyCH7D+rtonFh3JDaqgaouY76tpC5yTtzKbZPlFT8kzeCvqkTMnGgH
czZnSNc/kBup0HDkNSlthK+TyrMXWKeVa8KQSY1E0NJHO4IBBCMzZywSoAaeofQK
hqfQyofX9XHmuHhCA4yIfv1XkZGlBTxpPAyDdHjgs9iJTsodSYMs8ESY08eW8DXn
Ld/35O6SylM=
=ztUT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
- Optimize perf_sample_data layout
- Prepare sample data handling for BPF integration
- Update the x86 PMU driver for Intel Meteor Lake
- Restructure the x86 uncore code to fix a SPR (Sapphire Rapids)
discovery breakage
- Fix the x86 Zhaoxin PMU driver
- Cleanups
* tag 'perf-core-2023-02-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
perf/x86/intel/uncore: Add Meteor Lake support
x86/perf/zhaoxin: Add stepping check for ZXC
perf/x86/intel/ds: Fix the conversion from TSC to perf time
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
perf/x86/uncore: Add a quirk for UPI on SPR
perf/x86/uncore: Ignore broken units in discovery table
perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
perf/x86/uncore: Factor out uncore_device_to_die()
perf/core: Call perf_prepare_sample() before running BPF
perf/core: Introduce perf_prepare_header()
perf/core: Do not pass header for sample ID init
perf/core: Set data->sample_flags in perf_prepare_sample()
perf/core: Add perf_sample_save_brstack() helper
perf/core: Add perf_sample_save_raw_data() helper
perf/core: Add perf_sample_save_callchain() helper
perf/core: Save the dynamic parts of sample data size
x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulation
perf/core: Change the layout of perf_sample_data
perf/x86/msr: Add Meteor Lake support
perf/x86/cstate: Add Meteor Lake support
...
Use the per-cpu CIF_ENABLED_WAIT flag to decide if an interrupt
occurred while a cpu was idle, instead of checking two conditions
within the old psw.
Also move clearing of the CIF_ENABLED_WAIT bit to the early interrupt
handler, which in turn makes arch_vcpu_is_preempted() also a bit more
precise, since the flag is now cleared before interrupt handlers have
been called.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of CONFIG_AS_IS_LLVM in entry.S to make the code a bit more
readable. This removes a micro-optimization, but given that the llvm IAS
limitation will likely stay, just use the version that works with llvm.
See commit 4c25f0ff63 ("s390/entry: workaround llvm's IAS limitations")
for further details.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit bf64f0517e ("s390/mem_detect: handle online memory limit
just once") introduced truncation of mem_detect online ranges
based on identity mapping size. For kdump case however the full
set of online memory ranges has to be feed into memblock_physmem_add
so that crashed system memory could be extracted.
Instead of truncating introduce a "usable limit" which is respected by
mem_detect api. Also add extra online memory ranges iterator which still
provides full set of online memory ranges disregarding the "usable limit".
Fixes: bf64f0517e ("s390/mem_detect: handle online memory limit just once")
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
RDP instruction allows to reset DAT-protection bit in a PTE, with less
CPU synchronization overhead than IPTE instruction. In particular, IPTE
can cause machine-wide synchronization overhead, and excessive IPTE usage
can negatively impact machine performance.
RDP can be used instead of IPTE, if the new PTE only differs in SW bits
and _PAGE_PROTECT HW bit, for PTE protection changes from RO to RW.
SW PTE bit changes are allowed, e.g. for dirty and young tracking, but none
of the other HW-defined part of the PTE must change. This is because the
architecture forbids such changes to an active and valid PTE, which
is why invalidation with IPTE is always used first, before writing a new
entry.
The RDP optimization helps mainly for fault-driven SW dirty-bit tracking.
Writable PTEs are initially always mapped with HW _PAGE_PROTECT bit set,
to allow SW dirty-bit accounting on first write protection fault, where
the DAT-protection would then be reset. The reset is now done with RDP
instead of IPTE, if RDP instruction is available.
RDP cannot always guarantee that the DAT-protection reset is propagated
to all CPUs immediately. This means that spurious TLB protection faults
on other CPUs can now occur. For this, common code provides a
flush_tlb_fix_spurious_fault() handler, which will now be used to do a
CPU-local TLB flush. However, this will clear the whole TLB of a CPU, and
not just the affected entry. For more fine-grained flushing, by simply
doing a (local) RDP again, flush_tlb_fix_spurious_fault() would need to
also provide the PTE pointer.
Note that spurious TLB protection faults cannot really be distinguished
from racing pagetable updates, where another thread already installed the
correct PTE. In such a case, the local TLB flush would be unnecessary
overhead, but overall reduction of CPU synchronization overhead by not
using IPTE is still expected to be beneficial.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
arch_cpu_idle_time() returns the idle time of any given cpu if it is in
idle, or zero if not. All if this is racy and partially incorrect. Time
stamps taken with store clock extended and store clock fast from different
cpus are compared, while the architecture states that this is nothing which
can be relied on (see Principles of Operation; Chapter 4, "Setting and
Inspecting the Clock").
A more fundamental problem is that the timestamp when a cpu is leaving idle
is taken early in the assembler part of the interrupt handler, and this
value is only transferred many cycles later to the cpu's per-cpu idle data
structure.
This per cpu data structure is read by arch_cpu_idle() to tell for which
period of time a remote cpu is idle: if only an idle_enter value is
present, the assumed idle time of the cpu is calculated by taking a local
timestamp and returning the difference of the local timestamp and the
idle_enter value. This is potentially incorrect, since the remote cpu may
have already left idle, but the taken timestamp may not have been
transferred to the per-cpu data structure. This in turn means that too much
idle time may be reported for a cpu, and a subsequent calculation of system
idle time may result in a smaller value.
Instead of coming up with even more complex code trying to fix this, just
remove this code, and only account idle time of a cpu, after idle state is
left.
Another minor bug is that it is assumed that timestamps are non-zero, which
is not necessarily the case for timestamps taken with store clock
fast. This however is just a very minor problem, since this can only happen
when the epoch increases.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is no reason to do idle time accounting in arch_cpu_idle().
Do idle time accounting in account_idle_time_irq(), where it belongs
to. The accounted values don't change between account_idle_time_irq() and
arch_cpu_idle(); so the result is the same.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Introduce mem_detect_truncate() to cut any online memory ranges above
established identity mapping size, so that mem_detect users wouldn't
have to do it over and over again.
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of this sparse warning:
arch/s390/kernel/diag.c:69:29: warning: symbol '__diag8c_tmp_amode31' was not declared. Should it be static?
Fixes: fbaee7464f ("s390/tty3270: add support for diag 8c")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Compiling the kernel with CONFIG_KPROBES disabled, but CONFIG_RETHOOK
enabled, results in this sparse warning:
arch/s390/kernel/rethook.c:26:15: warning: no previous prototype for 'arch_rethook_trampoline_callback' [-Wmissing-prototypes]
26 | unsigned long arch_rethook_trampoline_callback(struct pt_regs *regs)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Add a local rethook header file similar to riscv to address this.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 1a280f48c0 ("s390/kprobes: replace kretprobe with rethook")
Link: https://lore.kernel.org/all/202302030102.69dZIuJk-lkp@intel.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the current code each reipl type implements its own pair of loadparm
show/store functions. Add a macro to deduplicate the code a bit.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Fixes: 87fd22e0ae ("s390/ipl: add eckd support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
SDc/Y/c=
=zpaD
-----END PGP SIGNATURE-----
Merge tag 'v6.2-rc6' into sched/core, to pick up fixes
Pick up fixes before merging another batch of cpuidle updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
points out that there is a linking phase flag added to CFLAGS, which
will only be used for compiling
clang-16: error: argument unused during compilation: '-shared' [-Werror,-Wunused-command-line-argument]
'-shared' is already present in ldflags-y so it can just be dropped.
Fixes: 2b2a25845d ("s390/vdso: Use $(LD) instead of $(CC) to link vDSO")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When clang's -Qunused-arguments is dropped from KBUILD_CPPFLAGS, it
warns:
clang-16: error: argument unused during compilation: '-s' [-Werror,-Wunused-command-line-argument]
The compiler's '-s' flag is a linking option (it is passed along to the
linker directly), which means it does nothing when the linker is not
invoked by the compiler. The kernel builds all .o files with '-c', which
stops the compilation pipeline before linking, so '-s' can be safely
dropped from KBUILD_AFLAGS_64.
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
When debugging vmlinux with QEMU + GDB, the following GDB error may
occur:
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint -1.
Cannot access memory at address 0xffffffffffff95c0
Command aborted.
(gdb)
The reason is that, when .interp section is present, GDB tries to
locate the file specified in it in memory and put a number of
breakpoints there (see enable_break() function in gdb/solib-svr4.c).
Sometimes GDB finds a bogus location that matches its heuristics,
fails to set a breakpoint and stops. This makes further debugging
impossible.
The .interp section contains misleading information anyway (vmlinux
does not need ld.so), so fix by discarding it.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Simplify the use of constants PMC_INIT and PMC_RELEASE.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
With no in-kernel user, the source files can be merged.
Move all functions and the variable definitions to file perf_cpum_cf.c
This file now contains all the necessary functions and definitions
for the CPU Measurement counter facility device driver.
The files cpu_mcf.h and perf_cpum_cf_common.c are deleted.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 17bebcc68e ("s390/cpum_cf: Add minimal in-kernel interface for
counter measurements") introduced a small in-kernel interface for CPU
Measurement counter facility.
There are no users of this interface, therefore remove it.
The following functions are removed:
kernel_cpumcf_alert(),
kernel_cpumcf_begin(),
kernel_cpumcf_end(),
kernel_cpumcf_avail()
there is no need for them anymore.
With the removal of function kernel_cpumcf_alert(), also remove
member alert in struct cpu_cf_events. Its purpose was to counter
measurement alert interrupts for the in-kernel interface.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Function stccm_avail() is defined in a header file and the
only user is one single source file. Move this function to the source
file where it is also used and remove it from the header file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Function cpum_cf_ctrset_size() is defined in one source file and the
only user is in another source file. Move this function to the source
file where it is used and remove its prototype from the header file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
To remove an event from the CPU Measurement counter facility
use the lock/unlock scheme as done in event creation. Remove
the atomic_add_unless function to make the code easier.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
That's an adaptation of commit f3a112c0c4 ("x86,rethook,kprobes:
Replace kretprobe with rethook on x86") to s390.
Replaces the kretprobe code with rethook on s390. With this patch,
kretprobe on s390 uses the rethook instead of kretprobe specific
trampoline code.
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The CPU Measurement Sampling Facility (CPUM_SF) installs large buffers
to save samples collected by hardware. These buffers are organized
as Sample Data Buffer Tables (SDBT) and Sample Data Buffers (SDB).
SDBs contain the samples which are extracted and saved in the perf
ring buffer.
The SDBTs are chained using real addresses and refer to SDBs using
real addresses.
The diagnostic sampling setup uses buffers provided by the process
which invokes perf_event_open system call. The buffers are memory
mapped. The buffers have been allocated by the kernel event
subsystem.
Add proper virtual to phyiscal address translation to the buffer
chaining. The current constraint which requires virtual equals
real address layout is removed.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Macro AUX_SDB_NUM() has three parameters. The first one is not used.
Remove the first parameter. Also convert the macros to inline functions.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The CPU Measurement Sampling Facility (CPUM_SF) installs large buffers
to save samples collected by hardware. These buffers are organized
as Sample Data Buffer Tables (SDBT) and Sample Data Buffers (SDB).
SDBs contain the samples which are extracted and saved in the perf
ring buffer.
The SDBTs are chained using real addresses and refer to SDBs using
real addresses.
Adds proper virtual to phyiscal address translation to the buffer
chaining. The current constraint which requires virtual equals real
address layout is removed.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove debug statements from function setup_pmc_cpu().
The debug statement displays a pointer value to a per cpu variable.
This pointer value is printed nowhere else, so it has no use for
cross reference.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Some inline helper functions are defined in a header file but used
in only one source file. Move these functions to the source file.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas. While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma. In addition, the mmu notification call within zap_page
range does not correctly handle ranges that span multiple vmas. When
crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
the new vma should be made.
Instead of fixing zap_page_range, do the following:
- Create a new routine zap_vma_pages() that will remove all pages within
the passed vma. Most users of zap_page_range pass the entire vma and
can use this new routine.
- For callers of zap_page_range not passing the entire vma, instead call
zap_page_range_single().
- Remove zap_page_range.
[1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Factor out perf_prepare_header() so that it can call
perf_prepare_sample() without a header if not needed.
Also it checks the filtered_sample_type to avoid duplicate
work when perf_prepare_sample() is called twice (or more).
Suggested-by: Peter Zijlstr <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size. To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
GCC 11.1.0 and 11.2.0 generate a wrong warning when compiling the
kernel e.g. with allmodconfig:
arch/s390/kernel/setup.c: In function ‘setup_lowcore_dat_on’:
./include/linux/fortify-string.h:57:33: error: ‘__builtin_memcpy’ reading 128 bytes from a region of size 0 [-Werror=stringop-overread]
...
arch/s390/kernel/setup.c:526:9: note: in expansion of macro ‘memcpy’
526 | memcpy(abs_lc->cregs_save_area, S390_lowcore.cregs_save_area,
| ^~~~~~
This could be addressed by using absolute_pointer() with the
S390_lowcore macro, but this is not a good idea since this generates
worse code for performance critical paths.
Therefore simply use a for loop to copy the array in question and get
rid of the warning.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move __amode31_base declaration to proper header file to get rid of
arch/s390/boot/startup.c:24:15:
warning: symbol '__amode31_base' was not declared. Should it be static?
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move Absolute Lowcore Area allocation to the decompressor.
As result, get_abs_lowcore() and put_abs_lowcore() access
brackets become really straight and do not require complex
execution context analysis and LAP and interrupts tackling.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move Real Memory Copy Area allocation to the decompressor.
As result, memcpy_real() and memcpy_real_iter() movers
become usable since the very moment the kernel starts.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The setup of the kernel virtual address space is spread
throughout the sources, boot stages and config options
like this:
1. The available physical memory regions are queried
and stored as mem_detect information for later use
in the decompressor.
2. Based on the physical memory availability the virtual
memory layout is established in the decompressor;
3. If CONFIG_KASAN is disabled the kernel paging setup
code populates kernel pgtables and turns DAT mode on.
It uses the information stored at step [1].
4. If CONFIG_KASAN is enabled the kernel early boot
kasan setup populates kernel pgtables and turns DAT
mode on. It uses the information stored at step [1].
The kasan setup creates early_pg_dir directory and
directly overwrites swapper_pg_dir entries to make
shadow memory pages available.
Move the kernel virtual memory setup to the decompressor
and start the kernel with DAT turned on right from the
very first istruction. That completely eliminates the
boot phase when the kernel runs in DAT-off mode, simplies
the overall design and consolidates pgtables setup.
The identity mapping is created in the decompressor, while
kasan shadow mappings are still created by the early boot
kernel code.
Share with decompressor the existing kasan memory allocator.
It decreases the size of a newly requested memory block from
pgalloc_pos and ensures that kernel image is not overwritten.
pgalloc_low and pgalloc_pos pointers are made preserved boot
variables for that.
Use the bootdata infrastructure to setup swapper_pg_dir
and invalid_pg_dir directories used by the kernel later.
The interim early_pg_dir directory established by the
kasan initialization code gets eliminated as result.
As the kernel runs in DAT-on mode only the PSW_KERNEL_BITS
define gets PSW_MASK_DAT bit by default. Additionally, the
setup_lowcore_dat_off() and setup_lowcore_dat_on() routines
get merged, since there is no DAT-off mode stage anymore.
The memory mappings are created with RW+X protection that
allows the early boot code setting up all necessary data
and services for the kernel being booted. Just before the
paging is enabled the memory protection is changed to
RO+X for text, RO+NX for read-only data and RW+NX for
kernel data and the identity mapping.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit ada1da31ce ("s390/sclp: sort out physical vs
virtual pointers usage") fixed the notion of virtual
address for sclp_early_sccb pointer. However, it did
not take into account that kasan_early_init() can also
output messages and sclp_early_sccb should be adjusted
by the time kasan_early_init() is called.
Currently it is not a problem, since virtual and physical
addresses on s390 are the same. Nevertheless, should they
ever differ, this would cause an invalid pointer access.
Fixes: ada1da31ce ("s390/sclp: sort out physical vs virtual pointers usage")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.
However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.
Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
Idle code is very like entry code in that RCU isn't available. As
such, add a little validation.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.373461409@infradead.org
The current cmpxchg_double() loops within the perf hw sampling code do not
have READ_ONCE() semantics to read the old value from memory. This allows
the compiler to generate code which reads the "old" value several times
from memory, which again allows for inconsistencies.
For example:
/* Reset trailer (using compare-double-and-swap) */
do {
te_flags = te->flags & ~SDB_TE_BUFFER_FULL_MASK;
te_flags |= SDB_TE_ALERT_REQ_MASK;
} while (!cmpxchg_double(&te->flags, &te->overflow,
te->flags, te->overflow,
te_flags, 0ULL));
The compiler could generate code where te->flags used within the
cmpxchg_double() call may be refetched from memory and which is not
necessarily identical to the previous read version which was used to
generate te_flags. Which in turn means that an incorrect update could
happen.
Fix this by adding READ_ONCE() semantics to all cmpxchg_double()
loops. Given that READ_ONCE() cannot generate code on s390 which atomically
reads 16 bytes, use a private compare-and-swap-double implementation to
achieve that.
Also replace cmpxchg_double() with the private implementation to be able to
re-use the old value within the loops.
As a side effect this converts the whole code to only use bit fields
to read and modify bits within the hws trailer header.
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Thomas Richter <tmricht@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/linux-s390/Y71QJBhNTIatvxUT@osiris/T/#ma14e2a5f7aa8ed4b94b6f9576799b3ad9c60f333
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This commit addresses the following erroneous situation with file-based
kdump executed on a system with a valid IPL report.
On s390, a kdump kernel, its initrd and IPL report if present are loaded
into a special and reserved on boot memory region - crashkernel. When
a system crashes and kdump was activated before, the purgatory code
is entered first which swaps the crashkernel and [0 - crashkernel size]
memory regions. Only after that the kdump kernel is entered. For this
reason, the pointer to an IPL report in lowcore must point to the IPL report
after the swap and not to the address of the IPL report that was located in
crashkernel memory region before the swap. Failing to do so, makes the
kdump's decompressor try to read memory from the crashkernel memory region
which already contains the production's kernel memory.
The situation described above caused spontaneous kdump failures/hangs
on systems where the Secure IPL is activated because on such systems
an IPL report is always present. In that case kdump's decompressor tried
to parse an IPL report which frequently lead to illegal memory accesses
because an IPL report contains addresses to various data.
Cc: <stable@vger.kernel.org>
Fixes: 99feaa717e ("s390/kexec_file: Create ipl report and pass to next kernel")
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The current code uses diag210 to infer the 3270 geometry from the
model number when running on z/VM. This doesn't work well as almost
all 3270 software clients report as 3279-2 with a custom resolution.
tty3270 assumes it has a 80x24 terminal connected because of the -2
suffix. Use diag 8c to fetch the realy geometry from z/VM.
Note that this doesn't allow dynamic resizing, i.e. reconnecting to
a z/VM session with a different geometry.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
CPU Measurement counting facility events PROBLEM_STATE_CPU_CYCLES(32)
and PROBLEM_STATE_INSTRUCTIONS(33) are valid events. However the device
driver returns error -EOPNOTSUPP when these event are to be installed.
Fix this and allow installation of events PROBLEM_STATE_CPU_CYCLES,
PROBLEM_STATE_CPU_CYCLES:u, PROBLEM_STATE_INSTRUCTIONS and
PROBLEM_STATE_INSTRUCTIONS:u.
Kernel space counting only is still not supported by s390.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Nathan Chancellor reports that the s390 vmlinux fails to link with
GNU ld < 2.36 since commit 99cb0d917f ("arch: fix broken BuildID
for arm64 and riscv").
It happens for defconfig, or more specifically for CONFIG_EXPOLINE=y.
$ s390x-linux-gnu-ld --version | head -n1
GNU ld (GNU Binutils for Debian) 2.35.2
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- allnoconfig
$ ./scripts/config -e CONFIG_EXPOLINE
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- olddefconfig
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu-
`.exit.text' referenced in section `.s390_return_reg' of drivers/base/dd.o: defined in discarded section `.exit.text' of drivers/base/dd.o
make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
make: *** [Makefile:1252: vmlinux] Error 2
arch/s390/kernel/vmlinux.lds.S wants to keep EXIT_TEXT:
.exit.text : {
EXIT_TEXT
}
But, at the same time, EXIT_TEXT is thrown away by DISCARD because
s390 does not define RUNTIME_DISCARD_EXIT.
I still do not understand why the latter wins after 99cb0d917f,
but defining RUNTIME_DISCARD_EXIT seems correct because the comment
line in arch/s390/kernel/vmlinux.lds.S says:
/*
* .exit.text is discarded at runtime, not link time,
* to deal with references from __bug_table
*/
Nathan also found that binutils commit 21401fc7bf67 ("Duplicate output
sections in scripts") cured this issue, so we cannot reproduce it with
binutils 2.36+, but it is better to not rely on it.
Fixes: 99cb0d917f ("arch: fix broken BuildID for arm64 and riscv")
Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20230105031306.1455409-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Symbols _edata and _end in the linker script are the
only unaligned expicitly on page boundary. Although
_end is aligned implicitly by BSS_SECTION macro that
is still inconsistent and could lead to a bug if a tool
or function would assume that _edata is as aligned as
others.
For example, vmem_map_init() function does not align
symbols _etext, _einittext etc. Should these symbols
be unaligned as well, the size of ranges to update
were short on one page.
Instead of fixing every occurrence of this kind in the
code and external tools just force the alignment on
these two symbols.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The archs that use cputime_to_nsecs() internally provide their own
definition and don't need the fallback. cputime_to_usecs() unused except
in this fallback, and is not defined anywhere.
This removes the final remnant of the cputime_t code from the kernel.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
The <asm/archrandom.h> header is a random.c private detail, not
something to be called by other code. As such, don't make it
automatically available by way of random.h.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
* Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
* Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
"Fix a number of issues with MTE, such as races on the tags being
initialised vs the PG_mte_tagged flag as well as the lack of support
for VM_SHARED when KVM is involved. Patches from Catalin Marinas and
Peter Collingbourne").
* Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
* Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
* Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
* Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
* Second batch of the lazy destroy patches
* First batch of KVM changes for kernel virtual != physical address support
* Removal of a unused function
x86:
* Allow compiling out SMM support
* Cleanup and documentation of SMM state save area format
* Preserve interrupt shadow in SMM state save area
* Respond to generic signals during slow page faults
* Fixes and optimizations for the non-executable huge page errata fix.
* Reprogram all performance counters on PMU filter change
* Cleanups to Hyper-V emulation and tests
* Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
running on top of a L1 Hyper-V hypervisor)
* Advertise several new Intel features
* x86 Xen-for-KVM:
** Allow the Xen runstate information to cross a page boundary
** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
** Add support for 32-bit guests in SCHEDOP_poll
* Notable x86 fixes and cleanups:
** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
** Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
** Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
** Advertise (on AMD) that the SMM_CTL MSR is not supported
** Remove unnecessary exports
Generic:
* Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
* Add support for pinning vCPUs in dirty_log_perf_test.
* Rename the so called "perf_util" framework to "memstress".
* Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress tests.
* Add a common ucall implementation; code dedup and pre-work for running
SEV (and beyond) guests in selftests.
* Provide a common constructor and arch hook, which will eventually be
used by x86 to automatically select the right hypercall (AMD vs. Intel).
* A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
breakpoints, stage-2 faults and access tracking.
* x86-specific selftest changes:
** Clean up x86's page table management.
** Clean up and enhance the "smaller maxphyaddr" test, and add a related
test to cover generic emulation failure.
** Clean up the nEPT support checks.
** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
** Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
Documentation:
* Remove deleted ioctls from documentation
* Clean up the docs for the x86 MSR filter.
* Various fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
=QAYX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
- Thoroughly rewrite the data structures that implement perf task context handling,
with the goal of fixing various quirks and unfeatures both in already merged,
and in upcoming proposed code.
The old data structure is the per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and
a single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology discovery
on various hardware models.
- Misc fixes & cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXjuURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j+VhAAknimsLwenTHCGQp7yqsWSKfBr9KI2UgD
ZgtQuuwRwSzwqAEwC5Mt6zcIkxRNhU1ookFPqQbpY3XA0W4aNakUk8bDF8QIEKW0
MFWxn7PtReWqKcUay2oEGGurqZ5OtfpljJGxigQh5oVeMGc+itIwHF2JefeyoRnu
pq7R2qDgOBb7Np4lWTdqXGmKufzp04/nely2IZQBO8x80cGRZiKQIrGrch6vLUf7
3iEz9rwmvPyz0aczYSpa/duEZDMLm4lWNK4oMUEXuUWC8gU7CUzBJsJ3AS5NgxAu
yGBXe/s7GHqwtc/F30l5gK/J5WAyK83IF7sckxTj0dBUpyC6wQwwYPm8BaCAMoqN
X6mU7Ve938Siih1TyOBZfZsrtDDILhV2N/nku2erb3iqes26u0RcT25rWtu9Yqvn
hm4Gm6cmkHWq4EOHSBvAdC7l7lDZ3fyVI5+8nN9ly9Qv867HjG70dvIr9iEEolpX
rhFAz8r/NwTXhDY0AmFZcOkrnNV3IuHtibJ/9wJlgJNqDPqN12Wxqdzy0Nj3HH6G
EsukBO05cWaDS0gB8MpaO6Q6YtqAr87ZY+afHDBwcfkME50/CyBLr5rd47dTR+Ip
B+zreYKcaNHdEMd1A9KULRTTDnEjlXYMwjVVJiPRV0jcmA3dHmM46HN5Ae9NdO6+
R2BAWv9XR6M=
=KNaI
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Thoroughly rewrite the data structures that implement perf task
context handling, with the goal of fixing various quirks and
unfeatures both in already merged, and in upcoming proposed code.
The old data structure is the per task and per cpu
perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and a
single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology
discovery on various hardware models.
- Misc fixes & cleanups
* tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
perf/x86/intel/uncore: Fix reference count leak in __uncore_imc_init_box()
perf/x86/intel/uncore: Fix reference count leak in snr_uncore_mmio_map()
perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox()
perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()
perf/x86/intel/uncore: Make set_mapping() procedure void
perf/x86/intel/uncore: Update sysfs-devices-mapping file
perf/x86/intel/uncore: Enable UPI topology discovery for Sapphire Rapids
perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server
perf/x86/intel/uncore: Get UPI NodeID and GroupID
perf/x86/intel/uncore: Enable UPI topology discovery for Skylake Server
perf/x86/intel/uncore: Generalize get_topology() for SKX PMUs
perf/x86/intel/uncore: Disable I/O stacks to PMU mapping on ICX-D
perf/x86/intel/uncore: Clear attr_update properly
perf/x86/intel/uncore: Introduce UPI topology type
perf/x86/intel/uncore: Generalize IIO topology support
perf/core: Don't allow grouping events from different hw pmus
perf/amd/ibs: Make IBS a core pmu
perf: Fix function pointer case
perf/x86/amd: Remove the repeated declaration
perf: Fix possible memleak in pmu_dev_alloc()
...
- Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and the
work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown
timer. Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync() should
be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding rearm
attempts silently. A warning for the case that a rearm attempt of a
shutdown timer is detected would not be really helpful because it's
entirely unclear how it should be acted upon. The only way to address
such a case is to add 'if (in_shutdown)' conditionals all over the
place. This is error prone and in most cases of teardown not required
all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
- Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUuC0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodpZD/9kCDi009n65QFF1J4kE5aZuABbRMtO
7sy66fJpDyB/MtcbPPH29uzQUEs1VMTQVB+ZM+7e1YGoxSWuSTzeoFH+yK1w4tEZ
VPbOcvUEjG0esKUehwYFeOjSnIjy6M1Y41aOUaDnq00/azhfTrzLxQA1BbbFbkpw
S7u2hllbyRJ8KdqQyV9cVpXmze6fcpdtNhdQeoA7qQCsSPnJ24MSpZ/PG9bAovq8
75IRROT7CQRd6AMKAVpA9Ov8ak9nbY3EgQmoKcp5ZXfXz8kD3nHky9Lste7djgYB
U085Vwcelt39V5iXevDFfzrBYRUqrMKOXIf2xnnoDNeF5Jlj5gChSNVZwTLO38wu
RFEVCjCjuC41GQJWSck9LRSYdriW/htVbEE8JLc6uzUJGSyjshgJRn/PK4HjpiLY
AvH2rd4rAap/rjDKvfWvBqClcfL7pyBvavgJeyJ8oXyQjHrHQwapPcsMFBm0Cky5
soF0Lr3hIlQ9u+hwUuFdNZkY9mOg09g9ImEjW1AZTKY0DfJMc5JAGjjSCfuopVUN
Uf/qqcUeQPSEaC+C9xiFs0T3svYFxBqpgPv4B6t8zAnozon9fyZs+lv5KdRg4X77
qX395qc6PaOSQlA7gcxVw3vjCPd0+hljXX84BORP7z+uzcsomvIH1MxJepIHmgaJ
JrYbSZ5qzY5TTA==
=JlDe
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timers, timekeeping and drivers:
Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and
the work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown timer.
Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync()
should be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding
rearm attempts silently.
A warning for the case that a rearm attempt of a shutdown timer is
detected would not be really helpful because it's entirely unclear
how it should be acted upon. The only way to address such a case is
to add 'if (in_shutdown)' conditionals all over the place. This is
error prone and in most cases of teardown not required all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place"
* tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
dt-bindings: timer: renesas,cmt: Add r8a779g0 CMT support
dt-bindings: timer: renesas,tmu: Add r8a779g0 support
clocksource/drivers/arm_arch_timer: Use kstrtobool() instead of strtobool()
clocksource/drivers/timer-ti-dm: Fix missing clk_disable_unprepare in dmtimer_systimer_init_clock()
clocksource/drivers/timer-ti-dm: Clear settings on probe and free
clocksource/drivers/timer-ti-dm: Make timer_get_irq static
clocksource/drivers/timer-ti-dm: Fix warning for omap_timer_match
clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error
clocksource/drivers/timer-npcm7xx: Enable timer 1 clock before use
dt-bindings: timer: nuvoton,npcm7xx-timer: Allow specifying all clocks
dt-bindings: timer: rockchip: Add rockchip,rk3128-timer
clockevents: Repair kernel-doc for clockevent_delta2ns()
clocksource/drivers/ingenic-ost: Define pm functions properly in platform_driver struct
clocksource/drivers/sh_cmt: Access registers according to spec
vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy
Bluetooth: hci_qca: Fix the teardown problem for real
timers: Update the documentation to reflect on the new timer_shutdown() API
timers: Provide timer_shutdown[_sync]()
timers: Add shutdown mechanism to the internal functions
timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode
...
- Factor out handle_write() function and simplify 3215 console
write operation.
- When 3170 terminal emulator is connected to the 3215 console
driver the boot time could be very long due to limited buffer
space or missing operator input. Add con3215_drop command line
parameter and con3215_drop sysfs attribute file to instruct
the kernel drop console data when such conditions are met.
- Fix white space errors in 3215 console driver.
- Move enum paiext_mode definition to a header file and rename
it to paievt_mode to indicate this is now used for several
events. Rename PAI_MODE_COUNTER to PAI_MODE_COUNTING to make
consistent with PAI_MODE_SAMPLING.
- Simplify the logic of PMU pai_crypto mapped buffer reference
counter and make it consistent with PMU pai_ext.
- Rename PMU pai_crypto mapped buffer structure member users
to active_events to make it consistent with PMU pai_ext.
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option.
This results in saving of 12K per 1M hugetlb page (~1.2%)
and 32764K per 2G hugetlb page (~1.6%).
- Use generic serial.h, bugs.h, shmparam.h and vga.h header
files and scrap s390-specific versions.
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning
and provide sane CPU-to-node and CPU-to-CPU distance mappings
implementat a minimal version of setup_per_cpu_areas().
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes.
- Avoid unnecessary lookup of a pointer to MSI descriptor when
setting IRQ affinity for a PCI device.
- Get rid of "an incompatible function type cast" warning by
changing debug_sprintf_format_fn() function prototype so it
matches the debug_format_proc_t function type.
- Remove unused info_blk_hdr__pcpus() and get_page_state()
functions.
- Get rid of clang "unused unused insn cache ops function"
warning by moving s390_insn definition to a private header.
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE
is enabled.
- Use kstrobool() to parse sclp_con_drop parameter to make it
identical to the con3215_drop parameter and allow passing
values like "yes" and "true".
- Use sysfs_emit() for all SCLP sysfs show functions, which is
the current standard way to generate output strings.
- Make SCLP con_drop sysfs attribute also writable and allow to
change its value during runtime. This makes SCLP console drop
handling consistent with the 3215 device driver.
- Virtual and physical addresses are indentical on s390. However,
there is still a confusion when pointers are directly casted to
physical addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers.
- Support for power managemant has been removed from s390 since
quite some time. Remove unused power managemant code from the
appldata device driver.
- Allow memory tools like KASAN see memory accesses from the
checksum code. Switch to GENERIC_CSUM if KASAN is enabled,
just like x86 does.
- Add support of ECKD DASDs disks so it could be used as boot
and dump devices.
- Follow checkpatch recommendations and use octal values instead
of S_IRUGO and S_IWUSR for dump device attributes in sysfs.
- Changes to vx-insn.h do not cause a recompile of C files that
use asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include
header file to avoid this problem.
- Use vector instruction macros instead of byte patterns to
increase register validation routine readability.
- The current machine check register validation handling does not
take into account various scenarios and might lead to killing a
wrong user process or potentially ignore corrupted FPU registers.
Simplify logic of the machine check handler and stop the whole
machine if the previous context was kerenel mode. If the previous
context was user mode, kill the current task.
- Introduce sclp_emergency_printk() function which can be used to
emit a message in emergency cases. It is supposed to be used in
cases where regular console device drivers may not work anymore,
e.g. unrecoverable machine checks.
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation.
- In case a system will be stopped because of an unrecoverable
machine check error print the machine check interruption code
to give a hint of what went wrong.
- Move storage error checking from the assembly entry code to C
in order to simplify machine check handling. Enter the handler
with DAT turned on, which simplifies the entry code even more.
- The machine check extended save areas are allocated using
a private "nmi_save_areas" slab cache which guarantees a
required power-of-two alignment. Get rid of that cache in
favour of kmalloc().
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY5ckrhccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8NlrAQD8NCLeEAkhGCRnzdTyngExCrzV
Mw//cEnksUkIPqalJgEArbyFjGh05ecNaiDQduH8Gh94/qOhGE4obMdTgMWq7QY=
=3aou
-----END PGP SIGNATURE-----
Merge tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Factor out handle_write() function and simplify 3215 console write
operation
- When 3170 terminal emulator is connected to the 3215 console driver
the boot time could be very long due to limited buffer space or
missing operator input. Add con3215_drop command line parameter and
con3215_drop sysfs attribute file to instruct the kernel drop console
data when such conditions are met
- Fix white space errors in 3215 console driver
- Move enum paiext_mode definition to a header file and rename it to
paievt_mode to indicate this is now used for several events. Rename
PAI_MODE_COUNTER to PAI_MODE_COUNTING to make consistent with
PAI_MODE_SAMPLING
- Simplify the logic of PMU pai_crypto mapped buffer reference counter
and make it consistent with PMU pai_ext
- Rename PMU pai_crypto mapped buffer structure member users to
active_events to make it consistent with PMU pai_ext
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option. This
results in saving of 12K per 1M hugetlb page (~1.2%) and 32764K per
2G hugetlb page (~1.6%)
- Use generic serial.h, bugs.h, shmparam.h and vga.h header files and
scrap s390-specific versions
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning and
provide sane CPU-to-node and CPU-to-CPU distance mappings implementat
a minimal version of setup_per_cpu_areas()
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes
- Avoid unnecessary lookup of a pointer to MSI descriptor when setting
IRQ affinity for a PCI device
- Get rid of "an incompatible function type cast" warning by changing
debug_sprintf_format_fn() function prototype so it matches the
debug_format_proc_t function type
- Remove unused info_blk_hdr__pcpus() and get_page_state() functions
- Get rid of clang "unused unused insn cache ops function" warning by
moving s390_insn definition to a private header
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE is
enabled
- Use kstrobool() to parse sclp_con_drop parameter to make it identical
to the con3215_drop parameter and allow passing values like "yes" and
"true"
- Use sysfs_emit() for all SCLP sysfs show functions, which is the
current standard way to generate output strings
- Make SCLP con_drop sysfs attribute also writable and allow to change
its value during runtime. This makes SCLP console drop handling
consistent with the 3215 device driver
- Virtual and physical addresses are indentical on s390. However, there
is still a confusion when pointers are directly casted to physical
addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers
- Support for power managemant has been removed from s390 since quite
some time. Remove unused power managemant code from the appldata
device driver
- Allow memory tools like KASAN see memory accesses from the checksum
code. Switch to GENERIC_CSUM if KASAN is enabled, just like x86 does
- Add support of ECKD DASDs disks so it could be used as boot and dump
devices
- Follow checkpatch recommendations and use octal values instead of
S_IRUGO and S_IWUSR for dump device attributes in sysfs
- Changes to vx-insn.h do not cause a recompile of C files that use
asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include header
file to avoid this problem
- Use vector instruction macros instead of byte patterns to increase
register validation routine readability
- The current machine check register validation handling does not take
into account various scenarios and might lead to killing a wrong user
process or potentially ignore corrupted FPU registers. Simplify logic
of the machine check handler and stop the whole machine if the
previous context was kerenel mode. If the previous context was user
mode, kill the current task
- Introduce sclp_emergency_printk() function which can be used to emit
a message in emergency cases. It is supposed to be used in cases
where regular console device drivers may not work anymore, e.g.
unrecoverable machine checks
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation
- In case a system will be stopped because of an unrecoverable machine
check error print the machine check interruption code to give a hint
of what went wrong
- Move storage error checking from the assembly entry code to C in
order to simplify machine check handling. Enter the handler with DAT
turned on, which simplifies the entry code even more
- The machine check extended save areas are allocated using a private
"nmi_save_areas" slab cache which guarantees a required power-of-two
alignment. Get rid of that cache in favour of kmalloc()
* tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (38 commits)
s390/nmi: get rid of private slab cache
s390/nmi: move storage error checking back to C, enter with DAT on
s390/nmi: print machine check interruption code before stopping system
s390/sclp: introduce sclp_emergency_printk()
s390/sclp: keep sclp_early_sccb
s390/nmi: rework register validation handling
s390/nmi: use vector instruction macros instead of byte patterns
s390/vx: add vx-insn.h wrapper include file
s390/ipl: use octal values instead of S_* macros
s390/ipl: add eckd dump support
s390/ipl: add eckd support
vfio/ccw: identify CCW data addresses as physical
vfio/ccw: sort out physical vs virtual pointers usage
s390/checksum: support GENERIC_CSUM, enable it for KASAN
s390/appldata: remove power management callbacks
s390/cio: sort out physical vs virtual pointers usage
s390/sclp: allow to change sclp_console_drop during runtime
s390/sclp: convert to use sysfs_emit()
s390/sclp: use kstrobool() to parse sclp_con_drop parameter
s390/3270: make raw3270_state_final() depend on CONFIG_TN3270_CONSOLE
...
Get rid of private "nmi_save_areas" slab cache. The only reason this was
introduced years ago was that with some slab debugging options allocations
would only guarantee a minimum alignment of ARCH_KMALLOC_MINALIGN, which
was eight bytes back then. This is not sufficient for the extended machine
check save area.
However since commit 59bb47985c ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)") kmalloc guarantees a power-of-two
alignment even with debugging options enabled.
Therefore the private slab cache can be removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Checking for storage errors in machine check entry code was done in order
to handle also storage errors on kernel page tables. However this is
extremely unlikely and some basic assumptions what works on machine check
entry are necessary anyway. In order to simplify machine check handling
delay checking for storage errors to C code.
With this also change the machine check new PSW to have DAT on, which
simplifies the entry code even further.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In case a system will be stopped because of e.g. missing validity bits
print the machine check interruption code before the system is stopped.
This is helpful, since up to now no message was printed in such a
case. Only a disabled wait PSW was loaded, which doesn't give a hint of
what went wrong.
Improve this by printing a message with debug information.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
If a machine check happens in kernel mode, and the machine check
interruption code indicates that e.g. vector register contents in the
machine check area are not valid, the logic is to kill current.
The idea behind this was that if within kernel context vector
registers are not used then it is sufficient to kill the current user
space process to avoid that it continues with potentially corrupt
register contents. This however does not necessarily work, since the
current code does not take into account that a machine check can also
happen when a kernel thread is running (= no user space context), and
in addition there is no way to distinguish between the "previous" and
"next" user process task, if the machine check happens when a task
switch happens.
Given that machine checks with invalid saved register contents in the
machine check save area are extremely rare, simplify the logic: if
register contents are invalid and the previous context was kernel
mode, stop the whole machine. If the previous context was user mode,
kill the corresponding task.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use vector instruction macros instead of byte patterns to increase
readability. The generated code is nearly identical:
- 1e8: e7 0f 10 00 00 36 vlm %v0,%v15,0(%r1)
- 1ee: e7 0f 11 00 0c 36 vlm %v16,%v31,256(%r1)
+ 1e8: e7 0f 10 00 30 36 vlm %v0,%v15,0(%r1),3
+ 1ee: e7 0f 11 00 3c 36 vlm %v16,%v31,256(%r1),3
By using the VLM macro the alignment hint is automatically specified
too. Even though from a performance perspective it doesn't matter at
all for the machine check code, this shows yet another benefit when
using the macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The vector instruction macros can also be used in inline assemblies. For
this the magic
asm(".include \"asm/vx-insn.h\"\n");
must be added to C files in order to avoid that the pre-processor
eliminates the __ASSEMBLY__ guarded macros. This however comes with the
problem that changes to asm/vx-insn.h do not cause a recompile of C files
which have only this magic statement instead of a proper include statement.
This can be observed with the arch/s390/kernel/fpu.c file.
In order to fix this problem and also to avoid that the include must
be specified twice, add a wrapper include header file which will do
all necessary steps.
This way only the vx-insn.h header file needs to be included and changes to
the new vx-insn-asm.h header file cause a recompile of all dependent files
like it should.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
octal values are easier to read and checkpatch also recommends
to use them, so replace all the S_* macros with their counterparts.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to use ECKD disks as dump device
to linux. The new dump type is called 'eckd_dump', parameters
are the same as for eckd ipl.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to IPL from ECKD DASDs to linux.
It introduces a few sysfs files in /sys/firmware/reipl/eckd:
bootprog: the boot program selector
clear: whether to issue a diag308 LOAD_NORMAL or LOAD_CLEAR
device: the device to ipl from
br_chr: Cylinder/Head/Record number to read the bootrecord from.
Might be '0' or 'auto' if it should be read from the
volume label.
scpdata: data to be passed to the ipl'd program.
The new ipl type is called 'eckd'.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmN/eYwACgkQ41TmuOI4
ufjoxA/9Et38aXO/IhmUt8v0QhA4yec+sc5GSFfQSYehej/1Vqhw0DXx+ORUiRgg
+rtiXJSSqkuD2dL+BDffY2xoul6nzNdVf4AbkcnrWscfWr6xwVYlPvuL0ymGI6J2
U/IPedRoKw0bHw/wHs05yV5PubrRwDFERKhtyXWYGbPJhX0w2n3IFOoKH1oWBhLW
Dc8jEs6t3gDbJ71Er0xoeBUoiuu+PgZG06cpOvzBZ0KclRgjADXyISqqk8/4mu8w
R+/Wf8NcrbQYV1jfCeq5zIsKC8uvnFj25UuyTLumn5vh+dNNsvE72Khe4tz7LI0I
ZPZ+GZuemu7Yi12dKjw4Sw3ui0ejWH/5XL1SVB0X/xYIWrBqOot+Lq6538GCng+c
tJt+zsu64VFgXCCZ8O9qO4uE4DBL70H3ThT7VZxIghSTZtY0xh3uFc64f3/3d9dy
K4WTJHrmMxhXaA/rqtIa8I53JvFl8CztofZATiQQesyPuc7lZ01w1Co5el4xYaxe
YknyMTq11qf/iYqVOW7sjoWW/YRuuMZ4+FhpI3o/SllVdN98iTwkk1kP3wcoBO5P
bvzpm+WXHbv9OxifPrqkqv34+upbjfEmSogHudQzagBX4vl3rZRfBCdQGCAha0Uc
ZYyg68kiil5sWmHI/Ln/ZjANYfbS5sF0CreuWxnmqcwKl2NSN/E=
=/1yt
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The size of the TOD programmable field was incorrectly increased from
four to eight bytes with commit 1a2c5840ac ("s390/dump: cleanup CPU
save area handling").
This leads to an elf notes section NT_S390_TODPREG which has a size of
eight instead of four bytes in case of kdump, however even worse is
that the contents is incorrect: it is supposed to contain only the
contents of the TOD programmable field, but in fact contains a mix of
the TOD programmable field (32 bit upper bits) and parts of the CPU
timer register (lower 32 bits).
Fix this by simply changing the size of the todpreg field within the
save area structure. This will implicitly also fix the size of the
corresponding elf notes sections.
This also gets rid of this compile time warning:
in function ‘fortify_memcpy_chk’,
inlined from ‘save_area_add_regs’ at arch/s390/kernel/crash_dump.c:99:2:
./include/linux/fortify-string.h:413:25: error: call to ‘__read_overflow2_field’
declared with attribute warning: detected read beyond size of field
(2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 1a2c5840ac ("s390/dump: cleanup CPU save area handling")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an unused insn cache ops function:
arch/s390/kernel/kprobes.c:34:1:
error: unused function 'is_kprobe_s390_insn_slot' [-Werror,-Wunused-function]
DEFINE_INSN_CACHE_OPS(s390_insn);
^
./include/linux/kprobes.h:335:20: note: expanded from macro 'DEFINE_INSN_CACHE_OPS'
static inline bool is_kprobe_##__name##_slot(unsigned long addr) \
^
<scratch space>:88:1: note: expanded from here
is_kprobe_s390_insn_slot
^
Move the definition to a private header file, which is also similar to
the generic insn cache ops.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an incompatible function type cast:
CC arch/s390/kernel/debug.o
arch/s390/kernel/debug.c:142:2: error: cast from 'int (*)(debug_info_t *, struct debug_view *, char *, debug_sprintf_entry_t *)' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, debug_sprintf_entry_t *)') to 'debug_format_proc_t *' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, const char *)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
(debug_format_proc_t *)&debug_sprintf_format_fn,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Get rid of this warning by changing debug_sprintf_format_fn() so it matches
the debug_format_proc_t function type, and do the cast of the last
parameter within the function itself.
This is the standard way of handling such cases anyway.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Link: https://lore.kernel.org/all/cover.1667336095.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
If the appropriate UV feature bit is set, there is no need to perform
an export before import.
The misc feature indicates, among other things, that importing a shared
page from a different protected VM will automatically also transfer its
ownership.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-5-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-5-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).
Notably:
- HW breakpoint PMU
- ARM big.little PMU / Intel ADL PMU
- Intel Branch Monitoring PMU
- AMD IBS PMU
- S390 cpum_cf PMU
- PowerPC trace_imc PMU
*Current design:*
Currently we have a per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task related perf_event_context's have a
pointer back to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.
Each perf_event is then associated with its PMU and one
perf_event_context.
*Proposed design:*
New design proposed by this patch reduce to a single task context and
a single CPU context but adds some intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
With the new design, perf_event_context will hold all events for all
pmus in the (respective pinned/flexible) rbtrees. This can be achieved
by adding pmu to rbtree key:
{cpu, pmu, cgroup, group_index}
Each perf_event_context carries a list of perf_event_pmu_context which
is used to hold per-pmu-per-context state. For example, it keeps track
of currently active events for that pmu, a pmu specific task_ctx_data,
a flag to tell whether rotation is required or not etc.
Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu
state like hrtimer details to drive the event rotation, a pointer to
perf_event_pmu_context of currently running task and some other
ancillary information.
Each perf_event is associated to it's pmu, perf_event_context and
perf_event_pmu_context.
Further optimizations to current implementation are possible. For
example, ctx_resched() can be optimized to reschedule only single pmu
events.
Much thanks to Ravi for picking this up and pushing it towards
completion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
Commit 838d9bb62d ("perf: Use sample_flags for raw_data")
changed the way the raw data of an event is collected.
Adjust the PMU pai_ext to the new scheme.
Fixes: 838d9bb62d ("perf: Use sample_flags for raw_data")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename structure member users to active_events to make it consistent
with PMU pai_ext. Also use the same prefix syntax for increment and
decrement operators in both PMUs.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rework the mapped buffer reference count in PMU pai_crypto
to match the same technique as in PMU pai_ext.
This simplifies the logic.
Do not count the individual number of counter and sampling
processes. Remember the type of access and the total number of
references to the buffer.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move enum definition to header file. This is done in preparation
for a follow on patch where this enum will be used in another source
file.
Also change the enum name from paiext_mode to paievt_mode
to indicate this enum is now used for several events.
Make naming consistent and rename PAI_MODE_COUNTER to PAI_MODE_COUNTING.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Fix virtual vs physical address confusion (which currently are the
same).
sie_block is accessed in entry.S and passed it to hardware, which is why
both its physical and virtual address are needed. To avoid every caller
having to do the virtual-physical conversion, add a new function sie64a()
which converts the virtual address to physical.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20221020143159.294605-3-nrb@linux.ibm.com
Message-Id: <20221020143159.294605-3-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
simply use the get_random_{u8,u16}() functions, which are faster than
wasting the additional bytes from a 32-bit value. This was done by hand,
identifying all of the places where one of the random integer functions
was used in a non-32-bit context.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc. in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing particular
sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmM+4vcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGY2IQAInr0JUNnkkxwUSXtOcQuA3IK8RJ
FbU9HXJRoV9H+7+l3SMlN7mIbrs5eE5fTY3iwQ3CVe139d1+1q7nvTMRv8owywJx
GBgzswncuu1lk7iQQ//CxiqMwSCG8GJdYn1uDVy4I5jg3o+DtFZJtyq2Wb7pqsMm
ZhZ4PozRN+idYQJSF6Vx/zEVLHI7quMBwfe4CME8/0Kg2+hnYzbXV/aUf0ED2emq
zdCMDQgIOK5AhY+8qgMXKYnBUJMTqBp6LoR4p3ApfUkwRFY0sGa0/LK3U/B22OE7
uWyR4fCUExGyerlcHEVev+9eBfmsLLPyqlchNwpSDOPf5OSdnKmgqJEBR/Cvx0eh
URerPk7EHxyH3G8yi+cU2GtofNTGc5RHPRgJE2ADsQEi5TAUKGmbXMlsFRL/51Vn
lTANZObBNa1d4enljF6TfTL5nuccOa+DKvXnH9fQ49t0QdtSikv6J/lGwilwm1Sr
BctmCsySPuURZfkpI9OQnLuouloMXl9f7Q/+S39haS/tSgvPpyITyO71nxDnXn/s
BbFObZJUk9QkqOACjBP1hNErTLt83uBxQ9z+rDCw/SbLIe4nw0wyneuygfHI5rI8
3RZB2DbGauuJHX2Zs6YGS14SLSY33IsLqKR1/Vy3LrPvOHuEvNiOR8LITq5E0YCK
OffK2Y5cIlXR0QWf
=DHiN
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in
kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing
particular sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
* tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
docs: bump minimal GNU Make version to 3.82
ia64: simplify esi object addition in Makefile
Revert "kbuild: Check if linker supports the -X option"
kbuild: rebuild .vmlinux.export.o when its prerequisite is updated
kbuild: move modules.builtin(.modinfo) rules to Makefile.vmlinux_o
zstd: Fixing mixed module-builtin objects
kallsyms: ignore __kstrtab_* and __kstrtabns_* symbols
kallsyms: take the input file instead of reading stdin
kallsyms: drop duplicated ignore patterns from kallsyms.c
kbuild: reuse mksysmap output for kallsyms
mksysmap: update comment about __crc_*
kbuild: remove head-y syntax
kbuild: use obj-y instead extra-y for objects placed at the head
kbuild: hide error checker logs for V=1 builds
kbuild: re-run modpost when it is updated
kbuild: unify two modpost invocations
kbuild: move vmlinux.o rule to the top Makefile
kbuild: move .vmlinux.objs rule to Makefile.modpost
kbuild: list sub-directories in ./Kbuild
Makefile.compiler: replace cc-ifversion with compiler-specific macros
...
- PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2)
feature support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information,
if available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on
AMD CPUs by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
- HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs and
thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot()
and fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/2pMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIMA/+J+MCEVTt9kwZeBtHoPX7iZ5gnq1+McoQ
f6ALX19AO/ZSuA7EBA3cS3Ny5eyGy3ofYUnRW+POezu9CpflLW/5N27R2qkZFrWC
A09B86WH676ZrmXt+oI05rpZ2y/NGw4gJxLLa4/bWF3g9xLfo21i+YGKwdOnNFpl
DEdCVHtjlMcOAU3+on6fOYuhXDcYd7PKGcCfLE7muOMOAtwyj0bUDBt7m+hneZgy
qbZHzDU2DZ5L/LXiMyuZj5rC7V4xUbfZZfXglG38YCW1WTieS3IjefaU2tREhu7I
rRkCK48ULDNNJR3dZK8IzXJRxusq1ICPG68I+nm/K37oZyTZWtfYZWehW/d/TnPa
tUiTwimabz7UUqaGq9ZptxwINcAigax0nl6dZ3EseeGhkDE6j71/3kqrkKPz4jth
+fCwHLOrI3c4Gq5qWgPvqcUlUneKf3DlOMtzPKYg7sMhla2zQmFpYCPzKfm77U/Z
BclGOH3FiwaK6MIjPJRUXTePXqnUseqCR8PCH/UPQUeBEVHFcMvqCaa15nALed8x
dFi76VywR9mahouuLNq6sUNePlvDd2B124PygNwegLlBfY9QmKONg9qRKOnQpuJ6
UprRJjLOOucZ/N/jn6+ShHkqmXsnY2MhfUoHUoMQ0QAI+n++e+2AuePo251kKWr8
QlqKxd9PMQU=
=LcGg
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature
support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information, if
available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs
by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs
and thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key
operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and
fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements"
* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
perf: Fix pmu_filter_match()
perf: Fix lockdep_assert_event_ctx()
perf/x86/amd/lbr: Adjust LBR regardless of filtering
perf/x86/utils: Fix uninitialized var in get_branch_type()
perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
perf/x86/amd: Support PERF_SAMPLE_ADDR
perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
perf/x86/uncore: Add new Raptor Lake S support
perf/x86/cstate: Add new Raptor Lake S support
perf/x86/msr: Add new Raptor Lake S support
perf/x86: Add new Raptor Lake S support
bpf: Check flags for branch stack in bpf_read_branch_records helper
perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
perf: Use sample_flags for raw_data
perf: Use sample_flags for addr
...
The objects placed at the head of vmlinux need special treatments:
- arch/$(SRCARCH)/Makefile adds them to head-y in order to place
them before other archives in the linker command line.
- arch/$(SRCARCH)/kernel/Makefile adds them to extra-y instead of
obj-y to avoid them going into built-in.a.
This commit gets rid of the latter.
Create vmlinux.a to collect all the objects that are unconditionally
linked to vmlinux. The objects listed in head-y are moved to the head
of vmlinux.a by using 'ar m'.
With this, arch/$(SRCARCH)/kernel/Makefile can consistently use obj-y
for builtin objects.
There is no *.o that is directly linked to vmlinux. Drop unneeded code
in scripts/clang-tools/gen_compile_commands.py.
$(AR) mPi needs 'T' to workaround the llvm-ar bug. The fix was suggested
by Nathan Chancellor [1].
[1]: https://lore.kernel.org/llvm/YyjjT5gQ2hGMH0ni@dev-arch.thelio-3990X/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Use the new sample_flags to indicate whether the raw data field is
filled by the PMU driver. Although it could check with the NULL,
follow the same rule with other fields.
Remove the raw field from the perf_sample_data_init() to minimize
the number of cache lines touched.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
PMU device driver perf_paiext supports Processor Activity
Instrumentation Extension (PAIE1), available with IBM z16:
- maps a 512 byte block to lowcore address 0x1508 called PAIE1 control
block.
- maps a 1024 byte block at PAIE1 control block entry with index 2.
- uses control register bit 14 to enable PAIE1 control block lookup.
- turn PAIE1 nnpa counting on and off by setting bit 63 in
PAIE1 control block entry with index 2.
- creates a sample with raw data on each context switch out when
at context switch some mapped counters have a value of nonzero.
This device driver only supports CPU wide context, no task context
is allowed.
Support for counting:
- one or more counters can be specified using
perf stat -e pai_ext/xxx/
where xxx stands for the counter event name. Multiple invocation
of this command is possible. The counter names are listed in
/sys/devices/pai_ext/events directory.
- one special counters can be specified using
perf stat -e pai_ext/NNPA_ALL/
which returns the sum of all incremented nnpa counters.
- multiple counting events can run in parallel.
Support for Sampling:
- one event pai_ext/NNPA_ALL/ is reserved for sampling.
The event collects data at context switch out and saves them in
the ring buffer.
- no multiple invocations are possible.
The PAIE1 nnpa counter events are system wide. No task context is
supported. Therefore some restrictions documented in function
paiext_busy() apply.
Extend qpaci assembly instruction to query supported memory mapped nnpa
counters. It returns the number of counters (no holes allowed in that
range).
PAIE1 nnpa counter events can not be created when a CPU hot plug
add is processed. This means a CPU hot plug add does not get
the necessary PAIE1 event to record PAIE1 nnpa counter increments
on the newly added CPU. CPU hot plug remove removes the event and
terminates the counting of PAIE1 counters immediately.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Uninline copy_oldmem_kernel() function and make it consistent
with a very similar memcpy_real() implementation, by moving
to code to crash_dump.c, where it actually belongs.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Function memcpy_real() is an univeral data mover that does not
require DAT mode to be able reading from a physical address.
Its advantage is an ability to read from any address, even
those for which no kernel virtual mapping exists.
Although memcpy_real() is interrupt-safe, there are no handlers
that make use of this function. The compiler instrumentation
have to be disabled and separate no-DAT stack used to allow
execution of the function once DAT mode is disabled.
Rework memcpy_real() to overcome these shortcomings. As result,
data copying (which is primarily reading out a crashed system
memory by a user process) is executed on a regular stack with
enabled interrupts. Also, use of memcpy_real_buf swap buffer
becomes unnecessary and the swapping is eliminated.
The above is achieved by using a fixed virtual address range
that spans a single page and remaps that page repeatedly when
memcpy_real() is called for a particular physical address.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Function smp_save_dump_cpus() collects CPU state of a crashed
system for secondary CPUs and for the IPL CPU very differently.
The Signal Processor stop-and-store-status orders are used for
the former while Hardware System Area requests and memcpy_real()
routine are called for the latter. In addition a system reset is
triggered, which pins smp_save_dump_cpus() function call before
CPU and device initialization.
Move the collection of IPL CPU state to a later stage when DAT
becomes available. That is needed to allow a follow-up rework of
memcpy_real() routine.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Temporary unsetting of the prefix page in memcpy_absolute() routine
poses a risk of executing code path with unexpectedly disabled prefix
page. This rework avoids the prefix page uninstalling and disabling
of normal and machine check interrupts when accessing the absolute
zero memory.
Although memcpy_absolute() routine can access the whole memory, it is
only used to update the absolute zero lowcore. This rework therefore
introduces a new mechanism for the absolute zero lowcore access and
scraps memcpy_absolute() routine for good.
Instead, an area is reserved in the virtual memory that is used for
the absolute lowcore access only. That area holds an array of 8KB
virtual mappings - one per CPU. Whenever a CPU is brought online, the
corresponding item is mapped to the real address of the previously
installed prefix page.
The absolute zero lowcore access works like this: a CPU calls the
new primitive get_abs_lowcore() to obtain its 8KB mapping as a
pointer to the struct lowcore. Virtual address references to that
pointer get translated to the real addresses of the prefix page,
which in turn gets swapped with the absolute zero memory addresses
due to prefixing. Once the pointer is not needed it must be released
with put_abs_lowcore() primitive:
struct lowcore *abs_lc;
unsigned long flags;
abs_lc = get_abs_lowcore(&flags);
abs_lc->... = ...;
put_abs_lowcore(abs_lc, flags);
To ensure the described mechanism works large segment- and region-
table entries must be avoided for the 8KB mappings. Failure to do
so results in usage of Region-Frame Absolute Address (RFAA) or
Segment-Frame Absolute Address (SFAA) large page fields. In that
case absolute addresses would be used to address the prefix page
instead of the real ones and the prefixing would get bypassed.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>