CPU Measurement counting facility events PROBLEM_STATE_CPU_CYCLES(32)
and PROBLEM_STATE_INSTRUCTIONS(33) are valid events. However the device
driver returns error -EOPNOTSUPP when these event are to be installed.
Fix this and allow installation of events PROBLEM_STATE_CPU_CYCLES,
PROBLEM_STATE_CPU_CYCLES:u, PROBLEM_STATE_INSTRUCTIONS and
PROBLEM_STATE_INSTRUCTIONS:u.
Kernel space counting only is still not supported by s390.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Nathan Chancellor reports that the s390 vmlinux fails to link with
GNU ld < 2.36 since commit 99cb0d917f ("arch: fix broken BuildID
for arm64 and riscv").
It happens for defconfig, or more specifically for CONFIG_EXPOLINE=y.
$ s390x-linux-gnu-ld --version | head -n1
GNU ld (GNU Binutils for Debian) 2.35.2
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- allnoconfig
$ ./scripts/config -e CONFIG_EXPOLINE
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- olddefconfig
$ make -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu-
`.exit.text' referenced in section `.s390_return_reg' of drivers/base/dd.o: defined in discarded section `.exit.text' of drivers/base/dd.o
make[1]: *** [scripts/Makefile.vmlinux:34: vmlinux] Error 1
make: *** [Makefile:1252: vmlinux] Error 2
arch/s390/kernel/vmlinux.lds.S wants to keep EXIT_TEXT:
.exit.text : {
EXIT_TEXT
}
But, at the same time, EXIT_TEXT is thrown away by DISCARD because
s390 does not define RUNTIME_DISCARD_EXIT.
I still do not understand why the latter wins after 99cb0d917f,
but defining RUNTIME_DISCARD_EXIT seems correct because the comment
line in arch/s390/kernel/vmlinux.lds.S says:
/*
* .exit.text is discarded at runtime, not link time,
* to deal with references from __bug_table
*/
Nathan also found that binutils commit 21401fc7bf67 ("Duplicate output
sections in scripts") cured this issue, so we cannot reproduce it with
binutils 2.36+, but it is better to not rely on it.
Fixes: 99cb0d917f ("arch: fix broken BuildID for arm64 and riscv")
Link: https://lore.kernel.org/all/Y7Jal56f6UBh1abE@dev-arch.thelio-3990X/
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20230105031306.1455409-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Symbols _edata and _end in the linker script are the
only unaligned expicitly on page boundary. Although
_end is aligned implicitly by BSS_SECTION macro that
is still inconsistent and could lead to a bug if a tool
or function would assume that _edata is as aligned as
others.
For example, vmem_map_init() function does not align
symbols _etext, _einittext etc. Should these symbols
be unaligned as well, the size of ranges to update
were short on one page.
Instead of fixing every occurrence of this kind in the
code and external tools just force the alignment on
these two symbols.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The archs that use cputime_to_nsecs() internally provide their own
definition and don't need the fallback. cputime_to_usecs() unused except
in this fallback, and is not defined anywhere.
This removes the final remnant of the cputime_t code from the kernel.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20221220070705.2958959-1-npiggin@gmail.com
The <asm/archrandom.h> header is a random.c private detail, not
something to be called by other code. As such, don't make it
automatically available by way of random.h.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
* Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
* Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
"Fix a number of issues with MTE, such as races on the tags being
initialised vs the PG_mte_tagged flag as well as the lack of support
for VM_SHARED when KVM is involved. Patches from Catalin Marinas and
Peter Collingbourne").
* Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
* Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
* Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
* Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
* Second batch of the lazy destroy patches
* First batch of KVM changes for kernel virtual != physical address support
* Removal of a unused function
x86:
* Allow compiling out SMM support
* Cleanup and documentation of SMM state save area format
* Preserve interrupt shadow in SMM state save area
* Respond to generic signals during slow page faults
* Fixes and optimizations for the non-executable huge page errata fix.
* Reprogram all performance counters on PMU filter change
* Cleanups to Hyper-V emulation and tests
* Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
running on top of a L1 Hyper-V hypervisor)
* Advertise several new Intel features
* x86 Xen-for-KVM:
** Allow the Xen runstate information to cross a page boundary
** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
** Add support for 32-bit guests in SCHEDOP_poll
* Notable x86 fixes and cleanups:
** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
** Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
** Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
** Advertise (on AMD) that the SMM_CTL MSR is not supported
** Remove unnecessary exports
Generic:
* Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
* Add support for pinning vCPUs in dirty_log_perf_test.
* Rename the so called "perf_util" framework to "memstress".
* Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress tests.
* Add a common ucall implementation; code dedup and pre-work for running
SEV (and beyond) guests in selftests.
* Provide a common constructor and arch hook, which will eventually be
used by x86 to automatically select the right hypercall (AMD vs. Intel).
* A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
breakpoints, stage-2 faults and access tracking.
* x86-specific selftest changes:
** Clean up x86's page table management.
** Clean up and enhance the "smaller maxphyaddr" test, and add a related
test to cover generic emulation failure.
** Clean up the nEPT support checks.
** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
** Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
Documentation:
* Remove deleted ioctls from documentation
* Clean up the docs for the x86 MSR filter.
* Various fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
=QAYX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
- Thoroughly rewrite the data structures that implement perf task context handling,
with the goal of fixing various quirks and unfeatures both in already merged,
and in upcoming proposed code.
The old data structure is the per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and
a single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology discovery
on various hardware models.
- Misc fixes & cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmOXjuURHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j+VhAAknimsLwenTHCGQp7yqsWSKfBr9KI2UgD
ZgtQuuwRwSzwqAEwC5Mt6zcIkxRNhU1ookFPqQbpY3XA0W4aNakUk8bDF8QIEKW0
MFWxn7PtReWqKcUay2oEGGurqZ5OtfpljJGxigQh5oVeMGc+itIwHF2JefeyoRnu
pq7R2qDgOBb7Np4lWTdqXGmKufzp04/nely2IZQBO8x80cGRZiKQIrGrch6vLUf7
3iEz9rwmvPyz0aczYSpa/duEZDMLm4lWNK4oMUEXuUWC8gU7CUzBJsJ3AS5NgxAu
yGBXe/s7GHqwtc/F30l5gK/J5WAyK83IF7sckxTj0dBUpyC6wQwwYPm8BaCAMoqN
X6mU7Ve938Siih1TyOBZfZsrtDDILhV2N/nku2erb3iqes26u0RcT25rWtu9Yqvn
hm4Gm6cmkHWq4EOHSBvAdC7l7lDZ3fyVI5+8nN9ly9Qv867HjG70dvIr9iEEolpX
rhFAz8r/NwTXhDY0AmFZcOkrnNV3IuHtibJ/9wJlgJNqDPqN12Wxqdzy0Nj3HH6G
EsukBO05cWaDS0gB8MpaO6Q6YtqAr87ZY+afHDBwcfkME50/CyBLr5rd47dTR+Ip
B+zreYKcaNHdEMd1A9KULRTTDnEjlXYMwjVVJiPRV0jcmA3dHmM46HN5Ae9NdO6+
R2BAWv9XR6M=
=KNaI
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Thoroughly rewrite the data structures that implement perf task
context handling, with the goal of fixing various quirks and
unfeatures both in already merged, and in upcoming proposed code.
The old data structure is the per task and per cpu
perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
In this new design this is replaced with a single task context and a
single CPU context, plus intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
[ See commit bd27568117 for more details. ]
This rewrite was developed by Peter Zijlstra and Ravi Bangoria.
- Optimize perf_tp_event()
- Update the Intel uncore PMU driver, extending it with UPI topology
discovery on various hardware models.
- Misc fixes & cleanups
* tag 'perf-core-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
perf/x86/intel/uncore: Fix reference count leak in __uncore_imc_init_box()
perf/x86/intel/uncore: Fix reference count leak in snr_uncore_mmio_map()
perf/x86/intel/uncore: Fix reference count leak in hswep_has_limit_sbox()
perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()
perf/x86/intel/uncore: Make set_mapping() procedure void
perf/x86/intel/uncore: Update sysfs-devices-mapping file
perf/x86/intel/uncore: Enable UPI topology discovery for Sapphire Rapids
perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server
perf/x86/intel/uncore: Get UPI NodeID and GroupID
perf/x86/intel/uncore: Enable UPI topology discovery for Skylake Server
perf/x86/intel/uncore: Generalize get_topology() for SKX PMUs
perf/x86/intel/uncore: Disable I/O stacks to PMU mapping on ICX-D
perf/x86/intel/uncore: Clear attr_update properly
perf/x86/intel/uncore: Introduce UPI topology type
perf/x86/intel/uncore: Generalize IIO topology support
perf/core: Don't allow grouping events from different hw pmus
perf/amd/ibs: Make IBS a core pmu
perf: Fix function pointer case
perf/x86/amd: Remove the repeated declaration
perf: Fix possible memleak in pmu_dev_alloc()
...
- Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and the
work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown
timer. Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync() should
be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding rearm
attempts silently. A warning for the case that a rearm attempt of a
shutdown timer is detected would not be really helpful because it's
entirely unclear how it should be acted upon. The only way to address
such a case is to add 'if (in_shutdown)' conditionals all over the
place. This is error prone and in most cases of teardown not required
all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
- Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmOUuC0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodpZD/9kCDi009n65QFF1J4kE5aZuABbRMtO
7sy66fJpDyB/MtcbPPH29uzQUEs1VMTQVB+ZM+7e1YGoxSWuSTzeoFH+yK1w4tEZ
VPbOcvUEjG0esKUehwYFeOjSnIjy6M1Y41aOUaDnq00/azhfTrzLxQA1BbbFbkpw
S7u2hllbyRJ8KdqQyV9cVpXmze6fcpdtNhdQeoA7qQCsSPnJ24MSpZ/PG9bAovq8
75IRROT7CQRd6AMKAVpA9Ov8ak9nbY3EgQmoKcp5ZXfXz8kD3nHky9Lste7djgYB
U085Vwcelt39V5iXevDFfzrBYRUqrMKOXIf2xnnoDNeF5Jlj5gChSNVZwTLO38wu
RFEVCjCjuC41GQJWSck9LRSYdriW/htVbEE8JLc6uzUJGSyjshgJRn/PK4HjpiLY
AvH2rd4rAap/rjDKvfWvBqClcfL7pyBvavgJeyJ8oXyQjHrHQwapPcsMFBm0Cky5
soF0Lr3hIlQ9u+hwUuFdNZkY9mOg09g9ImEjW1AZTKY0DfJMc5JAGjjSCfuopVUN
Uf/qqcUeQPSEaC+C9xiFs0T3svYFxBqpgPv4B6t8zAnozon9fyZs+lv5KdRg4X77
qX395qc6PaOSQlA7gcxVw3vjCPd0+hljXX84BORP7z+uzcsomvIH1MxJepIHmgaJ
JrYbSZ5qzY5TTA==
=JlDe
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Updates for timers, timekeeping and drivers:
Core:
- The timer_shutdown[_sync]() infrastructure:
Tearing down timers can be tedious when there are circular
dependencies to other things which need to be torn down. A prime
example is timer and workqueue where the timer schedules work and
the work arms the timer.
What needs to prevented is that pending work which is drained via
destroy_workqueue() does not rearm the previously shutdown timer.
Nothing in that shutdown sequence relies on the timer being
functional.
The conclusion was that the semantics of timer_shutdown_sync()
should be:
- timer is not enqueued
- timer callback is not running
- timer cannot be rearmed
Preventing the rearming of shutdown timers is done by discarding
rearm attempts silently.
A warning for the case that a rearm attempt of a shutdown timer is
detected would not be really helpful because it's entirely unclear
how it should be acted upon. The only way to address such a case is
to add 'if (in_shutdown)' conditionals all over the place. This is
error prone and in most cases of teardown not required all.
- The real fix for the bluetooth HCI teardown based on
timer_shutdown_sync().
A larger scale conversion to timer_shutdown_sync() is work in
progress.
- Consolidation of VDSO time namespace helper functions
- Small fixes for timer and timerqueue
Drivers:
- Prevent integer overflow on the XGene-1 TVAL register which causes
an never ending interrupt storm.
- The usual set of new device tree bindings
- Small fixes and improvements all over the place"
* tag 'timers-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
dt-bindings: timer: renesas,cmt: Add r8a779g0 CMT support
dt-bindings: timer: renesas,tmu: Add r8a779g0 support
clocksource/drivers/arm_arch_timer: Use kstrtobool() instead of strtobool()
clocksource/drivers/timer-ti-dm: Fix missing clk_disable_unprepare in dmtimer_systimer_init_clock()
clocksource/drivers/timer-ti-dm: Clear settings on probe and free
clocksource/drivers/timer-ti-dm: Make timer_get_irq static
clocksource/drivers/timer-ti-dm: Fix warning for omap_timer_match
clocksource/drivers/arm_arch_timer: Fix XGene-1 TVAL register math error
clocksource/drivers/timer-npcm7xx: Enable timer 1 clock before use
dt-bindings: timer: nuvoton,npcm7xx-timer: Allow specifying all clocks
dt-bindings: timer: rockchip: Add rockchip,rk3128-timer
clockevents: Repair kernel-doc for clockevent_delta2ns()
clocksource/drivers/ingenic-ost: Define pm functions properly in platform_driver struct
clocksource/drivers/sh_cmt: Access registers according to spec
vdso/timens: Refactor copy-pasted find_timens_vvar_page() helper into one copy
Bluetooth: hci_qca: Fix the teardown problem for real
timers: Update the documentation to reflect on the new timer_shutdown() API
timers: Provide timer_shutdown[_sync]()
timers: Add shutdown mechanism to the internal functions
timers: Split [try_to_]del_timer[_sync]() to prepare for shutdown mode
...
- Factor out handle_write() function and simplify 3215 console
write operation.
- When 3170 terminal emulator is connected to the 3215 console
driver the boot time could be very long due to limited buffer
space or missing operator input. Add con3215_drop command line
parameter and con3215_drop sysfs attribute file to instruct
the kernel drop console data when such conditions are met.
- Fix white space errors in 3215 console driver.
- Move enum paiext_mode definition to a header file and rename
it to paievt_mode to indicate this is now used for several
events. Rename PAI_MODE_COUNTER to PAI_MODE_COUNTING to make
consistent with PAI_MODE_SAMPLING.
- Simplify the logic of PMU pai_crypto mapped buffer reference
counter and make it consistent with PMU pai_ext.
- Rename PMU pai_crypto mapped buffer structure member users
to active_events to make it consistent with PMU pai_ext.
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option.
This results in saving of 12K per 1M hugetlb page (~1.2%)
and 32764K per 2G hugetlb page (~1.6%).
- Use generic serial.h, bugs.h, shmparam.h and vga.h header
files and scrap s390-specific versions.
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning
and provide sane CPU-to-node and CPU-to-CPU distance mappings
implementat a minimal version of setup_per_cpu_areas().
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes.
- Avoid unnecessary lookup of a pointer to MSI descriptor when
setting IRQ affinity for a PCI device.
- Get rid of "an incompatible function type cast" warning by
changing debug_sprintf_format_fn() function prototype so it
matches the debug_format_proc_t function type.
- Remove unused info_blk_hdr__pcpus() and get_page_state()
functions.
- Get rid of clang "unused unused insn cache ops function"
warning by moving s390_insn definition to a private header.
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE
is enabled.
- Use kstrobool() to parse sclp_con_drop parameter to make it
identical to the con3215_drop parameter and allow passing
values like "yes" and "true".
- Use sysfs_emit() for all SCLP sysfs show functions, which is
the current standard way to generate output strings.
- Make SCLP con_drop sysfs attribute also writable and allow to
change its value during runtime. This makes SCLP console drop
handling consistent with the 3215 device driver.
- Virtual and physical addresses are indentical on s390. However,
there is still a confusion when pointers are directly casted to
physical addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers.
- Support for power managemant has been removed from s390 since
quite some time. Remove unused power managemant code from the
appldata device driver.
- Allow memory tools like KASAN see memory accesses from the
checksum code. Switch to GENERIC_CSUM if KASAN is enabled,
just like x86 does.
- Add support of ECKD DASDs disks so it could be used as boot
and dump devices.
- Follow checkpatch recommendations and use octal values instead
of S_IRUGO and S_IWUSR for dump device attributes in sysfs.
- Changes to vx-insn.h do not cause a recompile of C files that
use asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include
header file to avoid this problem.
- Use vector instruction macros instead of byte patterns to
increase register validation routine readability.
- The current machine check register validation handling does not
take into account various scenarios and might lead to killing a
wrong user process or potentially ignore corrupted FPU registers.
Simplify logic of the machine check handler and stop the whole
machine if the previous context was kerenel mode. If the previous
context was user mode, kill the current task.
- Introduce sclp_emergency_printk() function which can be used to
emit a message in emergency cases. It is supposed to be used in
cases where regular console device drivers may not work anymore,
e.g. unrecoverable machine checks.
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation.
- In case a system will be stopped because of an unrecoverable
machine check error print the machine check interruption code
to give a hint of what went wrong.
- Move storage error checking from the assembly entry code to C
in order to simplify machine check handling. Enter the handler
with DAT turned on, which simplifies the entry code even more.
- The machine check extended save areas are allocated using
a private "nmi_save_areas" slab cache which guarantees a
required power-of-two alignment. Get rid of that cache in
favour of kmalloc().
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCY5ckrhccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8NlrAQD8NCLeEAkhGCRnzdTyngExCrzV
Mw//cEnksUkIPqalJgEArbyFjGh05ecNaiDQduH8Gh94/qOhGE4obMdTgMWq7QY=
=3aou
-----END PGP SIGNATURE-----
Merge tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Factor out handle_write() function and simplify 3215 console write
operation
- When 3170 terminal emulator is connected to the 3215 console driver
the boot time could be very long due to limited buffer space or
missing operator input. Add con3215_drop command line parameter and
con3215_drop sysfs attribute file to instruct the kernel drop console
data when such conditions are met
- Fix white space errors in 3215 console driver
- Move enum paiext_mode definition to a header file and rename it to
paievt_mode to indicate this is now used for several events. Rename
PAI_MODE_COUNTER to PAI_MODE_COUNTING to make consistent with
PAI_MODE_SAMPLING
- Simplify the logic of PMU pai_crypto mapped buffer reference counter
and make it consistent with PMU pai_ext
- Rename PMU pai_crypto mapped buffer structure member users to
active_events to make it consistent with PMU pai_ext
- Enable HUGETLB_PAGE_OPTIMIZE_VMEMMAP configuration option. This
results in saving of 12K per 1M hugetlb page (~1.2%) and 32764K per
2G hugetlb page (~1.6%)
- Use generic serial.h, bugs.h, shmparam.h and vga.h header files and
scrap s390-specific versions
- The generic percpu setup code does not expect the s390-like
implementation and emits a warning. To get rid of that warning and
provide sane CPU-to-node and CPU-to-CPU distance mappings implementat
a minimal version of setup_per_cpu_areas()
- Use kstrtobool() instead of strtobool() for re-IPL sysfs device
attributes
- Avoid unnecessary lookup of a pointer to MSI descriptor when setting
IRQ affinity for a PCI device
- Get rid of "an incompatible function type cast" warning by changing
debug_sprintf_format_fn() function prototype so it matches the
debug_format_proc_t function type
- Remove unused info_blk_hdr__pcpus() and get_page_state() functions
- Get rid of clang "unused unused insn cache ops function" warning by
moving s390_insn definition to a private header
- Get rid of clang "unused function" warning by making function
raw3270_state_final() only available if CONFIG_TN3270_CONSOLE is
enabled
- Use kstrobool() to parse sclp_con_drop parameter to make it identical
to the con3215_drop parameter and allow passing values like "yes" and
"true"
- Use sysfs_emit() for all SCLP sysfs show functions, which is the
current standard way to generate output strings
- Make SCLP con_drop sysfs attribute also writable and allow to change
its value during runtime. This makes SCLP console drop handling
consistent with the 3215 device driver
- Virtual and physical addresses are indentical on s390. However, there
is still a confusion when pointers are directly casted to physical
addresses or vice versa. Use correct address converters
virt_to_phys() and phys_to_virt() for s390 channel IO drivers
- Support for power managemant has been removed from s390 since quite
some time. Remove unused power managemant code from the appldata
device driver
- Allow memory tools like KASAN see memory accesses from the checksum
code. Switch to GENERIC_CSUM if KASAN is enabled, just like x86 does
- Add support of ECKD DASDs disks so it could be used as boot and dump
devices
- Follow checkpatch recommendations and use octal values instead of
S_IRUGO and S_IWUSR for dump device attributes in sysfs
- Changes to vx-insn.h do not cause a recompile of C files that use
asm(".include \"asm/vx-insn.h\"\n") magic to access vector
instruction macros from inline assemblies. Add wrapper include header
file to avoid this problem
- Use vector instruction macros instead of byte patterns to increase
register validation routine readability
- The current machine check register validation handling does not take
into account various scenarios and might lead to killing a wrong user
process or potentially ignore corrupted FPU registers. Simplify logic
of the machine check handler and stop the whole machine if the
previous context was kerenel mode. If the previous context was user
mode, kill the current task
- Introduce sclp_emergency_printk() function which can be used to emit
a message in emergency cases. It is supposed to be used in cases
where regular console device drivers may not work anymore, e.g.
unrecoverable machine checks
Keep the early Service-Call Control Block so it can also be used
after initdata has been freed to allow sclp_emergency_printk()
implementation
- In case a system will be stopped because of an unrecoverable machine
check error print the machine check interruption code to give a hint
of what went wrong
- Move storage error checking from the assembly entry code to C in
order to simplify machine check handling. Enter the handler with DAT
turned on, which simplifies the entry code even more
- The machine check extended save areas are allocated using a private
"nmi_save_areas" slab cache which guarantees a required power-of-two
alignment. Get rid of that cache in favour of kmalloc()
* tag 's390-6.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (38 commits)
s390/nmi: get rid of private slab cache
s390/nmi: move storage error checking back to C, enter with DAT on
s390/nmi: print machine check interruption code before stopping system
s390/sclp: introduce sclp_emergency_printk()
s390/sclp: keep sclp_early_sccb
s390/nmi: rework register validation handling
s390/nmi: use vector instruction macros instead of byte patterns
s390/vx: add vx-insn.h wrapper include file
s390/ipl: use octal values instead of S_* macros
s390/ipl: add eckd dump support
s390/ipl: add eckd support
vfio/ccw: identify CCW data addresses as physical
vfio/ccw: sort out physical vs virtual pointers usage
s390/checksum: support GENERIC_CSUM, enable it for KASAN
s390/appldata: remove power management callbacks
s390/cio: sort out physical vs virtual pointers usage
s390/sclp: allow to change sclp_console_drop during runtime
s390/sclp: convert to use sysfs_emit()
s390/sclp: use kstrobool() to parse sclp_con_drop parameter
s390/3270: make raw3270_state_final() depend on CONFIG_TN3270_CONSOLE
...
Get rid of private "nmi_save_areas" slab cache. The only reason this was
introduced years ago was that with some slab debugging options allocations
would only guarantee a minimum alignment of ARCH_KMALLOC_MINALIGN, which
was eight bytes back then. This is not sufficient for the extended machine
check save area.
However since commit 59bb47985c ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)") kmalloc guarantees a power-of-two
alignment even with debugging options enabled.
Therefore the private slab cache can be removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Checking for storage errors in machine check entry code was done in order
to handle also storage errors on kernel page tables. However this is
extremely unlikely and some basic assumptions what works on machine check
entry are necessary anyway. In order to simplify machine check handling
delay checking for storage errors to C code.
With this also change the machine check new PSW to have DAT on, which
simplifies the entry code even further.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In case a system will be stopped because of e.g. missing validity bits
print the machine check interruption code before the system is stopped.
This is helpful, since up to now no message was printed in such a
case. Only a disabled wait PSW was loaded, which doesn't give a hint of
what went wrong.
Improve this by printing a message with debug information.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
If a machine check happens in kernel mode, and the machine check
interruption code indicates that e.g. vector register contents in the
machine check area are not valid, the logic is to kill current.
The idea behind this was that if within kernel context vector
registers are not used then it is sufficient to kill the current user
space process to avoid that it continues with potentially corrupt
register contents. This however does not necessarily work, since the
current code does not take into account that a machine check can also
happen when a kernel thread is running (= no user space context), and
in addition there is no way to distinguish between the "previous" and
"next" user process task, if the machine check happens when a task
switch happens.
Given that machine checks with invalid saved register contents in the
machine check save area are extremely rare, simplify the logic: if
register contents are invalid and the previous context was kernel
mode, stop the whole machine. If the previous context was user mode,
kill the corresponding task.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Use vector instruction macros instead of byte patterns to increase
readability. The generated code is nearly identical:
- 1e8: e7 0f 10 00 00 36 vlm %v0,%v15,0(%r1)
- 1ee: e7 0f 11 00 0c 36 vlm %v16,%v31,256(%r1)
+ 1e8: e7 0f 10 00 30 36 vlm %v0,%v15,0(%r1),3
+ 1ee: e7 0f 11 00 3c 36 vlm %v16,%v31,256(%r1),3
By using the VLM macro the alignment hint is automatically specified
too. Even though from a performance perspective it doesn't matter at
all for the machine check code, this shows yet another benefit when
using the macros.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The vector instruction macros can also be used in inline assemblies. For
this the magic
asm(".include \"asm/vx-insn.h\"\n");
must be added to C files in order to avoid that the pre-processor
eliminates the __ASSEMBLY__ guarded macros. This however comes with the
problem that changes to asm/vx-insn.h do not cause a recompile of C files
which have only this magic statement instead of a proper include statement.
This can be observed with the arch/s390/kernel/fpu.c file.
In order to fix this problem and also to avoid that the include must
be specified twice, add a wrapper include header file which will do
all necessary steps.
This way only the vx-insn.h header file needs to be included and changes to
the new vx-insn-asm.h header file cause a recompile of all dependent files
like it should.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
octal values are easier to read and checkpatch also recommends
to use them, so replace all the S_* macros with their counterparts.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to use ECKD disks as dump device
to linux. The new dump type is called 'eckd_dump', parameters
are the same as for eckd ipl.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This adds support to IPL from ECKD DASDs to linux.
It introduces a few sysfs files in /sys/firmware/reipl/eckd:
bootprog: the boot program selector
clear: whether to issue a diag308 LOAD_NORMAL or LOAD_CLEAR
device: the device to ipl from
br_chr: Cylinder/Head/Record number to read the bootrecord from.
Might be '0' or 'auto' if it should be read from the
volume label.
scpdata: data to be passed to the ipl'd program.
The new ipl type is called 'eckd'.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
find_timens_vvar_page() is not architecture-specific, as can be seen from
how all five per-architecture versions of it are the same.
(arm64, powerpc and riscv are exactly the same; x86 and s390 have two
characters difference inside a comment, less blank lines, and mark the
!CONFIG_TIME_NS version as inline.)
Refactor the five copies into a central copy in kernel/time/namespace.c.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130115320.2918447-1-jannh@google.com
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmN/eYwACgkQ41TmuOI4
ufjoxA/9Et38aXO/IhmUt8v0QhA4yec+sc5GSFfQSYehej/1Vqhw0DXx+ORUiRgg
+rtiXJSSqkuD2dL+BDffY2xoul6nzNdVf4AbkcnrWscfWr6xwVYlPvuL0ymGI6J2
U/IPedRoKw0bHw/wHs05yV5PubrRwDFERKhtyXWYGbPJhX0w2n3IFOoKH1oWBhLW
Dc8jEs6t3gDbJ71Er0xoeBUoiuu+PgZG06cpOvzBZ0KclRgjADXyISqqk8/4mu8w
R+/Wf8NcrbQYV1jfCeq5zIsKC8uvnFj25UuyTLumn5vh+dNNsvE72Khe4tz7LI0I
ZPZ+GZuemu7Yi12dKjw4Sw3ui0ejWH/5XL1SVB0X/xYIWrBqOot+Lq6538GCng+c
tJt+zsu64VFgXCCZ8O9qO4uE4DBL70H3ThT7VZxIghSTZtY0xh3uFc64f3/3d9dy
K4WTJHrmMxhXaA/rqtIa8I53JvFl8CztofZATiQQesyPuc7lZ01w1Co5el4xYaxe
YknyMTq11qf/iYqVOW7sjoWW/YRuuMZ4+FhpI3o/SllVdN98iTwkk1kP3wcoBO5P
bvzpm+WXHbv9OxifPrqkqv34+upbjfEmSogHudQzagBX4vl3rZRfBCdQGCAha0Uc
ZYyg68kiil5sWmHI/Ln/ZjANYfbS5sF0CreuWxnmqcwKl2NSN/E=
=/1yt
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address support
- Removal of a unused function
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The size of the TOD programmable field was incorrectly increased from
four to eight bytes with commit 1a2c5840ac ("s390/dump: cleanup CPU
save area handling").
This leads to an elf notes section NT_S390_TODPREG which has a size of
eight instead of four bytes in case of kdump, however even worse is
that the contents is incorrect: it is supposed to contain only the
contents of the TOD programmable field, but in fact contains a mix of
the TOD programmable field (32 bit upper bits) and parts of the CPU
timer register (lower 32 bits).
Fix this by simply changing the size of the todpreg field within the
save area structure. This will implicitly also fix the size of the
corresponding elf notes sections.
This also gets rid of this compile time warning:
in function ‘fortify_memcpy_chk’,
inlined from ‘save_area_add_regs’ at arch/s390/kernel/crash_dump.c:99:2:
./include/linux/fortify-string.h:413:25: error: call to ‘__read_overflow2_field’
declared with attribute warning: detected read beyond size of field
(2nd parameter); maybe use struct_group()? [-Werror=attribute-warning]
413 | __read_overflow2_field(q_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 1a2c5840ac ("s390/dump: cleanup CPU save area handling")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an unused insn cache ops function:
arch/s390/kernel/kprobes.c:34:1:
error: unused function 'is_kprobe_s390_insn_slot' [-Werror,-Wunused-function]
DEFINE_INSN_CACHE_OPS(s390_insn);
^
./include/linux/kprobes.h:335:20: note: expanded from macro 'DEFINE_INSN_CACHE_OPS'
static inline bool is_kprobe_##__name##_slot(unsigned long addr) \
^
<scratch space>:88:1: note: expanded from here
is_kprobe_s390_insn_slot
^
Move the definition to a private header file, which is also similar to
the generic insn cache ops.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
clang warns about an incompatible function type cast:
CC arch/s390/kernel/debug.o
arch/s390/kernel/debug.c:142:2: error: cast from 'int (*)(debug_info_t *, struct debug_view *, char *, debug_sprintf_entry_t *)' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, debug_sprintf_entry_t *)') to 'debug_format_proc_t *' (aka 'int (*)(struct debug_info *, struct debug_view *, char *, const char *)') converts to incompatible function type [-Werror,-Wcast-function-type-strict]
(debug_format_proc_t *)&debug_sprintf_format_fn,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Get rid of this warning by changing debug_sprintf_format_fn() so it matches
the debug_format_proc_t function type, and do the cast of the last
parameter within the function itself.
This is the standard way of handling such cases anyway.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.
In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.
While at it, include the corresponding header file (<linux/kstrtox.h>)
Link: https://lore.kernel.org/all/cover.1667336095.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
If the appropriate UV feature bit is set, there is no need to perform
an export before import.
The misc feature indicates, among other things, that importing a shared
page from a different protected VM will automatically also transfer its
ownership.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20221111170632.77622-5-imbrenda@linux.ibm.com
Message-Id: <20221111170632.77622-5-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).
Notably:
- HW breakpoint PMU
- ARM big.little PMU / Intel ADL PMU
- Intel Branch Monitoring PMU
- AMD IBS PMU
- S390 cpum_cf PMU
- PowerPC trace_imc PMU
*Current design:*
Currently we have a per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
^ | ^ | ^
`---------------------------------' | `--> pmu ---'
v ^
perf_event ------'
Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task related perf_event_context's have a
pointer back to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.
Each perf_event is then associated with its PMU and one
perf_event_context.
*Proposed design:*
New design proposed by this patch reduce to a single task context and
a single CPU context but adds some intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
^ | ^ ^
`---------------------------' | |
| | perf_cpu_pmu_context <--.
| `----. ^ |
| | | |
| v v |
| ,--> perf_event_pmu_context |
| | |
| | |
v v |
perf_event ---> pmu ----------------'
With the new design, perf_event_context will hold all events for all
pmus in the (respective pinned/flexible) rbtrees. This can be achieved
by adding pmu to rbtree key:
{cpu, pmu, cgroup, group_index}
Each perf_event_context carries a list of perf_event_pmu_context which
is used to hold per-pmu-per-context state. For example, it keeps track
of currently active events for that pmu, a pmu specific task_ctx_data,
a flag to tell whether rotation is required or not etc.
Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu
state like hrtimer details to drive the event rotation, a pointer to
perf_event_pmu_context of currently running task and some other
ancillary information.
Each perf_event is associated to it's pmu, perf_event_context and
perf_event_pmu_context.
Further optimizations to current implementation are possible. For
example, ctx_resched() can be optimized to reschedule only single pmu
events.
Much thanks to Ravi for picking this up and pushing it towards
completion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
Commit 838d9bb62d ("perf: Use sample_flags for raw_data")
changed the way the raw data of an event is collected.
Adjust the PMU pai_ext to the new scheme.
Fixes: 838d9bb62d ("perf: Use sample_flags for raw_data")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rename structure member users to active_events to make it consistent
with PMU pai_ext. Also use the same prefix syntax for increment and
decrement operators in both PMUs.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Rework the mapped buffer reference count in PMU pai_crypto
to match the same technique as in PMU pai_ext.
This simplifies the logic.
Do not count the individual number of counter and sampling
processes. Remember the type of access and the total number of
references to the buffer.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move enum definition to header file. This is done in preparation
for a follow on patch where this enum will be used in another source
file.
Also change the enum name from paiext_mode to paievt_mode
to indicate this enum is now used for several events.
Make naming consistent and rename PAI_MODE_COUNTER to PAI_MODE_COUNTING.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Fix virtual vs physical address confusion (which currently are the
same).
sie_block is accessed in entry.S and passed it to hardware, which is why
both its physical and virtual address are needed. To avoid every caller
having to do the virtual-physical conversion, add a new function sie64a()
which converts the virtual address to physical.
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20221020143159.294605-3-nrb@linux.ibm.com
Message-Id: <20221020143159.294605-3-nrb@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
simply use the get_random_{u8,u16}() functions, which are faster than
wasting the additional bytes from a 32-bit value. This was done by hand,
identifying all of the places where one of the random integer functions
was used in a non-32-bit context.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
linux-next for a couple of months without, to my knowledge, any negative
reports (or any positive ones, come to that).
- Also the Maple Tree from Liam R. Howlett. An overlapping range-based
tree for vmas. It it apparently slight more efficient in its own right,
but is mainly targeted at enabling work to reduce mmap_lock contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
(https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
This has yet to be addressed due to Liam's unfortunately timed
vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down to
the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
=xfWx
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc. in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing particular
sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmM+4vcVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGY2IQAInr0JUNnkkxwUSXtOcQuA3IK8RJ
FbU9HXJRoV9H+7+l3SMlN7mIbrs5eE5fTY3iwQ3CVe139d1+1q7nvTMRv8owywJx
GBgzswncuu1lk7iQQ//CxiqMwSCG8GJdYn1uDVy4I5jg3o+DtFZJtyq2Wb7pqsMm
ZhZ4PozRN+idYQJSF6Vx/zEVLHI7quMBwfe4CME8/0Kg2+hnYzbXV/aUf0ED2emq
zdCMDQgIOK5AhY+8qgMXKYnBUJMTqBp6LoR4p3ApfUkwRFY0sGa0/LK3U/B22OE7
uWyR4fCUExGyerlcHEVev+9eBfmsLLPyqlchNwpSDOPf5OSdnKmgqJEBR/Cvx0eh
URerPk7EHxyH3G8yi+cU2GtofNTGc5RHPRgJE2ADsQEi5TAUKGmbXMlsFRL/51Vn
lTANZObBNa1d4enljF6TfTL5nuccOa+DKvXnH9fQ49t0QdtSikv6J/lGwilwm1Sr
BctmCsySPuURZfkpI9OQnLuouloMXl9f7Q/+S39haS/tSgvPpyITyO71nxDnXn/s
BbFObZJUk9QkqOACjBP1hNErTLt83uBxQ9z+rDCw/SbLIe4nw0wyneuygfHI5rI8
3RZB2DbGauuJHX2Zs6YGS14SLSY33IsLqKR1/Vy3LrPvOHuEvNiOR8LITq5E0YCK
OffK2Y5cIlXR0QWf
=DHiN
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Remove potentially incomplete targets when Kbuid is interrupted by
SIGINT etc in case GNU Make may miss to do that when stderr is piped
to another program.
- Rewrite the single target build so it works more correctly.
- Fix rpm-pkg builds with V=1.
- List top-level subdirectories in ./Kbuild.
- Ignore auto-generated __kstrtab_* and __kstrtabns_* symbols in
kallsyms.
- Avoid two different modules in lib/zstd/ having shared code, which
potentially causes building the common code as build-in and modular
back-and-forth.
- Unify two modpost invocations to optimize the build process.
- Remove head-y syntax in favor of linker scripts for placing
particular sections in the head of vmlinux.
- Bump the minimal GNU Make version to 3.82.
- Clean up misc Makefiles and scripts.
* tag 'kbuild-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
docs: bump minimal GNU Make version to 3.82
ia64: simplify esi object addition in Makefile
Revert "kbuild: Check if linker supports the -X option"
kbuild: rebuild .vmlinux.export.o when its prerequisite is updated
kbuild: move modules.builtin(.modinfo) rules to Makefile.vmlinux_o
zstd: Fixing mixed module-builtin objects
kallsyms: ignore __kstrtab_* and __kstrtabns_* symbols
kallsyms: take the input file instead of reading stdin
kallsyms: drop duplicated ignore patterns from kallsyms.c
kbuild: reuse mksysmap output for kallsyms
mksysmap: update comment about __crc_*
kbuild: remove head-y syntax
kbuild: use obj-y instead extra-y for objects placed at the head
kbuild: hide error checker logs for V=1 builds
kbuild: re-run modpost when it is updated
kbuild: unify two modpost invocations
kbuild: move vmlinux.o rule to the top Makefile
kbuild: move .vmlinux.objs rule to Makefile.modpost
kbuild: list sub-directories in ./Kbuild
Makefile.compiler: replace cc-ifversion with compiler-specific macros
...
- PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2)
feature support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information,
if available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on
AMD CPUs by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
- HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs and
thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot()
and fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/2pMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iIMA/+J+MCEVTt9kwZeBtHoPX7iZ5gnq1+McoQ
f6ALX19AO/ZSuA7EBA3cS3Ny5eyGy3ofYUnRW+POezu9CpflLW/5N27R2qkZFrWC
A09B86WH676ZrmXt+oI05rpZ2y/NGw4gJxLLa4/bWF3g9xLfo21i+YGKwdOnNFpl
DEdCVHtjlMcOAU3+on6fOYuhXDcYd7PKGcCfLE7muOMOAtwyj0bUDBt7m+hneZgy
qbZHzDU2DZ5L/LXiMyuZj5rC7V4xUbfZZfXglG38YCW1WTieS3IjefaU2tREhu7I
rRkCK48ULDNNJR3dZK8IzXJRxusq1ICPG68I+nm/K37oZyTZWtfYZWehW/d/TnPa
tUiTwimabz7UUqaGq9ZptxwINcAigax0nl6dZ3EseeGhkDE6j71/3kqrkKPz4jth
+fCwHLOrI3c4Gq5qWgPvqcUlUneKf3DlOMtzPKYg7sMhla2zQmFpYCPzKfm77U/Z
BclGOH3FiwaK6MIjPJRUXTePXqnUseqCR8PCH/UPQUeBEVHFcMvqCaa15nALed8x
dFi76VywR9mahouuLNq6sUNePlvDd2B124PygNwegLlBfY9QmKONg9qRKOnQpuJ6
UprRJjLOOucZ/N/jn6+ShHkqmXsnY2MhfUoHUoMQ0QAI+n++e+2AuePo251kKWr8
QlqKxd9PMQU=
=LcGg
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
"PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature
support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information, if
available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs
by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs
and thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
per-CPU rwsem that is read-locked during most of the key
operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and
fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements"
* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
perf: Fix pmu_filter_match()
perf: Fix lockdep_assert_event_ctx()
perf/x86/amd/lbr: Adjust LBR regardless of filtering
perf/x86/utils: Fix uninitialized var in get_branch_type()
perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
perf/x86/amd: Support PERF_SAMPLE_ADDR
perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
perf/x86/uncore: Add new Raptor Lake S support
perf/x86/cstate: Add new Raptor Lake S support
perf/x86/msr: Add new Raptor Lake S support
perf/x86: Add new Raptor Lake S support
bpf: Check flags for branch stack in bpf_read_branch_records helper
perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
perf: Use sample_flags for raw_data
perf: Use sample_flags for addr
...
The objects placed at the head of vmlinux need special treatments:
- arch/$(SRCARCH)/Makefile adds them to head-y in order to place
them before other archives in the linker command line.
- arch/$(SRCARCH)/kernel/Makefile adds them to extra-y instead of
obj-y to avoid them going into built-in.a.
This commit gets rid of the latter.
Create vmlinux.a to collect all the objects that are unconditionally
linked to vmlinux. The objects listed in head-y are moved to the head
of vmlinux.a by using 'ar m'.
With this, arch/$(SRCARCH)/kernel/Makefile can consistently use obj-y
for builtin objects.
There is no *.o that is directly linked to vmlinux. Drop unneeded code
in scripts/clang-tools/gen_compile_commands.py.
$(AR) mPi needs 'T' to workaround the llvm-ar bug. The fix was suggested
by Nathan Chancellor [1].
[1]: https://lore.kernel.org/llvm/YyjjT5gQ2hGMH0ni@dev-arch.thelio-3990X/
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Use the new sample_flags to indicate whether the raw data field is
filled by the PMU driver. Although it could check with the NULL,
follow the same rule with other fields.
Remove the raw field from the perf_sample_data_init() to minimize
the number of cache lines touched.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220921220032.2858517-2-namhyung@kernel.org
PMU device driver perf_paiext supports Processor Activity
Instrumentation Extension (PAIE1), available with IBM z16:
- maps a 512 byte block to lowcore address 0x1508 called PAIE1 control
block.
- maps a 1024 byte block at PAIE1 control block entry with index 2.
- uses control register bit 14 to enable PAIE1 control block lookup.
- turn PAIE1 nnpa counting on and off by setting bit 63 in
PAIE1 control block entry with index 2.
- creates a sample with raw data on each context switch out when
at context switch some mapped counters have a value of nonzero.
This device driver only supports CPU wide context, no task context
is allowed.
Support for counting:
- one or more counters can be specified using
perf stat -e pai_ext/xxx/
where xxx stands for the counter event name. Multiple invocation
of this command is possible. The counter names are listed in
/sys/devices/pai_ext/events directory.
- one special counters can be specified using
perf stat -e pai_ext/NNPA_ALL/
which returns the sum of all incremented nnpa counters.
- multiple counting events can run in parallel.
Support for Sampling:
- one event pai_ext/NNPA_ALL/ is reserved for sampling.
The event collects data at context switch out and saves them in
the ring buffer.
- no multiple invocations are possible.
The PAIE1 nnpa counter events are system wide. No task context is
supported. Therefore some restrictions documented in function
paiext_busy() apply.
Extend qpaci assembly instruction to query supported memory mapped nnpa
counters. It returns the number of counters (no holes allowed in that
range).
PAIE1 nnpa counter events can not be created when a CPU hot plug
add is processed. This means a CPU hot plug add does not get
the necessary PAIE1 event to record PAIE1 nnpa counter increments
on the newly added CPU. CPU hot plug remove removes the event and
terminates the counting of PAIE1 counters immediately.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Uninline copy_oldmem_kernel() function and make it consistent
with a very similar memcpy_real() implementation, by moving
to code to crash_dump.c, where it actually belongs.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Function memcpy_real() is an univeral data mover that does not
require DAT mode to be able reading from a physical address.
Its advantage is an ability to read from any address, even
those for which no kernel virtual mapping exists.
Although memcpy_real() is interrupt-safe, there are no handlers
that make use of this function. The compiler instrumentation
have to be disabled and separate no-DAT stack used to allow
execution of the function once DAT mode is disabled.
Rework memcpy_real() to overcome these shortcomings. As result,
data copying (which is primarily reading out a crashed system
memory by a user process) is executed on a regular stack with
enabled interrupts. Also, use of memcpy_real_buf swap buffer
becomes unnecessary and the swapping is eliminated.
The above is achieved by using a fixed virtual address range
that spans a single page and remaps that page repeatedly when
memcpy_real() is called for a particular physical address.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Function smp_save_dump_cpus() collects CPU state of a crashed
system for secondary CPUs and for the IPL CPU very differently.
The Signal Processor stop-and-store-status orders are used for
the former while Hardware System Area requests and memcpy_real()
routine are called for the latter. In addition a system reset is
triggered, which pins smp_save_dump_cpus() function call before
CPU and device initialization.
Move the collection of IPL CPU state to a later stage when DAT
becomes available. That is needed to allow a follow-up rework of
memcpy_real() routine.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Temporary unsetting of the prefix page in memcpy_absolute() routine
poses a risk of executing code path with unexpectedly disabled prefix
page. This rework avoids the prefix page uninstalling and disabling
of normal and machine check interrupts when accessing the absolute
zero memory.
Although memcpy_absolute() routine can access the whole memory, it is
only used to update the absolute zero lowcore. This rework therefore
introduces a new mechanism for the absolute zero lowcore access and
scraps memcpy_absolute() routine for good.
Instead, an area is reserved in the virtual memory that is used for
the absolute lowcore access only. That area holds an array of 8KB
virtual mappings - one per CPU. Whenever a CPU is brought online, the
corresponding item is mapped to the real address of the previously
installed prefix page.
The absolute zero lowcore access works like this: a CPU calls the
new primitive get_abs_lowcore() to obtain its 8KB mapping as a
pointer to the struct lowcore. Virtual address references to that
pointer get translated to the real addresses of the prefix page,
which in turn gets swapped with the absolute zero memory addresses
due to prefixing. Once the pointer is not needed it must be released
with put_abs_lowcore() primitive:
struct lowcore *abs_lc;
unsigned long flags;
abs_lc = get_abs_lowcore(&flags);
abs_lc->... = ...;
put_abs_lowcore(abs_lc, flags);
To ensure the described mechanism works large segment- and region-
table entries must be avoided for the 8KB mappings. Failure to do
so results in usage of Region-Frame Absolute Address (RFAA) or
Segment-Frame Absolute Address (SFAA) large page fields. In that
case absolute addresses would be used to address the prefix page
instead of the real ones and the prefixing would get bypassed.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently smp_reinit_ipl_cpu() is a pre-SMP early initcall.
That ensures no CPU is running in parallel, but still not
enough to assume the code is exclusive, since the scheduling
is already available.
Move the function call to arch_call_rest_init() callback
to ensure no thread could be preempted and allow lockless
allocation of the kernel page tables. That is needed to
allow a follow-up rework of the absolute lowcore access
mechanism.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
As result of commit 915fea04f9 ("s390/smp: enable DAT before
CPU restart callback is called") the low-address protection bit
gets mistakenly unset in control register 0 save area of the
absolute zero memory. That area is used when manual PSW restart
happened to hit an offline CPU. In this case the low-address
protection for that CPU will be dropped.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Fixes: 915fea04f9 ("s390/smp: enable DAT before CPU restart callback is called")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Crash dump always starts on CPU0. In case CPU0 is offline the
prefix page is not installed and the absolute zero lowcore is
used. However, struct lowcore::mcesad is never assigned and
stays zero. That leads to __machine_kdump() -> save_vx_regs()
call silently stores vector registers to the absolute lowcore
at 0x11b0 offset.
Fixes: a62bc07392 ("s390/kdump: add support for vector extension")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add proper alignment for .nospec_call_table and .nospec_return_table in
vmlinux.
[hca@linux.ibm.com]: The problem with the missing alignment of the nospec
tables exist since a long time, however only since commit e6ed91fd07
("s390/alternatives: remove padding generation code") and with
CONFIG_RELOCATABLE=n the kernel may also crash at boot time.
The above named commit reduced the size of struct alt_instr by one byte,
so its new size is 11 bytes. Therefore depending on the number of cpu
alternatives the size of the __alt_instructions array maybe odd, which
again also causes that the addresses of the nospec tables will be odd.
If the address of __nospec_call_start is odd and the kernel is compiled
With CONFIG_RELOCATABLE=n the compiler may generate code that loads the
address of __nospec_call_start with a 'larl' instruction.
This will generate incorrect code since the 'larl' instruction only works
with even addresses. In result the members of the nospec tables will be
accessed with an off-by-one offset, which subsequently may lead to
addressing exceptions within __nospec_revert().
Fixes: f19fbd5ed6 ("s390: introduce execute-trampolines for branches")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/8719bf1ce4a72ebdeb575200290094e9ce047bcc.1661557333.git.jpoimboe@kernel.org
Cc: <stable@vger.kernel.org> # 4.16
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The pointers for guarded storage and runtime instrumentation control
blocks are stored in the thread_struct of the associated task. These
pointers are initially copied on fork() via arch_dup_task_struct()
and then cleared via copy_thread() before fork() returns. If fork()
happens to fail after the initial task dup and before copy_thread(),
the newly allocated task and associated thread_struct memory are
freed via free_task() -> arch_release_task_struct(). This results in
a double free of the guarded storage and runtime info structs
because the fields in the failed task still refer to memory
associated with the source task.
This problem can manifest as a BUG_ON() in set_freepointer() (with
CONFIG_SLAB_FREELIST_HARDENED enabled) or KASAN splat (if enabled)
when running trinity syscall fuzz tests on s390x. To avoid this
problem, clear the associated pointer fields in
arch_dup_task_struct() immediately after the new task is copied.
Note that the RI flag is still cleared in copy_thread() because it
resides in thread stack memory and that is where stack info is
copied.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Fixes: 8d9047f8b9 ("s390/runtime instrumentation: simplify task exit handling")
Fixes: 7b83c6297d ("s390/guarded storage: simplify task exit handling")
Cc: <stable@vger.kernel.org> # 4.15
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20220816155407.537372-1-bfoster@redhat.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Rework copy_oldmem_page() callback to take an iov_iter.
This includes few prerequisite updates and fixes to the
oldmem reading code.
- Rework cpufeature implementation to allow for various CPU feature
indications, which is not only limited to hardware capabilities,
but also allows CPU facilities.
- Use the cpufeature rework to autoload Ultravisor module when CPU
facility 158 is available.
- Add ELF note type for encrypted CPU state of a protected virtual CPU.
The zgetdump tool from s390-tools package will decrypt the CPU state
using a Customer Communication Key and overwrite respective notes to
make the data accessible for crash and other debugging tools.
- Use vzalloc() instead of vmalloc() + memset() in ChaCha20 crypto test.
- Fix incorrect recovery of kretprobe modified return address in stacktrace.
- Switch the NMI handler to use generic irqentry_nmi_enter() and
irqentry_nmi_exit() helper functions.
- Rework the cryptographic Adjunct Processors (AP) pass-through design
to support dynamic changes to the AP matrix of a running guest as well
as to implement more of the AP architecture.
- Minor boot code cleanups.
- Grammar and typo fixes to hmcdrv and tape drivers.
-----BEGIN PGP SIGNATURE-----
iI0EABYIADUWIQQrtrZiYVkVzKQcYivNdxKlNrRb8AUCYu4dRBccYWdvcmRlZXZA
bGludXguaWJtLmNvbQAKCRDNdxKlNrRb8DnlAP45Sk4cE35T+Z0vdHE2f0uMXE/p
uHNjS3fDZOQVFJ2jZwEA99xPF5qPCttbR/b1VHsMSb30684IT1A4PC7y05kgfAw=
=jCc3
-----END PGP SIGNATURE-----
Merge tag 's390-5.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev:
- Rework copy_oldmem_page() callback to take an iov_iter.
This includes a few prerequisite updates and fixes to the oldmem
reading code.
- Rework cpufeature implementation to allow for various CPU feature
indications, which is not only limited to hardware capabilities, but
also allows CPU facilities.
- Use the cpufeature rework to autoload Ultravisor module when CPU
facility 158 is available.
- Add ELF note type for encrypted CPU state of a protected virtual CPU.
The zgetdump tool from s390-tools package will decrypt the CPU state
using a Customer Communication Key and overwrite respective notes to
make the data accessible for crash and other debugging tools.
- Use vzalloc() instead of vmalloc() + memset() in ChaCha20 crypto
test.
- Fix incorrect recovery of kretprobe modified return address in
stacktrace.
- Switch the NMI handler to use generic irqentry_nmi_enter() and
irqentry_nmi_exit() helper functions.
- Rework the cryptographic Adjunct Processors (AP) pass-through design
to support dynamic changes to the AP matrix of a running guest as
well as to implement more of the AP architecture.
- Minor boot code cleanups.
- Grammar and typo fixes to hmcdrv and tape drivers.
* tag 's390-5.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (46 commits)
Revert "s390/smp: enforce lowcore protection on CPU restart"
Revert "s390/smp: rework absolute lowcore access"
Revert "s390/smp,ptdump: add absolute lowcore markers"
s390/unwind: fix fgraph return address recovery
s390/nmi: use irqentry_nmi_enter()/irqentry_nmi_exit()
s390: add ELF note type for encrypted CPU state of a PV VCPU
s390/smp,ptdump: add absolute lowcore markers
s390/smp: rework absolute lowcore access
s390/setup: rearrange absolute lowcore initialization
s390/boot: cleanup adjust_to_uv_max() function
s390/smp: enforce lowcore protection on CPU restart
s390/tape: fix comment typo
s390/hmcdrv: fix Kconfig "its" grammar
s390/docs: fix warnings for vfio_ap driver doc
s390/docs: fix warnings for vfio_ap driver lock usage doc
s390/crash: support multi-segment iterators
s390/crash: use static swap buffer for copy_to_user_real()
s390/crash: move copy_to_user_real() to crash_dump.c
s390/zcore: fix race when reading from hardware system area
s390/crash: fix incorrect number of bytes to copy to user space
...
This reverts commit 7d06fed77b.
This introduced vmem_mutex locking from vmem_map_4k_page()
function called from smp_reinit_ipl_cpu() with interrupts
disabled. While it is a pre-SMP early initcall no other CPUs
running in parallel nor other code taking vmem_mutex on this
boot stage - it still needs to be fixed.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
* Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
* Rework of the sysreg access from userspace, with a complete
rewrite of the vgic-v3 view to allign with the rest of the
infrastructure
* Disagregation of the vcpu flags in separate sets to better track
their use model.
* A fix for the GICv2-on-v3 selftest
* A small set of cosmetic fixes
RISC-V:
* Track ISA extensions used by Guest using bitmap
* Added system instruction emulation framework
* Added CSR emulation framework
* Added gfp_custom flag in struct kvm_mmu_memory_cache
* Added G-stage ioremap() and iounmap() functions
* Added support for Svpbmt inside Guest
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
x86:
* Permit guests to ignore single-bit ECC errors
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Use try_cmpxchg64 instead of cmpxchg64
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* Allow NX huge page mitigation to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
* Miscellaneous cleanups:
** MCE MSR emulation
** Use separate namespaces for guest PTEs and shadow PTEs bitmasks
** PIO emulation
** Reorganize rmap API, mostly around rmap destruction
** Do not workaround very old KVM bugs for L0 that runs with nesting enabled
** new selftests API for CPUID
Generic:
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmLnyo4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMtQQf/XjVWiRcWLPR9dqzRM/vvRXpiG+UL
jU93R7m6ma99aqTtrxV/AE+kHgamBlma3Cwo+AcWk9uCVNbIhFjv2YKg6HptKU0e
oJT3zRYp+XIjEo7Kfw+TwroZbTlG6gN83l1oBLFMqiFmHsMLnXSI2mm8MXyi3dNB
vR2uIcTAl58KIprqNNsYJ2dNn74ogOMiXYx9XzoA9/5Xb6c0h4rreHJa5t+0s9RO
Gz7Io3PxumgsbJngjyL1Ve5oxhlIAcZA8DU0PQmjxo3eS+k6BcmavGFd45gNL5zg
iLpCh4k86spmzh8CWkAAwWPQE4dZknK6jTctJc0OFVad3Z7+X7n0E8TFrA==
=PM8o
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"Quite a large pull request due to a selftest API overhaul and some
patches that had come in too late for 5.19.
ARM:
- Unwinder implementations for both nVHE modes (classic and
protected), complete with an overflow stack
- Rework of the sysreg access from userspace, with a complete rewrite
of the vgic-v3 view to allign with the rest of the infrastructure
- Disagregation of the vcpu flags in separate sets to better track
their use model.
- A fix for the GICv2-on-v3 selftest
- A small set of cosmetic fixes
RISC-V:
- Track ISA extensions used by Guest using bitmap
- Added system instruction emulation framework
- Added CSR emulation framework
- Added gfp_custom flag in struct kvm_mmu_memory_cache
- Added G-stage ioremap() and iounmap() functions
- Added support for Svpbmt inside Guest
s390:
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to use TAP interface
- enable interpretive execution of zPCI instructions (for PCI
passthrough)
- First part of deferred teardown
- CPU Topology
- PV attestation
- Minor fixes
x86:
- Permit guests to ignore single-bit ECC errors
- Intel IPI virtualization
- Allow getting/setting pending triple fault with
KVM_GET/SET_VCPU_EVENTS
- PEBS virtualization
- Simplify PMU emulation by just using PERF_TYPE_RAW events
- More accurate event reinjection on SVM (avoid retrying
instructions)
- Allow getting/setting the state of the speaker port data bit
- Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls
are inconsistent
- "Notify" VM exit (detect microarchitectural hangs) for Intel
- Use try_cmpxchg64 instead of cmpxchg64
- Ignore benign host accesses to PMU MSRs when PMU is disabled
- Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
- Allow NX huge page mitigation to be disabled on a per-vm basis
- Port eager page splitting to shadow MMU as well
- Enable CMCI capability by default and handle injected UCNA errors
- Expose pid of vcpu threads in debugfs
- x2AVIC support for AMD
- cleanup PIO emulation
- Fixes for LLDT/LTR emulation
- Don't require refcounted "struct page" to create huge SPTEs
- Miscellaneous cleanups:
- MCE MSR emulation
- Use separate namespaces for guest PTEs and shadow PTEs bitmasks
- PIO emulation
- Reorganize rmap API, mostly around rmap destruction
- Do not workaround very old KVM bugs for L0 that runs with nesting enabled
- new selftests API for CPUID
Generic:
- Fix races in gfn->pfn cache refresh; do not pin pages tracked by
the cache
- new selftests API using struct kvm_vcpu instead of a (vm, id)
tuple"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits)
selftests: kvm: set rax before vmcall
selftests: KVM: Add exponent check for boolean stats
selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test
selftests: KVM: Check stat name before other fields
KVM: x86/mmu: remove unused variable
RISC-V: KVM: Add support for Svpbmt inside Guest/VM
RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap()
RISC-V: KVM: Add G-stage ioremap() and iounmap() functions
KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache
RISC-V: KVM: Add extensible CSR emulation framework
RISC-V: KVM: Add extensible system instruction emulation framework
RISC-V: KVM: Factor-out instruction emulation into separate sources
RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run
RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function
RISC-V: KVM: Fix variable spelling mistake
RISC-V: KVM: Improve ISA extension by using a bitmap
KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs()
KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog
KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT
KVM: x86: Do not block APIC write for non ICR registers
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmLnDOwACgkQSfxwEqXe
A65Fiw//Z0YaPejSslQIGitQ1b0XzdWBhyJArYDieaaiQRXMqlaSKlIUqHz38xb7
+FykUY51/SJLjHV2riPxq1OK3/MPmk6VlTd0HHihcHVmg77oZcFcv2tPnDpZoqND
TsBOujLbXKwxP8tNFedRY/4+K7w+ue9BTfDjuH7aCtz7uWd+4cNJmPg3x9FCfkMA
+hbcRluwE9W3Pg4OCKwv+qxL0JF3qQtNKEOp1wpnjGAZZW/I9gFNgFBEkykvcAsj
TkIRDc3agPFj6QgDeRIgLdnf9KCsLubKAg5oJneeCvQztJJUCSkn8nQXxpx+4sLo
GsRgvCdfL/GyJqfSAzQJVYDHKtKMkJiCiWCC/oOALR8dzHJfSlULDAjbY1m/DAr9
at+vi4678Or7TNx2ZSaUlCXXKZ+UT7yWMlQWax9JuxGk1hGYP5/eT1AH5SGjqUwF
w1q8oyzxt1vUcnOzEddFXPFirnqqhAk4dQFtu83+xKM4ZssMVyeB4NZdEhAdW0ng
MX+RjrVj4l5gWWuoS0Cx3LUxDCgV6WT0dN+Vl9axAZkoJJbcXLEmXwQ6NbzTLPWg
1/MT7qFTxNcTCeAArMdZvvFbeh7pOBXO42pafrK/7vDRnTMUIw9tqXNLQUfvdFQp
F5flPgiVRHDU2vSzKIFtnPTyXU0RBBGvNb4n0ss2ehH2DSsCxYE=
=Zy3d
-----END PGP SIGNATURE-----
Merge tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
"Though there's been a decent amount of RNG-related development during
this last cycle, not all of it is coming through this tree, as this
cycle saw a shift toward tackling early boot time seeding issues,
which took place in other trees as well.
Here's a summary of the various patches:
- The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot
option have been removed, as they overlapped with the more widely
supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and
"random.trust_cpu". This change allowed simplifying a bit of arch
code.
- x86's RDRAND boot time test has been made a bit more robust, with
RDRAND disabled if it's clearly producing bogus results. This would
be a tip.git commit, technically, but I took it through random.git
to avoid a large merge conflict.
- The RNG has long since mixed in a timestamp very early in boot, on
the premise that a computer that does the same things, but does so
starting at different points in wall time, could be made to still
produce a different RNG state. Unfortunately, the clock isn't set
early in boot on all systems, so now we mix in that timestamp when
the time is actually set.
- User Mode Linux now uses the host OS's getrandom() syscall to
generate a bootloader RNG seed and later on treats getrandom() as
the platform's RDRAND-like faculty.
- The arch_get_random_{seed_,}_long() family of functions is now
arch_get_random_{seed_,}_longs(), which enables certain platforms,
such as s390, to exploit considerable performance advantages from
requesting multiple CPU random numbers at once, while at the same
time compiling down to the same code as before on platforms like
x86.
- A small cleanup changing a cmpxchg() into a try_cmpxchg(), from
Uros.
- A comment spelling fix"
More info about other random number changes that come in through various
architecture trees in the full commentary in the pull request:
https://lore.kernel.org/all/20220731232428.2219258-1-Jason@zx2c4.com/
* tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
random: correct spelling of "overwrites"
random: handle archrandom with multiple longs
um: seed rng using host OS rng
random: use try_cmpxchg in _credit_init_bits
timekeeping: contribute wall clock to rng on time change
x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu"
random: remove CONFIG_ARCH_RANDOM
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQQdXVVFGN5XqKr1Hj7LwZzRsCrn5QUCYulqTBQcem9oYXJAbGlu
dXguaWJtLmNvbQAKCRDLwZzRsCrn5SBBAP9nbAW1SPa/hDqbrclHdDrS59VkSVwv
6ZO2yAmxJAptHwD+JzyJpJiZsqVN/Tu85V1PqeAt9c8az8f3CfDBp2+w7AA=
=Ad+c
-----END PGP SIGNATURE-----
Merge tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"Aside from the one EVM cleanup patch, all the other changes are kexec
related.
On different architectures different keyrings are used to verify the
kexec'ed kernel image signature. Here are a number of preparatory
cleanup patches and the patches themselves for making the keyrings -
builtin_trusted_keyring, .machine, .secondary_trusted_keyring, and
.platform - consistent across the different architectures"
* tag 'integrity-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
kexec, KEYS, s390: Make use of built-in and secondary keyring for signature verification
arm64: kexec_file: use more system keyrings to verify kernel image signature
kexec, KEYS: make the code in bzImage64_verify_sig generic
kexec: clean up arch_kexec_kernel_verify_sig
kexec: drop weak attribute from functions
kexec_file: drop weak attribute from functions
evm: Use IS_ENABLED to initialize .enabled
- lockdep: Fix a handful of the more complex lockdep_init_map_*() primitives
that can lose the lock_type & cause false reports. No such mishap was
observed in the wild.
- jump_label improvements: simplify the cross-arch support of
initial NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn3jERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hzeg/7BTC90XeMANhTiL23iiH7dOYZwqdFeB12
VBqdaPaGC8i+mJzVAdGyPFwCFDww6Ak6P33PcHkemuIO5+DhWis8hfw5krHEOO1k
AyVSMOZuWJ8/g6ZenjgNFozQ8C+3NqURrpdqN55d7jhMazPWbsNLLqUgvSSqo6DY
Ah2O+EKrDfGNCxT6/YaTAmUryctotxafSyFDQxv3RKPfCoIIVv9b3WApYqTOqFIu
VYTPr+aAcMsU20hPMWQI4kbQaoCxFqr3bZiZtAiS/IEunqi+PlLuWjrnCUpLwVTC
+jOCkNJHt682FPKTWelUnCnkOg9KhHRujRst5mi1+2tWAOEvKltxfe05UpsZYC3b
jhzddREMwBt3iYsRn65LxxsN4AMK/C/41zjejHjZpf+Q5kwDsc6Ag3L5VifRFURS
KRwAy9ejoVYwnL7CaVHM2zZtOk4YNxPeXmiwoMJmOufpdmD1LoYbNUbpSDf+goIZ
yPJpxFI5UN8gi8IRo3DMe4K2nqcFBC3wFn8tNSAu+44gqDwGJAJL6MsLpkLSZkk8
3QN9O11UCRTJDkURjoEWPgRRuIu9HZ4GKNhiblDy6gNM/jDE/m5OG4OYfiMhojgc
KlMhsPzypSpeApL55lvZ+AzxH8mtwuUGwm8lnIdZ2kIse1iMwapxdWXWq9wQr8eW
jLWHgyZ6rcg=
=4B89
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"This was a fairly quiet cycle for the locking subsystem:
- lockdep: Fix a handful of the more complex lockdep_init_map_*()
primitives that can lose the lock_type & cause false reports. No
such mishap was observed in the wild.
- jump_label improvements: simplify the cross-arch support of initial
NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous"
* tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix lockdep_init_map_*() confusion
jump_label: make initial NOP patching the special case
jump_label: mips: move module NOP patching into arch code
jump_label: s390: avoid pointless initial NOP patching
KVM/s390, KVM/x86 and common infrastructure changes for 5.20
x86:
* Permit guests to ignore single-bit ECC errors
* Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache
* Intel IPI virtualization
* Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS
* PEBS virtualization
* Simplify PMU emulation by just using PERF_TYPE_RAW events
* More accurate event reinjection on SVM (avoid retrying instructions)
* Allow getting/setting the state of the speaker port data bit
* Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent
* "Notify" VM exit (detect microarchitectural hangs) for Intel
* Cleanups for MCE MSR emulation
s390:
* add an interface to provide a hypervisor dump for secure guests
* improve selftests to use TAP interface
* enable interpretive execution of zPCI instructions (for PCI passthrough)
* First part of deferred teardown
* CPU Topology
* PV attestation
* Minor fixes
Generic:
* new selftests API using struct kvm_vcpu instead of a (vm, id) tuple
x86:
* Use try_cmpxchg64 instead of cmpxchg64
* Bugfixes
* Ignore benign host accesses to PMU MSRs when PMU is disabled
* Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior
* x86/MMU: Allow NX huge pages to be disabled on a per-vm basis
* Port eager page splitting to shadow MMU as well
* Enable CMCI capability by default and handle injected UCNA errors
* Expose pid of vcpu threads in debugfs
* x2AVIC support for AMD
* cleanup PIO emulation
* Fixes for LLDT/LTR emulation
* Don't require refcounted "struct page" to create huge SPTEs
x86 cleanups:
* Use separate namespaces for guest PTEs and shadow PTEs bitmasks
* PIO emulation
* Reorganize rmap API, mostly around rmap destruction
* Do not workaround very old KVM bugs for L0 that runs with nesting enabled
* new selftests API for CPUID
With generic entry in place switch the nmi handler to use
the generic entry helper functions.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Temporary unsetting of the prefix page in memcpy_absolute() routine
poses a risk of executing code path with unexpectedly disabled prefix
page. This rework avoids the prefix page uninstalling and disabling
of normal and machine check interrupts when accessing the absolute
zero memory.
Although memcpy_absolute() routine can access the whole memory, it is
only used to update the absolute zero lowcore. This rework therefore
introduces a new mechanism for the absolute zero lowcore access and
scraps memcpy_absolute() routine for good.
Instead, an area is reserved in the virtual memory that is used for
the absolute lowcore access only. That area holds an array of 8KB
virtual mappings - one per CPU. Whenever a CPU is brought online, the
corresponding item is mapped to the real address of the previously
installed prefix page.
The absolute zero lowcore access works like this: a CPU calls the
new primitive get_abs_lowcore() to obtain its 8KB mapping as a
pointer to the struct lowcore. Virtual address references to that
pointer get translated to the real addresses of the prefix page,
which in turn gets swapped with the absolute zero memory addresses
due to prefixing. Once the pointer is not needed it must be released
with put_abs_lowcore() primitive:
struct lowcore *abs_lc;
unsigned long flags;
abs_lc = get_abs_lowcore(&flags);
abs_lc->... = ...;
put_abs_lowcore(abs_lc, flags);
To ensure the described mechanism works large segment- and region-
table entries must be avoided for the 8KB mappings. Failure to do
so results in usage of Region-Frame Absolute Address (RFAA) or
Segment-Frame Absolute Address (SFAA) large page fields. In that
case absolute addresses would be used to address the prefix page
instead of the real ones and the prefixing would get bypassed.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Make the absolute lowcore assignments immediately follow
the boot CPU lowcore same member assignments. This way
readability improves when reading from up to down, with
no out of order mcck stack allocation in-between.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
As result of commit 915fea04f9 ("s390/smp: enable DAT before
CPU restart callback is called") the low-address protection bit
gets mistakenly unset in control register 0 save area of the
absolute zero memory. That area is used when manual PSW restart
happened to hit an offline CPU. In this case the low-address
protection for that CPU will be dropped.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Fixes: 915fea04f9 ("s390/smp: enable DAT before CPU restart callback is called")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Pull changes that finalize switching of copy_oldmem_page() callback
to iov_iter interface. These changes were pulled in work.iov_iter of
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Make it possible to handle not only single-, but also multi-
segment iterators in copy_oldmem_iter() callback. Change the
semantics of called functions to match the iterator model -
instead of an error code the exact number of bytes copied is
returned.
The swap page used to copy data to user space is adopted for
kernel space too. That does not bring any performance impact.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Fixes: cc02e6e21a ("s390/crash: add missing iterator advance in copy_oldmem_page()")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Link: https://lore.kernel.org/r/5af6da3a0bffe48a90b0b7139ecf6a818b2d18e8.1658206891.git.agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Currently a temporary page-size buffer is allocated for copying
oldmem to user space. That limits copy_to_user_real() operation
only to stages when virtual memory is available and still makes
it possible to fail while the system is being dumped.
Instead of reallocating single page on each copy_oldmem_page()
iteration use a statically allocated buffer.
This also paves the way for a further memcpy_real() rework where
no swap buffer is needed altogether.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Link: https://lore.kernel.org/r/334ed359680c4d45dd32feb104909f610312ef0f.1658206891.git.agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Function copy_to_user_real() does not really belong to maccess.c.
It is only used for copying oldmem to user space, so let's move
it to the friends.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Link: https://lore.kernel.org/r/e8de968d40202d87caa09aef12e9c67ec23a1c1a.1658206891.git.agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The number of bytes in a chunk is correctly calculated, but instead
the total number of bytes is passed to copy_to_user_real() function.
Reported-by: Matthew Wilcox <willy@infradead.org>
Fixes: df9694c797 ("s390/dump: streamline oldmem copy functions")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Make save_area_alloc() return classic NULL on allocation failure.
The only caller smp_save_dump_cpus() does check the return value
already and panics if NULL is returned.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Make sure the uvdevice driver will be automatically loaded when
facility 158 is available.
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20220713125644.16121-4-seiden@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Allow for facility bits to be used in cpu features.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20220713125644.16121-3-seiden@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Rework cpufeature implementation to allow for various cpu feature
indications, which is not only limited to hwcap bits. This is achieved
by adding a sequential list of cpu feature numbers, where each of them
is mapped to an entry which indicates what this number is about.
Each entry contains a type member, which indicates what feature
name space to look into (e.g. hwcap, or cpu facility). If wanted this
allows also to automatically load modules only in e.g. z/VM
configurations.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220713125644.16121-2-seiden@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
When RDRAND was introduced, there was much discussion on whether it
should be trusted and how the kernel should handle that. Initially, two
mechanisms cropped up, CONFIG_ARCH_RANDOM, a compile time switch, and
"nordrand", a boot-time switch.
Later the thinking evolved. With a properly designed RNG, using RDRAND
values alone won't harm anything, even if the outputs are malicious.
Rather, the issue is whether those values are being *trusted* to be good
or not. And so a new set of options were introduced as the real
ones that people use -- CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu".
With these options, RDRAND is used, but it's not always credited. So in
the worst case, it does nothing, and in the best case, maybe it helps.
Along the way, CONFIG_ARCH_RANDOM's meaning got sort of pulled into the
center and became something certain platforms force-select.
The old options don't really help with much, and it's a bit odd to have
special handling for these instructions when the kernel can deal fine
with the existence or untrusted existence or broken existence or
non-existence of that CPU capability.
Simplify the situation by removing CONFIG_ARCH_RANDOM and using the
ordinary asm-generic fallback pattern instead, keeping the two options
that are actually used. For now it leaves "nordrand" for now, as the
removal of that will take a different route.
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
commit e23a8020ce ("s390/kexec_file: Signature verification prototype")
adds support for KEXEC_SIG verification with keys from platform keyring
but the built-in keys and secondary keyring are not used.
Add support for the built-in keys and secondary keyring as x86 does.
Fixes: e23a8020ce ("s390/kexec_file: Signature verification prototype")
Cc: stable@vger.kernel.org
Cc: Philipp Rudo <prudo@linux.ibm.com>
Cc: kexec@lists.infradead.org
Cc: keyrings@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: "Lee, Chun-Yi" <jlee@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Due to upcoming changes, it will be possible to temporarily have
multiple protected VMs in the same address space, although only one
will be actually active.
In that scenario, it is necessary to perform an export of every page
that is to be imported, since the hardware does not allow a page
belonging to a protected guest to be imported into a different
protected guest.
This also applies to pages that are shared, and thus accessible by the
host.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20220628135619.32410-7-imbrenda@linux.ibm.com
Message-Id: <20220628135619.32410-7-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
A secure storage violation is triggered when a protected guest tries to
access secure memory that has been mapped erroneously, or that belongs
to a different protected guest or to the ultravisor.
With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation. This happens for example if a
protected guest is rebooted with deferred destroy enabled and the new
guest is also protected.
When the new protected guest touches pages that have not yet been
destroyed, and thus are accounted to the previous protected guest, a
secure storage violation is raised.
This patch adds handling of secure storage violations for protected
guests.
This exception is handled by first trying to destroy the page, because
it is expected to belong to a defunct protected guest where a destroy
should be possible. Note that a secure page can only be destroyed if
its protected VM does not have any CPUs, which only happens when the
protected VM is being terminated. If that fails, a normal export of
the page is attempted.
This means that pages that trigger the exception will be made
non-secure (in one way or another) before attempting to use them again
for a different secure guest.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20220628135619.32410-3-imbrenda@linux.ibm.com
Message-Id: <20220628135619.32410-3-imbrenda@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
We have information about the supported attestation header version
and plaintext attestation flag bits.
Let's expose it via the sysfs files.
Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/lkml/20220601100245.3189993-1-seiden@linux.ibm.com/
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
s390x appears to present two RNG interfaces:
- a "TRNG" that gathers entropy using some hardware function; and
- a "DRBG" that takes in a seed and expands it.
Previously, the TRNG was wired up to arch_get_random_{long,int}(), but
it was observed that this was being called really frequently, resulting
in high overhead. So it was changed to be wired up to arch_get_random_
seed_{long,int}(), which was a reasonable decision. Later on, the DRBG
was then wired up to arch_get_random_{long,int}(), with a complicated
buffer filling thread, to control overhead and rate.
Fortunately, none of the performance issues matter much now. The RNG
always attempts to use arch_get_random_seed_{long,int}() first, which
means a complicated implementation of arch_get_random_{long,int}() isn't
really valuable or useful to have around. And it's only used when
reseeding, which means it won't hit the high throughput complications
that were faced before.
So this commit returns to an earlier design of just calling the TRNG in
arch_get_random_seed_{long,int}(), and returning false in arch_get_
random_{long,int}().
Part of what makes the simplification possible is that the RNG now seeds
itself using the TRNG at bootup. But this only works if the TRNG is
detected early in boot, before random_init() is called. So this commit
also causes that check to happen in setup_arch().
Cc: stable@vger.kernel.org
Cc: Harald Freudenberger <freude@linux.ibm.com>
Cc: Ingo Franzki <ifranzki@linux.ibm.com>
Cc: Juergen Christ <jchrist@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20220610222023.378448-1-Jason@zx2c4.com
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Instead of defaulting to patching NOP opcodes at init time, and leaving
it to the architectures to override this if this is not needed, switch
to a model where doing nothing is the default. This is the common case
by far, as only MIPS requires NOP patching at init time. On all other
architectures, the correct encodings are emitted by the compiler and so
no initial patching is needed.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220615154142.1574619-4-ardb@kernel.org
MIPS is the only remaining architecture that needs to patch jump label
NOP encodings to initialize them at load time. So let's move the module
patching part of that from generic code into arch/mips, and drop it from
the others.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220615154142.1574619-3-ardb@kernel.org
Patching NOPs into other NOPs at boot time serves no purpose, so let's
use the same NOP encodings at compile time and runtime.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220615154142.1574619-2-ardb@kernel.org
Two different events such as pai_crypto/KM_AES_128/ and
pai_crypto/KM_AES_192/ can be installed multiple times on the same CPU
and the events are executed concurrently:
# perf stat -e pai_crypto/KM_AES_128/ -C0 -a -- sleep 5 &
# sleep 2
# perf stat -e pai_crypto/KM_AES_192/ -C0 -a -- true
This results in the first event being installed two times with two seconds
delay. The kernel does install the second event after the first
event has been deleted and re-added, as can be seen in the traces:
13:48:47.600350 paicrypt_start event 0x1007 (event KM_AES_128)
13:48:49.599359 paicrypt_stop event 0x1007 (event KM_AES_128)
13:48:49.599198 paicrypt_start event 0x1007
13:48:49.599199 paicrypt_start event 0x1008
13:48:49.599921 paicrypt_event_destroy event 0x1008
13:48:52.601507 paicrypt_event_destroy event 0x1007
This is caused by functions event_sched_in() and event_sched_out() which
call the PMU's add() and start() functions on schedule_in and the PMU's
stop() and del() functions on schedule_out. This is correct for events
attached to processes. The pai_crypto events are system-wide events
and not attached to processes.
Since the kernel common code can not be changed easily, fix this issue
and do not reset the event count value to zero each time the event is
added and started. Instead use a flag and zero the event count value
only when called immediately after the event has been initialized.
Therefore only the first invocation of the the event's add() function
initializes the event count value to zero. The following invocations
of the event's add() function leave the current event count value
untouched.
Fixes: 39d62336f5 ("s390/pai: add support for cryptography counters")
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
The pai_crypto PMU has to check the event number. It has to be in
the supported range. This is not the case, the lower limit is not
checked. Fix this and obey the lower limit.
Fixes: 39d62336f5 ("s390/pai: add support for cryptography counters")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Events CPU_CYCLES and INSTRUCTIONS can be submitted with two different
perf_event attribute::type values:
- PERF_TYPE_HARDWARE: when invoked via perf tool predefined events name
cycles or cpu-cycles or instructions.
- pmu->type: when invoked via perf tool event name cpu_cf/CPU_CYLCES/ or
cpu_cf/INSTRUCTIONS/. This invocation also selects the PMU to which
the event belongs.
Handle both type of invocations identical for events CPU_CYLCES and
INSTRUCTIONS. They address the same hardware.
The result is different when event modifier exclude_kernel is also set.
Invocation with event modifier for user space event counting fails.
Output before:
# perf stat -e cpum_cf/cpu_cycles/u -- true
Performance counter stats for 'true':
<not supported> cpum_cf/cpu_cycles/u
0.000761033 seconds time elapsed
0.000076000 seconds user
0.000725000 seconds sys
#
Output after:
# perf stat -e cpum_cf/cpu_cycles/u -- true
Performance counter stats for 'true':
349,613 cpum_cf/cpu_cycles/u
0.000844143 seconds time elapsed
0.000079000 seconds user
0.000800000 seconds sys
#
Fixes: 6a82e23f45 ("s390/cpumf: Adjust registration of s390 PMU device drivers")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
[agordeev@linux.ibm.com corrected commit ID of Fixes commit]
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Callback copy_oldmem_page() returns either error code or zero.
Instead, it should return the error code or number of bytes copied.
Fixes: df9694c797 ("s390/dump: streamline oldmem copy functions")
Reviewed-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
In case old memory was successfully copied the passed iterator
should be advanced as well. Currently copy_oldmem_page() is
always called with single-segment iterator. Should that ever
change - copy_oldmem_user and copy_oldmem_kernel() functions
would need a rework to deal with multi-segment iterators.
Fixes: 5d8de293c2 ("vmcore: convert copy_oldmem_page() to take an iov_iter")
Reviewed-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Tested-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to show tests
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+SKTgaM0CPnbq/vKEXu8gLWmHHwFAmKXf2wACgkQEXu8gLWm
HHzu1Q//WjEuOX5nBjklMUlDB2oB2+vFSyW9lE7x9m38EnFTH8QTfH695ChVoNN+
j06Fhd4ENjxqTTYs7z67tP4TSQ/LhB/GsPydKCEOnB/63+k2cnYeS3wsv19213F0
IyvpN6MkzxoktV4m1EtKhlvXGpEBoXZCczgBLj3FYlNQ7kO8RsSkF9rOnhuP9Yjh
l2876bWHWlbU0qWmRSAu0spkwHWjtyh/bnQKzXotQyrQ9bo1yMQvhe2HH8HVTSio
cjRlseWVi01rJKzKcs6D7MFMctLKr5y0onxBgGJnRh27KoBY195ICH2Jz2LfJoor
EP57YcXZqfxzKCGHTGgVYMgFeixX6nzBgqTpDIHMQzvoM1IrQKl+d5riepO03xpS
gZxHtJqZi8s+t8w0ZFBHj83VXkzFyLuCIeui9vo3cQ00K7bBrNUSw1BAdqT5HTzW
K2R4jSQaszjw8mDz3R3G1+yg6PjMS6cDEU1+G2Id7xSYTV3lJnBDVzas7aEUNCC4
LzIrD5c4dscyZzIjAp9huVwpZoCNLy6jtecRTaGhA2YiE0VMWtJlMJHwbShlSnM7
5VhEn859namvoYtN8XBaTFa/jRDOxO+LHWuOy172oaBUgaVHBjZQLyrlit1FRQvT
SVruCmgtJ7u7RD/8uVDfPNR05DTSWQYzklJoKx2avKZj5FIx7ms=
=/6Ue
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: pvdump and selftest improvements
- add an interface to provide a hypervisor dump for secure guests
- improve selftests to show tests
ordinary user mode tasks.
In commit 40966e316f ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turned made a use after free of
struct kthread possible.
The commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple enough
to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensuere the closing of all it's file descriptors
was complete, and flush_delayed_fput does not work in a userspace thread.
I have looked and looked and more complications and in my code review
I have not found any, and neither has anyone else with the code sitting
in linux-next.
Link: https://lkml.kernel.org/r/87mtfu4up3.fsf@email.froward.int.ebiederm.org
Eric W. Biederman (8):
kthread: Don't allocate kthread_struct for init and umh
fork: Pass struct kernel_clone_args into copy_thread
fork: Explicity test for idle tasks in copy_thread
fork: Generalize PF_IO_WORKER handling
init: Deal with the init process being a user mode process
fork: Explicitly set PF_KTHREAD
fork: Stop allowing kthreads to call execve
sched: Update task_tick_numa to ignore tasks without an mm
arch/alpha/kernel/process.c | 13 ++++++------
arch/arc/kernel/process.c | 13 ++++++------
arch/arm/kernel/process.c | 12 ++++++-----
arch/arm64/kernel/process.c | 12 ++++++-----
arch/csky/kernel/process.c | 15 ++++++-------
arch/h8300/kernel/process.c | 10 ++++-----
arch/hexagon/kernel/process.c | 12 ++++++-----
arch/ia64/kernel/process.c | 15 +++++++------
arch/m68k/kernel/process.c | 12 ++++++-----
arch/microblaze/kernel/process.c | 12 ++++++-----
arch/mips/kernel/process.c | 13 ++++++------
arch/nios2/kernel/process.c | 12 ++++++-----
arch/openrisc/kernel/process.c | 12 ++++++-----
arch/parisc/kernel/process.c | 18 +++++++++-------
arch/powerpc/kernel/process.c | 15 +++++++------
arch/riscv/kernel/process.c | 12 ++++++-----
arch/s390/kernel/process.c | 12 ++++++-----
arch/sh/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_32.c | 12 ++++++-----
arch/sparc/kernel/process_64.c | 12 ++++++-----
arch/um/kernel/process.c | 15 +++++++------
arch/x86/include/asm/fpu/sched.h | 2 +-
arch/x86/include/asm/switch_to.h | 8 +++----
arch/x86/kernel/fpu/core.c | 4 ++--
arch/x86/kernel/process.c | 18 +++++++++-------
arch/xtensa/kernel/process.c | 17 ++++++++-------
fs/exec.c | 8 ++++---
include/linux/sched/task.h | 8 +++++--
init/initramfs.c | 2 ++
init/main.c | 2 +-
kernel/fork.c | 46 +++++++++++++++++++++++++++++++++-------
kernel/sched/fair.c | 2 +-
kernel/umh.c | 6 +++---
33 files changed, 234 insertions(+), 160 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaR/MACgkQC/v6Eiaj
j0Aayg/7Bx66872d9c6igkJ+MPCTuh+v9QKCGwiYEmiU4Q5sVAFB0HPJO27qC14u
630X0RFNZTkPzNNEJNIW4kw6Dj8s8YRKf+FgQAVt4SzdRwT7eIPDjk1nGraopPJ3
O04pjvuTmUyidyViRyFcf2ptx/pnkrwP8jUSc+bGTgfASAKAgAokqKE5ecjewbBc
Y/EAkQ6QW7KxPjeSmpAHwI+t3BpBev9WEC4PbhRhsBCQFO2+PJiklvqdhVNBnIjv
qUezll/1xv9UYgniB15Q4Nb722SmnWSU3r8as1eFPugzTHizKhufrrpyP+KMK1A0
tdtEJNs5t2DZF7ZbGTFSPqJWmyTYLrghZdO+lOmnaSjHxK4Nda1d4NzbefJ0u+FE
tutewowvHtBX6AFIbx+H3O+DOJM2IgNMf+ReQDU/TyNyVf3wBrTbsr9cLxypIJIp
zze8npoLMlB7B4yxVo5ES5e63EXfi3iHl0L3/1EhoGwriRz1kWgVLUX/VZOUpscL
RkJHsW6bT8sqxPWAA5kyWjEN+wNR2PxbXi8OE4arT0uJrEBMUgDCzydzOv5tJB00
mSQdytxH9LVdsmxBKAOBp5X6WOLGA4yb1cZ6E/mEhlqXMpBDF1DaMfwbWqxSYi4q
sp5zU3SBAW0qceiZSsWZXInfbjrcQXNV/DkDRDO9OmzEZP4m1j0=
=x6fy
-----END PGP SIGNATURE-----
Merge tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull kthread updates from Eric Biederman:
"This updates init and user mode helper tasks to be ordinary user mode
tasks.
Commit 40966e316f ("kthread: Ensure struct kthread is present for
all kthreads") caused init and the user mode helper threads that call
kernel_execve to have struct kthread allocated for them. This struct
kthread going away during execve in turned made a use after free of
struct kthread possible.
Here, commit 343f4c49f2 ("kthread: Don't allocate kthread_struct for
init and umh") is enough to fix the use after free and is simple
enough to be backportable.
The rest of the changes pass struct kernel_clone_args to clean things
up and cause the code to make sense.
In making init and the user mode helpers tasks purely user mode tasks
I ran into two complications. The function task_tick_numa was
detecting tasks without an mm by testing for the presence of
PF_KTHREAD. The initramfs code in populate_initrd_image was using
flush_delayed_fput to ensuere the closing of all it's file descriptors
was complete, and flush_delayed_fput does not work in a userspace
thread.
I have looked and looked and more complications and in my code review
I have not found any, and neither has anyone else with the code
sitting in linux-next"
* tag 'kthread-cleanups-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
sched: Update task_tick_numa to ignore tasks without an mm
fork: Stop allowing kthreads to call execve
fork: Explicitly set PF_KTHREAD
init: Deal with the init process being a user mode process
fork: Generalize PF_IO_WORKER handling
fork: Explicity test for idle tasks in copy_thread
fork: Pass struct kernel_clone_args into copy_thread
kthread: Don't allocate kthread_struct for init and umh
- Add Eric Farman as maintainer for s390 virtio drivers.
- Improve machine check handling, and avoid incorrectly injecting a machine
check into a kvm guest.
- Add cond_resched() call to gmap page table walker in order to avoid
possible huge latencies. Also use non-quiesing sske instruction to speed
up storage key handling.
- Add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP so s390 behaves similar like
common code.
- Get sie control block address from correct stack slot in perf event
code. This fixes potential random memory accesses.
- Change uaccess code so that the exception handler sets the result of
get_user() and __get_kernel_nofault() to zero in case of a fault. Until
now this was done via input parameters for inline assemblies. Doing it
via fault handling is what most or even all other architectures are
doing.
- Couple of other small cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmKZ2IMACgkQIg7DeRsp
bsIXgg/+JeKOFIfwWWZuk2/2drITjQxWL6OwAV/vHVL9bYVGGRgFHTfF7Qslr2Hb
j/kh55hajjMt00bVS6P52I7cC+xB6PcxPK4jF+u2FXcL187QnWZg7VD7qkFO3E7Q
G5jiWub0eNfd3ijytzSO1yLwv3Rh6GEIOi7lRkk88Fe2B9l0AaXbvnr7rrIoG1SS
TCPUoCCNKEH+xPmujdN5B6CDK2ldukcPZHtAJ9Qxu6DWAWIxh+hHr/c1zW9/7kDj
Vogc3gcgApeXTsMZu38c2tFiv6wxvg37cMa0EW+l5zkyeFn+a1CLSxA/qPi0N1UY
pFcxrlWXWshr7vKn16lJoCe1nBVyfb9ohToLKQtnWB7RZ86lP79tVjOpiQ4qi+jl
54yx/rdtycEDaC0w4ab5kfZPc9/EeaY4ppaOLjh4+r/Ve3x+EMcAZ9Q2Ei3ltRBy
75kswnRvHDWMbtS8rNecdk29QNRvLpnzpGLWQ4wGCV7V9FzTtJwO5mdjuibhSoI8
9dZw4/HGMeA2P1pAE/d1YqKbdgqlNzDyU1dYpk6PVBI0I9mUu36eZXQnjrBN++ki
8TWQYkBv6vnUHJmkfEK7B/thYwljzlQJSje4ebj0aPWzBJ9An+U5ozoANlAu97MR
/EVZM67snrT6aSOjhF6SNjWqMGBOlMVecF+GCsVrFEeah+qrYq4=
=BOi4
-----END PGP SIGNATURE-----
Merge tag 's390-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
"Just a couple of small improvements, bug fixes and cleanups:
- Add Eric Farman as maintainer for s390 virtio drivers.
- Improve machine check handling, and avoid incorrectly injecting a
machine check into a kvm guest.
- Add cond_resched() call to gmap page table walker in order to avoid
possible huge latencies. Also use non-quiesing sske instruction to
speed up storage key handling.
- Add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP so s390 behaves
similar like common code.
- Get sie control block address from correct stack slot in perf event
code. This fixes potential random memory accesses.
- Change uaccess code so that the exception handler sets the result
of get_user() and __get_kernel_nofault() to zero in case of a
fault. Until now this was done via input parameters for inline
assemblies. Doing it via fault handling is what most or even all
other architectures are doing.
- Couple of other small cleanups and fixes"
* tag 's390-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/stack: add union to reflect kvm stack slot usages
s390/stack: merge empty stack frame slots
s390/uaccess: whitespace cleanup
s390/uaccess: use __noreturn instead of __attribute__((noreturn))
s390/uaccess: use exception handler to zero result on get_user() failure
s390/uaccess: use symbolic names for inline assembler operands
s390/mcck: isolate SIE instruction when setting CIF_MCCK_GUEST flag
s390/mm: use non-quiescing sske for KVM switch to keyed guest
s390/gmap: voluntarily schedule during key setting
MAINTAINERS: Update s390 virtio-ccw
s390/kexec: add __GFP_NORETRY to KEXEC_CONTROL_MEMORY_GFP
s390/Kconfig.debug: fix indentation
s390/Kconfig: fix indentation
s390/perf: obtain sie_block from the right address
s390: generate register offsets into pt_regs automatically
s390: simplify early program check handler
s390/crypto: fix scatterwalk_unmap() callers in AES-GCM
The new dump feature requires us to know how much memory is needed for
the "dump storage state" and "dump finalize" ultravisor call. These
values are reported via the UV query call.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20220517163629.3443-3-frankja@linux.ibm.com
Message-Id: <20220517163629.3443-3-frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
We have information about the supported se header version and pcf bits
so let's expose it via the sysfs files.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Link: https://lore.kernel.org/r/20220517163629.3443-2-frankja@linux.ibm.com
Message-Id: <20220517163629.3443-2-frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Add a union which describes how the empty stack slots are being used
by kvm and perf. This should help to avoid another bug like the one
which was fixed with commit c9bfb460c3 ("s390/perf: obtain sie_block
from the right address").
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Tested-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Merge empty1 and empty2 arrays within the stack frame to one single
array. This is possible since with commit 42b01a553a ("s390: always
use the packed stack layout") the alternative stack frame layout is
gone.
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit d768bd892f ("s390: add options to change branch prediction
behaviour for the kernel") introduced .Lsie_exit label - supposedly
to fence off SIE instruction. However, the corresponding address
range length .Lsie_crit_mcck_length was not updated, which led to
BPON code potentionally marked with CIF_MCCK_GUEST flag.
Both .Lsie_exit and .Lsie_crit_mcck_length were removed with commit
0b0ed657fe ("s390: remove critical section cleanup from entry.S"),
but the issue persisted - currently BPOFF and BPENTER macros might
get wrongly considered by the machine check handler as a guest.
Fixes: d768bd892f ("s390: add options to change branch prediction behaviour for the kernel")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- The majority of the changes are for fixes and clean ups.
Noticeable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to having it
embedded in initram). This is useful for embedded boards without initram
disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use more than
59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset>
instead of using the name of the function before it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYpOgXRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qjkKAQDbpemxvpFyJlZqT8KgEIXubu+ag2/q
p0XDHaPS0zF9OQEAjTxg6GMEbnFYl6fzxZtOoEbiaQ7ppfdhRI8t6sSMVA8=
=+nDD
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"The majority of the changes are for fixes and clean ups.
Notable changes:
- Rework trace event triggers code to be easier to interact with.
- Support for embedding bootconfig with the kernel (as suppose to
having it embedded in initram). This is useful for embedded boards
without initram disks.
- Speed up boot by parallelizing the creation of tracefs files.
- Allow absolute ring buffer timestamps handle timestamps that use
more than 59 bits.
- Added new tracing clock "TAI" (International Atomic Time)
- Have weak functions show up in available_filter_function list as:
__ftrace_invalid_address___<invalid-offset> instead of using the
name of the function before it"
* tag 'trace-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (52 commits)
ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function
tracing: Fix comments for event_trigger_separate_filter()
x86/traceponit: Fix comment about irq vector tracepoints
x86,tracing: Remove unused headers
ftrace: Clean up hash direct_functions on register failures
tracing: Fix comments of create_filter()
tracing: Disable kcov on trace_preemptirq.c
tracing: Initialize integer variable to prevent garbage return value
ftrace: Fix typo in comment
ftrace: Remove return value of ftrace_arch_modify_*()
tracing: Cleanup code by removing init "char *name"
tracing: Change "char *" string form to "char []"
tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQ
tracing/timerlat: Print stacktrace in the IRQ handler if needed
tracing/timerlat: Notify IRQ new max latency only if stop tracing is set
kprobes: Fix build errors with CONFIG_KRETPROBES=n
tracing: Fix return value of trace_pid_write()
tracing: Fix potential double free in create_var_ref()
tracing: Use strim() to remove whitespace instead of doing it manually
ftrace: Deal with error return code of the ftrace_process_locs() function
...
subsystems. Most notably some maintenance work in ocfs2 and initramfs.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo/6xQAKCRDdBJ7gKXxA
jkD9AQCPczLBbRWpe1edL+5VHvel9ePoHQmvbHQnufdTh9rB5QEAu0Uilxz4q9cx
xSZypNhj2n9f8FCYca/ZrZneBsTnAA8=
=AJEO
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"The non-MM patch queue for this merge window.
Not a lot of material this cycle. Many singleton patches against
various subsystems. Most notably some maintenance work in ocfs2
and initramfs"
* tag 'mm-nonmm-stable-2022-05-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (65 commits)
kcov: update pos before writing pc in trace function
ocfs2: dlmfs: fix error handling of user_dlm_destroy_lock
ocfs2: dlmfs: don't clear USER_LOCK_ATTACHED when destroying lock
fs/ntfs: remove redundant variable idx
fat: remove time truncations in vfat_create/vfat_mkdir
fat: report creation time in statx
fat: ignore ctime updates, and keep ctime identical to mtime in memory
fat: split fat_truncate_time() into separate functions
MAINTAINERS: add Muchun as a memcg reviewer
proc/sysctl: make protected_* world readable
ia64: mca: drop redundant spinlock initialization
tty: fix deadlock caused by calling printk() under tty_port->lock
relay: remove redundant assignment to pointer buf
fs/ntfs3: validate BOOT sectors_per_clusters
lib/string_helpers: fix not adding strarray to device's resource list
kernel/crash_core.c: remove redundant check of ck_cmdline
ELF, uapi: fixup ELF_ST_TYPE definition
ipc/mqueue: use get_tree_nodev() in mqueue_get_tree()
ipc: update semtimedop() to use hrtimer
ipc/sem: remove redundant assignments
...
All instances of the function ftrace_arch_modify_prepare() and
ftrace_arch_modify_post_process() return zero. There's no point in
checking their return value. Just have them be void functions.
Link: https://lkml.kernel.org/r/20220518023639.4065-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Since commit 1179f170b6 ("s390: fix fpu restore in entry.S"), the
sie_block pointer is located at empty1[1], but in sie_block() it was
taken from empty1[0].
This leads to a random pointer being dereferenced, possibly causing
system crash.
This problem can be observed when running a simple guest with an endless
loop and recording the cpu-clock event:
sudo perf kvm --guestvmlinux=<guestkernel> --guest top -e cpu-clock
With this fix, the correct guest address is shown.
Fixes: 1179f170b6 ("s390: fix fpu restore in entry.S")
Cc: stable@vger.kernel.org
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use asm offsets method to generate register offsets into pt_regs,
instead of open-coding at several places.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Due to historic reasons the base program check handler calls a
configurable function. Given that there is only the early program
check handler left, simplify the code by directly calling that
function.
The only other user was removed with commit d485235b00 ("s390:
assume diag308 set always works").
Also rename all functions and the asm file to reflect this.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
llvm's integrated assembler cannot handle immediate values which are
calculated with two local labels:
<instantiation>:3:13: error: invalid operand for instruction
clgfi %r14,.Lsie_done - .Lsie_gmap
Workaround this by adding clang specific code which reads the specific
value from memory. Since this code is within the hot paths of the kernel
and adds an additional memory reference, keep the original code, and add
ifdef'ed code.
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20220511120532.2228616-5-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
clang fails to handle ".if" statements in inline assembly which are heavily
used in the alternatives code.
To work around this remove this code, and enforce that users of
alternatives must specify original and alternative instruction sequences
which have identical sizes. Add a compile time check with two ".org"
statements similar to arm64.
In result not only clang can handle this, but also quite a lot of code can
be removed.
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1356
Link: https://lore.kernel.org/r/20220511120532.2228616-3-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Explicitly provide identical sized original/alternative instruction
sequences. This way there is no need for the s390 specific alternatives
infrastructure to generate padding sequences.
The code which generates such sequences will be removed with a follow on
patch.
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220511120532.2228616-2-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Export the extended counter set counters of the IBM z16 via sysfs.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
clock_delta is declared as unsigned long in various places. However,
the clock sync delta can be negative. This would add a huge positive
offset in clock_sync_global where clock_delta is added to clk.eitod
which is a 72 bit integer. Declare it as signed long to fix this.
Cc: stable@vger.kernel.org
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The size of the TOD offset field in the stp info response is 64 bits.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
PMU device driver perf_pai_crypto supports Processor Activity
Instrumentation (PAI), available with IBM z16:
- maps a full page to lowcore address 0x1500.
- uses CR0 bit 13 to turn PAI crypto counting on and off.
- creates a sample with raw data on each context switch out when
at context switch some mapped counters have a value of nonzero.
This device driver only supports CPU wide context, no task context
is allowed.
Support for counting:
- one or more counters can be specified using
perf stat -e pai_crypto/xxx/
where xxx stands for the counter event name. Multiple invocation
of this command is possible. The counter names are listed in
/sys/devices/pai_crypto/events directory.
- one special counters can be specified using
perf stat -e pai_crypto/CRYPTO_ALL/
which returns the sum of all incremented crypto counters.
- one event pai_crypto/CRYPTO_ALL/ is reserved for sampling.
No multiple invocations are possible. The event collects data at
context switch out and saves them in the ring buffer.
Add qpaci assembly instruction to query supported memory mapped crypto
counters. It returns the number of counters (no holes allowed in that
range).
The PAI crypto counter events are system wide and can not be executed
in parallel. Therefore some restrictions documented in function
paicrypt_busy apply.
In particular event CRYPTO_ALL for sampling must run exclusive.
Only counting events can run in parallel.
PAI crypto counter events can not be created when a CPU hot plug
add is processed. This means a CPU hot plug add does not get
the necessary PAI event to record PAI cryptography counter increments
on the newly added CPU. CPU hot plug remove removes the event and
terminates the counting of PAI counters immediately.
Co-developed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Juergen Christ <jchrist@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Link: https://lore.kernel.org/r/20220504062351.2954280-3-tmricht@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add fn and fn_arg members into struct kernel_clone_args and test for
them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER).
This allows any task that wants to be a user space task that only runs
in kernel mode to use this functionality.
The code on x86 is an exception and still retains a PF_KTHREAD test
because x86 unlikely everything else handles kthreads slightly
differently than user space tasks that start with a function.
The functions that created tasks that start with a function
have been updated to set ".fn" and ".fn_arg" instead of
".stack" and ".stack_size". These functions are fork_idle(),
create_io_thread(), kernel_thread(), and user_mode_thread().
Link: https://lkml.kernel.org/r/20220506141512.516114-4-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
With io_uring we have started supporting tasks that are for most
purposes user space tasks that exclusively run code in kernel mode.
The kernel task that exec's init and tasks that exec user mode
helpers are also user mode tasks that just run kernel code
until they call kernel execve.
Pass kernel_clone_args into copy_thread so these oddball
tasks can be supported more cleanly and easily.
v2: Fix spelling of kenrel_clone_args on h8300
Link: https://lkml.kernel.org/r/20220506141512.516114-2-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Remove various declarations from former s390 specific compat system
calls which have been removed with commit fef747bab3 ("s390: use
generic UID16 implementation"). While at it clean up the whole small
header file.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
LLVM's integrated assembler reports the following error when compiling
entry.S:
<instantiation>:38:5: error: unknown token in expression
tm %r8,0x0001 # coming from user space?
The correct instruction would have been tmhh instead of tm.
The current code is doing nothing, since (with gas) it get's
translated to a tm instruction which reads from real address 8, which
again contains always zero, and therefore the conditional code is
never executed.
Note that due to the missing displacement gas translates "%r8" into
"8(%r0)".
Also code inspection reveals that this conditional code is not needed.
Therefore remove it.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The short psw definitions are contained in compat header files, however
short psws are not compat specific. Therefore move the definitions to
ptrace header file. This also gets rid of a compat header include in kvm
code.
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Patch series "Convert vmcore to use an iov_iter", v5.
For some reason several people have been sending bad patches to fix
compiler warnings in vmcore recently. Here's how it should be done.
Compile-tested only on x86. As noted in the first patch, s390 should take
this conversion a bit further, but I'm not inclined to do that work
myself.
This patch (of 3):
Instead of passing in a 'buf' and 'userbuf' argument, pass in an iov_iter.
s390 needs more work to pass the iov_iter down further, or refactor, but
I'd be more comfortable if someone who can test on s390 did that work.
It's more convenient to convert the whole of read_from_oldmem() to take an
iov_iter at the same time, so rename it to read_from_oldmem_iter() and add
a temporary read_from_oldmem() wrapper that creates an iov_iter.
Link: https://lkml.kernel.org/r/20220408090636.560886-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20220408090636.560886-2-bhe@redhat.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As demonstrated by commit 74bdf7815d ("genirq: Speedup
show_interrupts()"), irq_desc can be accessed safely in RCU read section.
Hence here resorting to rcu read lock to get rid of irq_lock_sparse().
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Link: https://lore.kernel.org/r/20220422100212.22666-1-kernelfans@gmail.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Randomize the address of vdso if randomize_va_space is enabled.
Note that this keeps the vdso address on the same PMD as the stack
to avoid allocating an extra page table just for vdso.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In the current code vdso is mapped below the stack. This is
problematic when programs mapped to the top of the address space
are allocating a lot of memory, because the heap will clash with
the vdso. To avoid this map the vdso above the stack and move
STACK_TOP so that it all fits into three level paging.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This is a preparation patch for adding vdso randomization to s390.
It adds a function vdso_size(), which will be used later in calculating
the STACK_TOP value. It also moves the vdso mapping into a new function
vdso_map(), to keep the code similar to other architectures.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
SPX instruction called from set_prefix() expects physical
address of the lowcore to be installed, but instead the
virtual address is passed.
Note: this does not fix a bug currently, since virtual and
physical addresses are identical.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
If the facility IPL-complete-control is present then the last diag308
call made by kexec shall set the end-of-ipl flag in the subcode register
to signal the hypervisor that this is the last diag308 call made by Linux.
Only the diag308 calls made during a regular kexec need to set
the end-of-ipl flag, in all other cases the hypervisor will ignore it.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Just use absolute_pointer() like e.g. in commit 545c272232 ("alpha:
Silence -Warray-bounds warnings") to get rid of this warning:
arch/s390/kernel/machine_kexec.c:59:9: warning: ‘memcpy’ offset [0, 511] is out of the bounds [0, 0] [-Warray-bounds]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Add kretprobes framepointer verification and return address recovery
in stacktrace.
- Support control domain masks on custom zcrypt devices and filter admin
requests.
- Cleanup timer API usage.
- Rework absolute lowcore access helpers.
- Other various small improvements and fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmJG+10ACgkQjYWKoQLX
FBjvOggAhBOcGniD4LDvyPw6Mwp6x3+bnSL0Hv+UdjyTXEsj2/XpSQeAsG807mq0
hKicEY44suVYrk6ywLCuG9tA6oaxr+t6g13Z/cuNqvdljMDFdeZw9ptIeAXfoUQb
19fxDpVyAV9iWc9VfcjwhgOXep9fk6dPDpr5clE81zuAgn94hV31tHpCbMIYDxHa
mceza1rdW5DnByCz2afYcgF72pW2hnLadOPWddbuqyVZV6jjJeZDDShAzf5MCJ6p
e91yJh6RYDT1f1Yn+htz2Bw8URb9FKRnJRMHf35h98kHT3r0x6N6VUjYiQ66CAjE
k02XBdJIwgXwgGtErj6lZsQZFjzT8g==
=5yyr
-----END PGP SIGNATURE-----
Merge tag 's390-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Add kretprobes framepointer verification and return address recovery
in stacktrace.
- Support control domain masks on custom zcrypt devices and filter
admin requests.
- Cleanup timer API usage.
- Rework absolute lowcore access helpers.
- Other various small improvements and fixes.
* tag 's390-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (26 commits)
s390/alternatives: avoid using jgnop mnemonic
s390/pci: rename get_zdev_by_bus() to zdev_from_bus()
s390/pci: improve zpci_dev reference counting
s390/smp: use physical address for SIGP_SET_PREFIX command
s390: cleanup timer API use
s390/zcrypt: fix using the correct variable for sizeof()
s390/vfio-ap: fix kernel doc and signature of group notifier functions
s390/maccess: rework absolute lowcore accessors
s390/smp: cleanup control register update routines
s390/smp: cleanup target CPU callback starting
s390/test_unwind: verify __kretprobe_trampoline is replaced
s390/unwind: avoid duplicated unwinding entries for kretprobes
s390/unwind: recover kretprobe modified return address in stacktrace
s390/kprobes: enable kretprobes framepointer verification
s390/test_unwind: extend kretprobe test
s390/ap: adjust whitespace
s390/ap: use insn format for new instructions
s390/alternatives: use insn format for new instructions
s390/alternatives: use instructions instead of byte patterns
s390/traps: improve panic message for translation-specification exception
...
- Add new environment variables, USERCFLAGS and USERLDFLAGS to allow
additional flags to be passed to user-space programs.
- Fix missing fflush() bugs in Kconfig and fixdep
- Fix a minor bug in the comment format of the .config file
- Make kallsyms ignore llvm's local labels, .L*
- Fix UAPI compile-test for cross-compiling with Clang
- Extend the LLVM= syntax to support LLVM=<suffix> form for using a
particular version of LLVm, and LLVM=<prefix> form for using custom
LLVM in a particular directory path.
- Clean up Makefiles
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmJFGloVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGH0kP/j6Vx5BqEv3tP2Q+UANxLqITleJs
IFpbSesz/BhlG7I/IapWmCDSqFbYd5uJTO4ko8CsPmZHcxr6Gw3y+DN5yQACKaG/
p9xiF6GjPyKR8+VdcT2tV50+dVY8ANe/DxCyzKrJd/uyYxgARPKJh0KRMNz+d9lj
ixUpCXDhx/XlKzPIlcxrvhhjevKz+NnHmN0fe6rzcOw9KzBGBTsf20Q3PqUuBOKa
rWHsRGcBPA8eKLfWT1Us1jjic6cT2g4aMpWjF20YgUWKHgWVKcNHpxYKGXASVo/z
ewdDnNfmwo7f7fKMCDDro9iwFWV/BumGtn43U00tnqdBcTpFojPlEOga37UPbZDF
nmTblGVUhR0vn4PmfBy8WkAkbW+IpVatKwJGV4J3KjSvdWvZOmVj9VUGLVAR0TXW
/YcgRs6EtG8Hn0IlCj0fvZ5wRWoDLbP2DSZ67R/44EP0GaNQPwUe4FI1izEE4EYX
oVUAIxcKixWGj4RmdtmtMMdUcZzTpbgS9uloMUmS3u9LK0Ir/8tcWaf2zfMO6Jl2
p4Q31s1dUUKCnFnj0xDKRyKGUkxYebrHLfuBqi0RIc0xRpSlxoXe3Dynm9aHEQoD
ZSV0eouQJxnaxM1ck5Bu4AHLgEebHfEGjWVyUHno7jFU5EI9Wpbqpe4pCYEEDTm1
+LJMEpdZO0dFvpF+
=84rW
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.18-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add new environment variables, USERCFLAGS and USERLDFLAGS to allow
additional flags to be passed to user-space programs.
- Fix missing fflush() bugs in Kconfig and fixdep
- Fix a minor bug in the comment format of the .config file
- Make kallsyms ignore llvm's local labels, .L*
- Fix UAPI compile-test for cross-compiling with Clang
- Extend the LLVM= syntax to support LLVM=<suffix> form for using a
particular version of LLVm, and LLVM=<prefix> form for using custom
LLVM in a particular directory path.
- Clean up Makefiles
* tag 'kbuild-v5.18-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: Make $(LLVM) more flexible
kbuild: add --target to correctly cross-compile UAPI headers with Clang
fixdep: use fflush() and ferror() to ensure successful write to files
arch: syscalls: simplify uapi/kapi directory creation
usr/include: replace extra-y with always-y
certs: simplify empty certs creation in certs/Makefile
certs: include certs/signing_key.x509 unconditionally
kallsyms: ignore all local labels prefixed by '.L'
kconfig: fix missing '# end of' for empty menu
kconfig: add fflush() before ferror() check
kbuild: replace $(if A,A,B) with $(or A,B)
kbuild: Add environment variables for userprogs flags
kbuild: unify cmd_copy and cmd_shipped
$(shell ...) expands to empty. There is no need to assign it to _dummy.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was
around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled
making the semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where now anything left in tracehook.h is
some weird strange thing that is difficult to understand.
Eric W. Biederman (15):
ptrace: Move ptrace_report_syscall into ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Remove tracehook_signal_handler
task_work: Remove unnecessary include from posix_timers.h
task_work: Introduce task_work_pending
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
resume_user_mode: Move to resume_user_mode.h
tracehook: Remove tracehook.h
ptrace: Move setting/clearing ptrace_message into ptrace_stop
ptrace: Return the signal to continue with from ptrace_stop
Jann Horn (1):
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
Yang Li (1):
ptrace: Remove duplicated include in ptrace.c
MAINTAINERS | 1 -
arch/Kconfig | 5 +-
arch/alpha/kernel/ptrace.c | 5 +-
arch/alpha/kernel/signal.c | 4 +-
arch/arc/kernel/ptrace.c | 5 +-
arch/arc/kernel/signal.c | 4 +-
arch/arm/kernel/ptrace.c | 12 +-
arch/arm/kernel/signal.c | 4 +-
arch/arm64/kernel/ptrace.c | 14 +--
arch/arm64/kernel/signal.c | 4 +-
arch/csky/kernel/ptrace.c | 5 +-
arch/csky/kernel/signal.c | 4 +-
arch/h8300/kernel/ptrace.c | 5 +-
arch/h8300/kernel/signal.c | 4 +-
arch/hexagon/kernel/process.c | 4 +-
arch/hexagon/kernel/signal.c | 1 -
arch/hexagon/kernel/traps.c | 6 +-
arch/ia64/kernel/process.c | 4 +-
arch/ia64/kernel/ptrace.c | 6 +-
arch/ia64/kernel/signal.c | 1 -
arch/m68k/kernel/ptrace.c | 5 +-
arch/m68k/kernel/signal.c | 4 +-
arch/microblaze/kernel/ptrace.c | 5 +-
arch/microblaze/kernel/signal.c | 4 +-
arch/mips/kernel/ptrace.c | 5 +-
arch/mips/kernel/signal.c | 4 +-
arch/nds32/include/asm/syscall.h | 2 +-
arch/nds32/kernel/ptrace.c | 5 +-
arch/nds32/kernel/signal.c | 4 +-
arch/nios2/kernel/ptrace.c | 5 +-
arch/nios2/kernel/signal.c | 4 +-
arch/openrisc/kernel/ptrace.c | 5 +-
arch/openrisc/kernel/signal.c | 4 +-
arch/parisc/kernel/ptrace.c | 7 +-
arch/parisc/kernel/signal.c | 4 +-
arch/powerpc/kernel/ptrace/ptrace.c | 8 +-
arch/powerpc/kernel/signal.c | 4 +-
arch/riscv/kernel/ptrace.c | 5 +-
arch/riscv/kernel/signal.c | 4 +-
arch/s390/include/asm/entry-common.h | 1 -
arch/s390/kernel/ptrace.c | 1 -
arch/s390/kernel/signal.c | 5 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/sh/kernel/signal_32.c | 4 +-
arch/sparc/kernel/ptrace_32.c | 5 +-
arch/sparc/kernel/ptrace_64.c | 5 +-
arch/sparc/kernel/signal32.c | 1 -
arch/sparc/kernel/signal_32.c | 4 +-
arch/sparc/kernel/signal_64.c | 4 +-
arch/um/kernel/process.c | 4 +-
arch/um/kernel/ptrace.c | 5 +-
arch/x86/kernel/ptrace.c | 1 -
arch/x86/kernel/signal.c | 5 +-
arch/x86/mm/tlb.c | 1 +
arch/xtensa/kernel/ptrace.c | 5 +-
arch/xtensa/kernel/signal.c | 4 +-
block/blk-cgroup.c | 2 +-
fs/coredump.c | 1 -
fs/exec.c | 1 -
fs/io-wq.c | 6 +-
fs/io_uring.c | 11 +-
fs/proc/array.c | 1 -
fs/proc/base.c | 1 -
include/asm-generic/syscall.h | 2 +-
include/linux/entry-common.h | 47 +-------
include/linux/entry-kvm.h | 2 +-
include/linux/posix-timers.h | 1 -
include/linux/ptrace.h | 81 ++++++++++++-
include/linux/resume_user_mode.h | 64 ++++++++++
include/linux/sched/signal.h | 17 +++
include/linux/task_work.h | 5 +
include/linux/tracehook.h | 226 -----------------------------------
include/uapi/linux/ptrace.h | 2 +-
kernel/entry/common.c | 19 +--
kernel/entry/kvm.c | 9 +-
kernel/exit.c | 3 +-
kernel/livepatch/transition.c | 1 -
kernel/ptrace.c | 47 +++++---
kernel/seccomp.c | 1 -
kernel/signal.c | 62 +++++-----
kernel/task_work.c | 4 +-
kernel/time/posix-cpu-timers.c | 1 +
mm/memcontrol.c | 2 +-
security/apparmor/domain.c | 1 -
security/selinux/hooks.c | 1 -
85 files changed, 372 insertions(+), 495 deletions(-)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmJCQkoACgkQC/v6Eiaj
j0DCWQ/5AZVFU+hX32obUNCLackHTwgcCtSOs3JNBmNA/zL/htPiYYG0ghkvtlDR
Dw5J5DnxC6P7PVAdAqrpvx2uX2FebHYU0bRlyLx8LYUEP5dhyNicxX9jA882Z+vw
Ud0Ue9EojwGWS76dC9YoKUj3slThMATbhA2r4GVEoof8fSNJaBxQIqath44t0FwU
DinWa+tIOvZANGBZr6CUUINNIgqBIZCH/R4h6ArBhMlJpuQ5Ufk2kAaiWFwZCkX4
0LuuAwbKsCKkF8eap5I2KrIg/7zZVgxAg9O3cHOzzm8OPbKzRnNnQClcDe8perqp
S6e/f3MgpE+eavd1EiLxevZ660cJChnmikXVVh8ZYYoefaMKGqBaBSsB38bNcLjY
3+f2dB+TNBFRnZs1aCujK3tWBT9QyjZDKtCBfzxDNWBpXGLhHH6j6lA5Lj+Cef5K
/HNHFb+FuqedlFZh5m1Y+piFQ70hTgCa2u8b+FSOubI2hW9Zd+WzINV0ANaZ2LvZ
4YGtcyDNk1q1+c87lxP9xMRl/xi6rNg+B9T2MCo4IUnHgpSVP6VEB3osgUmrrrN0
eQlUI154G/AaDlqXLgmn1xhRmlPGfmenkxpok1AuzxvNJsfLKnpEwQSc13g3oiZr
disZQxNY0kBO2Nv3G323Z6PLinhbiIIFez6cJzK5v0YJ2WtO3pY=
=uEro
-----END PGP SIGNATURE-----
Merge tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
Signal processor SIGP_SET_PREFIX command expects physical
address of the lowcore to be installed, but instead the
virtual address is provided.
Note: this does not fix a bug currently, since virtual and
physical addresses are identical.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Macro mem_assign_absolute() is able to access the whole memory, but
is only used and makes sense when updating the absolute lowcore.
Instead, introduce get_abs_lowcore() and put_abs_lowcore() macros
that limit access to absolute lowcore addresses only.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Get rid of duplicate code and redundant data.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Macro mem_assign_absolute() is used to initialize a target
CPU lowcore callback parameters. But despite the macro name
it writes to the absolute lowcore only if the target CPU is
offline. In case the CPU is online the macro does implicitly
write to the normal memory.
That behaviour is correct, but extremely subtle. Sacrifice
few program bits in favour of clarity and distinguish between
online vs offline CPUs and normal vs absolute lowcore pointer.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently when unwinding starts from pt_regs or encounters pt_regs along
the way unwinder tries to yield 2 unwinding entries:
1. (reliable) ip1: pt_regs->psw.addr, sp1: regs->gprs[15]
2. (non-reliable) ip2: sp1->gprs[8] (r14), sp2: regs->gprs[15]
In case of kretprobes those are identical and serves no other purpose
than causing confusion over duplicated entries and cause kprobes tests
to fail. So, skip a duplicate non-reliable entry in this case.
With that kretprobes and unwinder implementation now comply with
ARCH_CORRECT_STACKTRACE_ON_KRETPROBE.
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Based on commit cd9bc2c925 ("arm64: Recover kretprobe modified return
address in stacktrace").
"""
Since the kretprobe replaces the function return address with
the __kretprobe_trampoline on the stack, stack unwinder shows it
instead of the correct return address.
This checks whether the next return address is the
__kretprobe_trampoline(), and if so, try to find the correct
return address from the kretprobe instance list.
"""
Original patch series:
https://lore.kernel.org/all/163163030719.489837.2236069935502195491.stgit@devnote2/
Reviewed-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Use insn format with instruction format specifier instead of plain
longs. This way it is also more obvious that code instead of data is
generated.
The generated code is identical.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There are many different types of translation exceptions but only a
translation-specification exception leads to a kernel panic since it
indicates corrupted page tables, which must never happen.
Improve the panic message so it is a bit more obvious what this is about.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Merge yet more updates from Andrew Morton:
"This is the material which was staged after willystuff in linux-next.
Subsystems affected by this patch series: mm (debug, selftests,
pagecache, thp, rmap, migration, kasan, hugetlb, pagemap, madvise),
and selftests"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (113 commits)
selftests: kselftest framework: provide "finished" helper
mm: madvise: MADV_DONTNEED_LOCKED
mm: fix race between MADV_FREE reclaim and blkdev direct IO read
mm: generalize ARCH_HAS_FILTER_PGPROT
mm: unmap_mapping_range_tree() with i_mmap_rwsem shared
mm: warn on deleting redirtied only if accounted
mm/huge_memory: remove stale locking logic from __split_huge_pmd()
mm/huge_memory: remove stale page_trans_huge_mapcount()
mm/swapfile: remove stale reuse_swap_page()
mm/khugepaged: remove reuse_swap_page() usage
mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page()
mm: streamline COW logic in do_swap_page()
mm: slightly clarify KSM logic in do_swap_page()
mm: optimize do_wp_page() for fresh pages in local LRU pagevecs
mm: optimize do_wp_page() for exclusive pages in the swapcache
mm/huge_memory: make is_transparent_hugepage() static
userfaultfd/selftests: enable hugetlb remap and remove event testing
selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
mm: enable MADV_DONTNEED for hugetlb mappings
kasan: disable LOCKDEP when printing reports
...
- Raise minimum supported machine generation to z10, which comes with
various cleanups and code simplifications (usercopy/spectre
mitigation/etc).
- Rework extables and get rid of anonymous out-of-line fixups.
- Page table helpers cleanup. Add set_pXd()/set_pte() helper
functions. Covert pte_val()/pXd_val() macros to functions.
- Optimize kretprobe handling by avoiding extra kprobe on
__kretprobe_trampoline.
- Add support for CEX8 crypto cards.
- Allow to trigger AP bus rescan via writing to /sys/bus/ap/scans.
- Add CONFIG_EXPOLINE_EXTERN option to build the kernel without COMDAT
group sections which simplifies kpatch support.
- Always use the packed stack layout and extend kernel unwinder tests.
- Add sanity checks for ftrace code patching.
- Add s390dbf debug log for the vfio_ap device driver.
- Various virtual vs physical address confusion fixes.
- Various small fixes and improvements all over the code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmI94dsACgkQjYWKoQLX
FBiaCggAm9xYJ06Qt9c+T9B7aA4Lt50w7Bnxqx1/Q7UHQQgDpkNhKzI1kt/xeKY4
JgZQ9lJC4YRLlyfIVzffLI2DWGbl8BcTpuRWVLhPI5D2yHZBXr2ARe7IGFJueddy
MVqU/r+U3H0r3obQeUc4TSrHtSRX7eQZWIoVuDU75b9fCniee/bmGZqs6yXPXXh4
pTZQ/gsIhF/o6eBJLEXLjUAcIasxCk15GXWXmkaSwKHAhfYiintwGmtKqQ8etCvw
17vdlTjA4ce+3ooD/hXGPa8TqeiGKsIB2Xr89x/48f1eJyp2zPJZ1ZvAUBHJBCNt
b4sF4ql8303Lj7Be+LeqdlbXfa5PZg==
=meZf
-----END PGP SIGNATURE-----
Merge tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Raise minimum supported machine generation to z10, which comes with
various cleanups and code simplifications (usercopy/spectre
mitigation/etc).
- Rework extables and get rid of anonymous out-of-line fixups.
- Page table helpers cleanup. Add set_pXd()/set_pte() helper functions.
Covert pte_val()/pXd_val() macros to functions.
- Optimize kretprobe handling by avoiding extra kprobe on
__kretprobe_trampoline.
- Add support for CEX8 crypto cards.
- Allow to trigger AP bus rescan via writing to /sys/bus/ap/scans.
- Add CONFIG_EXPOLINE_EXTERN option to build the kernel without COMDAT
group sections which simplifies kpatch support.
- Always use the packed stack layout and extend kernel unwinder tests.
- Add sanity checks for ftrace code patching.
- Add s390dbf debug log for the vfio_ap device driver.
- Various virtual vs physical address confusion fixes.
- Various small fixes and improvements all over the code.
* tag 's390-5.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (69 commits)
s390/test_unwind: add kretprobe tests
s390/kprobes: Avoid additional kprobe in kretprobe handling
s390: convert ".insn" encoding to instruction names
s390: assume stckf is always present
s390/nospec: move to single register thunks
s390: raise minimum supported machine generation to z10
s390/uaccess: Add copy_from/to_user_key functions
s390/nospec: align and size extern thunks
s390/nospec: add an option to use thunk-extern
s390/nospec: generate single register thunks if possible
s390/pci: make zpci_set_irq()/zpci_clear_irq() static
s390: remove unused expoline to BC instructions
s390/irq: use assignment instead of cast
s390/traps: get rid of magic cast for per code
s390/traps: get rid of magic cast for program interruption code
s390/signal: fix typo in comments
s390/asm-offsets: remove unused defines
s390/test_unwind: avoid build warning with W=1
s390: remove .fixup section
s390/bpf: encode register within extable entry
...
Rename kasan_free_shadow to kasan_free_module_shadow and
kasan_module_alloc to kasan_alloc_module_shadow.
These functions are used to allocate/free shadow memory for kernel modules
when KASAN_VMALLOC is not enabled. The new names better reflect their
purpose.
Also reword the comment next to their declaration to improve clarity.
Link: https://lkml.kernel.org/r/36db32bde765d5d0b856f77d2d806e838513fe84.1643047180.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... and call node_dev_init() after memory_dev_init() from driver_init(),
so before any of the existing arch/subsys calls. All online nodes should
be known at that point: early during boot, arch code determines node and
zone ranges and sets the relevant nodes online; usually this happens in
setup_arch().
This is in line with memory_dev_init(), which initializes the memory
device subsystem and creates all memory block devices.
Similar to memory_dev_init(), panic() if anything goes wrong, we don't
want to continue with such basic initialization errors.
The important part is that node_dev_init() gets called after
memory_dev_init() and after cpu_dev_init(), but before any of the relevant
archs call register_cpu() to register the new cpu device under the node
device. The latter should be the case for the current users of
topology_init().
Link: https://lkml.kernel.org/r/20220203105212.30385-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Anatoly Pugachev <matorola@gmail.com> (sparc64)
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that all of the definitions have moved out of tracehook.h into
ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in
tracehook.h so remove it.
Update the few files that were depending upon tracehook.h to bring in
definitions to use the headers they need directly.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Always handle TIF_NOTIFY_SIGNAL in get_signal. With commit 35d0b389f3
("task_work: unconditionally run task_work from get_signal()") always
calling task_work_run all of the work of tracehook_notify_signal is
already happening except clearing TIF_NOTIFY_SIGNAL.
Factor clear_notify_signal out of tracehook_notify_signal and use it in
get_signal so that get_signal only needs one call of task_work_run.
To keep the semantics in sync update xfer_to_guest_mode_work (which
does not call get_signal) to call tracehook_notify_signal if either
_TIF_SIGPENDING or _TIF_NOTIFY_SIGNAL.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-8-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
So far, s390 registered a krobe on __kretprobe_trampoline which is
called everytime a kretprobe fires. This kprobe would then determine
the correct return address and adjust the psw accordingly, such that
the kretprobe would branch to the appropriate address after completion.
Some other archs handle kretprobes without such an additional kprobe.
This approach is adopted to s390 with this patch.
Furthermore, the __kretprobe_trampoline now uses an assembler function
to correctly gather the register and psw content to be passed to the
registered kretprobe handler as struct pt_regs. After completion, the
register content and the psw are set based on the contents of said
pt_regs struct.
Note that a change to the psw address in struct pt_regs will not have
an impact, as the probe will still return to the original return
address of the probed function.
The return address is now recovered by using the appropriate function
arch_kretprobe_fixup_return.
The no longer needed kprobe is removed.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With z10 as minimum supported machine generation many ".insn" encodings
could be now converted to instruction names. There are couple of exceptions
- stfle is used from the als code built for z900 and cannot be converted
- few ".insn" directives encode unsupported instruction formats
The generated code is identical before/after this change.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With z10 as minimum supported machine generation the store-clock-fast
facility (25) is always present and checked in als code.
Drop alternatives and always use stckf.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Machine generations up to z9 (released in May 2006) have been officially
out of service for several years now (z9 end of service - January 31, 2019).
No distributions build kernels supporting those old machine generations
anymore, except Debian, which seems to pick the oldest supported
generation. The team supporting Debian on s390 has been notified about
the change.
Raising minimum supported machine generation to z10 helps to reduce
maintenance cost and effectively remove code, which is not getting
enough testing coverage due to lack of older hardware and distributions
support. Besides that this unblocks some optimization opportunities and
allows to use wider instruction set in asm files for future features
implementation. Due to this change spectre mitigation and usercopy
implementations could be drastically simplified and many newer instructions
could be converted from ".insn" encoding to instruction names.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This reverts commit 6deaa3bbca ("s390: extend expoline to BC
instructions"). Expolines to BC instructions were added to be utilized
by commit de5cb6eb51 ("s390: use expoline thunks in the BPF JIT"). But
corresponding code has been removed by commit e1cf4befa2 ("bpf, s390x:
remove ld_abs/ld_ind"). And compiler does not generate such expolines as
well.
Compared to regular expolines, expolines to BC instructions contain
displacement and all possible variations cannot be generated in advance,
making kpatch support more complicated. So, remove those to avoid
future usages.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Change struct ext_code to contain a union which allows to simply
assign the int_code instead of using a cast.
In order to keep the patch small the anonymous union is embedded
within the existing struct instead of changing the struct ext_code to
a union.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add a proper union in lowcore to reflect architecture and get rid of a
"magic" cast in order to read the full per code.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add a proper union in lowcore to reflect architecture and get rid of a
"magic" cast in order to read the full program interruption code.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The only user is gone. Remove the section.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add and use fixup_exception helper function in order to remove the
duplicated exception handler fixup code at several places.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pass pt_regs to early program check handler like it is done for every
other interrupt and exception handler.
Also the passed pt_regs can be changed by the called function and the
changes register contents and psw contents will be taken into account
when returning. In addition the return psw will not be copied to the
program check old psw in lowcore, but to the usual return psw
location, like it is also done by the regular program check handler.
This allows also to get rid of the code that disabled lowcore
protection when changing the return address.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Follow arm64 and riscv and move the EX_TABLE define to asm-extable.h
which is a lot less generic than the current linkage.h.
Also make sure that all files which contain EX_TABLE usages actually
include the new header file. This should make sure that the files
always compile and there won't be any random compile breakage due to
other header file dependencies.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The early program check handler is active before the amode31 extable
is sorted. Therefore in case a program check happens early within the
amode31 code the extable entry might not be found.
Fix this by sorting the amode31 extable early.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Finally use epsw to create a complete psw mask within pt_regs. Without
this only some bits are correct, while other bits are (incorrectly)
always zero.
The epsw instruction is quite heavy weight, however given that this
only effects ftrace_regs_caller this seems to be the right thing, so
we finally get a complete psw mask for ftrace kprobed functions.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove opencoded offsetof and use offsetof instead.
The generated code is identical before/after this change.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With commit 5789284710 ("s390/smp: reallocate IPL CPU lowcore")
virtual addresses are wrongly passed to memblock_free_late() and
SPX instructions on IPL CPU reinitialization.
Note: this does not fix a bug currently, since virtual and
physical addresses are identical.
Fixes: 5789284710 ("s390/smp: reallocate IPL CPU lowcore")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
-mpacked-stack option has been supported by both minimum
gcc and clang versions for a while. With commit e2bc3e91d9
("scripts/min-tool-version.sh: Raise minimum clang version to 13.0.0
for s390") minimum clang version now also supports a combination
of flags -mpacked-stack -mbackchain -pg -mfentry and fulfills
all requirements to always enable the packed stack layout.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This helps to avoid several merge conflicts later.
* fixes:
s390/extable: fix exception table sorting
s390/ftrace: fix arch_ftrace_get_regs implementation
s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE
s390/cio: verify the driver availability for path_event call
s390/module: fix building test_modules_helpers.o with clang
MAINTAINERS: downgrade myself to Reviewer for s390
MAINTAINERS: add Alexander Gordeev as maintainer for s390
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
arch_ftrace_get_regs is supposed to return a struct pt_regs pointer
only if the pt_regs structure contains all register contents, which
means it must have been populated when created via ftrace_regs_caller.
If it was populated via ftrace_caller the contents are not complete
(the psw mask part is missing), and therefore a NULL pointer needs be
returned.
The current code incorrectly always returns a struct pt_regs pointer.
Fix this by adding another pt_regs flag which indicates if the
contents are complete, and fix arch_ftrace_get_regs accordingly.
Fixes: 894979689d ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reported-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
ftrace_caller was used for both ftrace_caller and ftrace_regs_caller,
which means that the target address of the hotpatch trampoline was
never updated.
With commit 894979689d ("s390/ftrace: provide separate
ftrace_caller/ftrace_regs_caller implementations") a separate
ftrace_regs_caller entry point was implemeted, however it was
forgotten to implement the necessary changes for ftrace_modify_call
and ftrace_make_call, where the branch target has to be modified
accordingly.
Therefore add the missing code now.
Fixes: 894979689d ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
We need to preserve the values at OLDMEM_BASE and OLDMEM_SIZE which are
used by zgetdump in case when kdump crashes. In that case zgetdump will
attempt to read OLDMEM_BASE and OLDMEM_SIZE in order to find out where
the memory range [0 - OLDMEM_SIZE] belonging to the production kernel is.
Fixes: f1a5469474 ("s390/setup: don't reserve memory that occupied decompressor's head")
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There is a confusion with regard to the source address of
memcpy_real() and calling functions. While the declared
type for a source assumes a virtual address, in fact it
always called with physical address of the source.
This confusion led to bugs in copy_oldmem_kernel() and
copy_oldmem_user() functions, where __pa() macro applied
mistakenly to physical addresses. It does not lead to a
real issue, since virtual and physical addresses are
currently the same.
Fix both the bugs and memcpy_real() prototype by making
type of source address consistent to the function name
and the way it actually used.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Virtual addresses of vmcore_info and os_info members are
wrongly passed to copy_oldmem_kernel(), while the function
expects physical address of the source. Instead, __pa()
macro should have been applied.
Yet, use of __pa() macro could be somehow confusing, since
copy_oldmem_kernel() may treat the source as an offset, not
as a direct physical address (that depens from the oldmem
availability and location).
Fix the virtual vs physical address confusion and make the
way the old lowcore is read consistent across all sources.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
It is quite pointless to use memcpy to copy two bytes, besides that
this construct will also partially remove type and size sanity checks.
Therefore simply use an assignment.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Due to historical reasons os_info handling functions misuse
the notion of physical vs virtual addresses difference.
Note: this does not fix a bug currently, since virtual
and physical addresses are identical.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
commit 72b3942a17 ("scripts: ftrace - move the
sort-processing in ftrace_init") had the unexpected
side effect that wrong code locations were patched.
To prevent this from happening again, verify the
opcode before patching it.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove my old invalid email address which can be found in a couple of
files. Instead of updating it, just remove my contact data completely
from source files.
We have git and other tools which allow to figure out who is responsible
for what with recent contact data.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
If the size of the PLT entries generated by apply_rela() exceeds
64KiB, the first ones can no longer reach __jump_r1 with brc. Fix by
using brcl. An alternative solution is to add a __jump_r1 copy after
every 64KiB, however, the space savings are quite small and do not
justify the additional complexity.
Fixes: f19fbd5ed6 ("s390: introduce execute-trampolines for branches")
Cc: stable@vger.kernel.org
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The machine check validity bit tells about the context. If a KVM guest
was running the bit tells about the guest validity and the host state is
not affected. As a guest can disable the guest validity this might
result in unwanted host errors on machine checks.
Cc: stable@vger.kernel.org
Fixes: c929500d7a ("s390/nmi: s390: New low level handling for machine check happening in guest")
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
machine check validity bits reflect the state of the machine check. If a
guest does not make use of guarded storage, the validity bit might be
off. We can not use the host CR bit to decide if the validity bit must
be on. So ignore "invalid" guarded storage controls for KVM guests in
the host and rely on the machine check being forwarded to the guest. If
no other errors happen from a host perspective everything is fine and no
process must be killed and the host can continue to run.
Cc: stable@vger.kernel.org
Fixes: c929500d7a ("s390/nmi: s390: New low level handling for machine check happening in guest")
Reported-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- add Sven Schnelle as reviewer for s390 code
- make uaccess code more readable
- change cpu measurement facility code to also support counter second
version number 7, and add discard support for limited samples
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmHprPAACgkQIg7DeRsp
bsLqkQ//ZWeZpN8YUS1VRoV0yPl2FX1LC19DXu5kat5UeUgeCSAG4COVv1XejD33
RP6zLFVBTVncdA4qrAsMJZnPpT/RSUS6fk0t0zETj6n0orjJYqekRnGuhQSATlzK
yceIamg9tyqZgTCeBlCLF0ThFB5tsHVDQnrqRLECsY/Q24/2q04/97BFIak7iVAv
D+0xivhf6rLufbw1SfxO7xXvtUBtdZcJUC1y9OhRp5Io1tGNkaKUziYwsnBicePg
6RGFtYv95QqQ1XqC47sFyntp7FK3RFK0DnQx7cWcAknAEOqNN/IUT/GnJlywSNK+
4ZtCG7kIIBmCXZbPiF5uhrf5vrRCv9zCoxHmZvubpeNF06SKLVl5Nx9Wbqe4eC0w
5+CmSX+oO4JNJ4GN6hHURtgf0veYCZPDTtQ4pOuIGYxRtOmeFYlNcrCC3imgbZfx
JRRFgaaX7mbUkq95acgbWowLMbWJR/TWC/caA9hh7awOzSlkhmAmnHg2s5kTnDjE
n6+WTH9a9qn7k6mMFaA7Vfot/GYHValgl5FGQO5LXN+Y2/xMi3zS6hhYGi+JXMyR
NlsQO9CRehUU4ApkyHDqH2q7G04Ko63DJ2DUZAHixrCM+c76EGzEN90bTGtVDwvk
X72WpRpMvoD5Aqu2RVq0GyrlHH7MTFmURyz9Sqy8T5CtwAa/6As=
=IQZX
-----END PGP SIGNATURE-----
Merge tag 's390-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- add Sven Schnelle as reviewer for s390 code
- make uaccess code more readable
- change cpu measurement facility code to also support counter second
version number 7, and add discard support for limited samples
* tag 's390-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: add Sven Schnelle as reviewer
s390/uaccess: introduce bit field for OAC specifier
s390/cpumf: Support for CPU Measurement Sampling Facility LS bit
s390/cpumf: Support for CPU Measurement Facility CSVN 7
Adds support for the CPU Measurement Sampling Facility limit sampling
bit in the sampling device driver.
Limited samples have no valueable information are not collected.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Adds support for the CPU Measurement Counter Facility second version
number 7.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
Yongqiang reports a kmemleak panic when module insmod/rmmod with KASAN
enabled(without KASAN_VMALLOC) on x86[1].
When the module area allocates memory, it's kmemleak_object is created
successfully, but the KASAN shadow memory of module allocation is not
ready, so when kmemleak scan the module's pointer, it will panic due to
no shadow memory with KASAN check.
module_alloc
__vmalloc_node_range
kmemleak_vmalloc
kmemleak_scan
update_checksum
kasan_module_alloc
kmemleak_ignore
Note, there is no problem if KASAN_VMALLOC enabled, the modules area
entire shadow memory is preallocated. Thus, the bug only exits on ARCH
which supports dynamic allocation of module area per module load, for
now, only x86/arm64/s390 are involved.
Add a VM_DEFER_KMEMLEAK flags, defer vmalloc'ed object register of
kmemleak in module_alloc() to fix this issue.
[1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/
[wangkefeng.wang@huawei.com: fix build]
Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
[akpm@linux-foundation.org: simplify ifdefs, per Andrey]
Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com
Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
Fixes: 793213a82d ("s390/kasan: dynamic shadow mem allocation for modules")
Fixes: 39d114ddc6 ("arm64: add KASAN support")
Fixes: bebf56a1b1 ("kasan: enable instrumentation of global variables")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- add fast vector/SIMD implementation of the ChaCha20 stream cipher,
which mainly adapts Andy Polyakov's code for the kernel
- add status attribute to AP queue device so users can easily figure
out its status
- fix race in page table release code, and and lots of documentation
- remove uevent suppress from cio device driver, since it turned out
that it generated more problems than it solved problems
- quite a lot of virtual vs physical address confusion fixes
- various other small improvements and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmHbyk8ACgkQIg7DeRsp
bsIizA/5AS6ZSDsoiOyuRtBpEJk/lmgRLYcjgqJHVE7ShQVu+CERG1+L6R5Wgecw
/nKXqsxqt2p8ql/IyaKMzep1I8xQKi1XUW2Nq3ntbJV0NkEfMf/ZNf0mtTfERVP3
OwB0kHMujFrymLhJvlRFdwuPbdGan5ZsUhoBoQuBW4DZ8ly3tpsgMr5ycPMfICiZ
0e2zuC84keEp0xYbkAQ1u48u2r7LTrT/8F77WzGYW06JzjscZMQE62i7NCD+RR4Y
D04IH4EA2fT6CpyIBgZRJia+t5BzEQlASBVjczoT7C16sHY4o239iMhnGemQC2Hz
TwmXQwjop6eIS1XJ2gF6tvnIrbSNF/3fEV9UHasrF3PuWbWsspHJmz9ciDJqiUCs
i+FRBdqhe4L6lR4LjTfi1+VQNEIDEFKJ41jpOKiSVWlBVcpX6XTd5bjuWI3YD4O7
Jz5s0q1go0P0Xg0qY/JdptCAU1VYFYUhrGsvDKtAmLHRgoWjk6D02CF/FFgCyiPK
hshWikxfFrU0K1lfNf6248PnjTjPbguxDDJlCD6xkCmxWPPaYFf9pR2XJXy9pePB
9qriNhcflDqSgRs/c2AykERaQymZfuFypNNNIrDoY0tzgxfIa0af+RZl93XwqdiP
SnVc94381ccHKj0DUq+7Pa0VTx9Q1jBZecVPpE7bXDx5g+IrqiM=
=Iy+v
-----END PGP SIGNATURE-----
Merge tag 's390-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
"Besides all the small improvements and cleanups the most notable part
is the fast vector/SIMD implementation of the ChaCha20 stream cipher,
which is an adaptation of Andy Polyakov's code for the kernel.
Summary:
- add fast vector/SIMD implementation of the ChaCha20 stream cipher,
which mainly adapts Andy Polyakov's code for the kernel
- add status attribute to AP queue device so users can easily figure
out its status
- fix race in page table release code, and and lots of documentation
- remove uevent suppress from cio device driver, since it turned out
that it generated more problems than it solved problems
- quite a lot of virtual vs physical address confusion fixes
- various other small improvements and cleanups all over the place"
* tag 's390-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
s390/dasd: use default_groups in kobj_type
s390/sclp_sd: use default_groups in kobj_type
s390/pci: simplify __pciwb_mio() inline asm
s390: remove unused TASK_SIZE_OF
s390/crash_dump: fix virtual vs physical address handling
s390/crypto: fix compile error for ChaCha20 module
s390/mm: check 2KB-fragment page on release
s390/mm: better annotate 2KB pagetable fragments handling
s390/mm: fix 2KB pgtable release race
s390/sclp: release SCLP early buffer after kernel initialization
s390/nmi: disable interrupts on extended save area update
s390/zcrypt: CCA control CPRB sending
s390/disassembler: update opcode table
s390/uv: fix memblock virtual vs physical address confusion
s390/smp: fix memblock_phys_free() vs memblock_free() confusion
s390/sclp: fix memblock_phys_free() vs memblock_free() confusion
s390/exit: remove dead reference to do_exit from copy_thread
s390/ap: add missing virt_to_phys address conversion
s390/pgalloc: use pointers instead of unsigned long values
s390/pgalloc: add virt/phys address handling to base asce functions
...
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME support
(scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and reciprocal
square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP == 1
(potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmHXNtYACgkQa9axLQDI
XvHBGw/+OVGdbORxwrU+uRb7N6qIJkrW/mmM4x1KLo1i+REZLb8/VlXm0xC60FG+
39x6FSVkRr+lLDfTqpQsOez5FpdsvOe9Fc4L3bwniDg+EPo7x65VmP2dw/Ae2q0i
87xyWCczx5hFEPF/1sb1R1pm3bTXjeklBkdv+OXhwflLOwpCp1J8z8WJK8qJVFX6
CmuE6Q4fDQr0ghl9Nf8DiAr20mHDh8wMKNUJOg4waaQOOCta6q1oJ3qfz6E9z1eW
zEE3dfZgBCx7HCRc3KGgzT7H4Ces3BYvhBYP6bJRliVI88XdPiM4MfdGL4UIb27Q
NLAdr+FVzk/YLzMHtxSfkT10nBqoOPWUTckLu9jIIl5cpBX73Wiz7jfzBvqFmC/y
opSFMZ3lwQPM5WAPtAlZptA3GPPySeInVmvUgB7IQ+1Q1T1n8ri1y5hzTYC4Sc/g
amJI1rXf1Al8+2zFBggr6Up+EOnfV9nAwrzLXkRlASsfmvY4dnVWg3NWfBqtEHAq
VuZCecSgawxuSlpmJ4VGbLrBFaz18bn9EzujR5fFvi5Qcg1CMFOROi2+6IynopNV
IS0R8j6fwgQPA5lcnNIPeJRRkQoqO4l8bPDzeXEny0BSw313EgBSo9aQtnjyIJbp
BTuDHARKs+/NvDPvd8GQkxNPgwJnVOL9pdgNAolEu1/k7JtnIS0=
=ecyi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME
support (scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and
reciprocal square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP ==
1 (potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
arm64: Use correct method to calculate nomap region boundaries
arm64: Drop outdated links in comments
arm64: perf: Don't register user access sysctl handler multiple times
drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check
perf/smmuv3: Fix unused variable warning when CONFIG_OF=n
arm64: errata: Fix exec handling in erratum 1418040 workaround
arm64: Unhash early pointer print plus improve comment
asm-generic: introduce io_stop_wc() and add implementation for ARM64
arm64: Ensure that the 'bti' macro is defined where linkage.h is included
arm64: remove __dma_*_area() aliases
docs/arm64: delete a space from tagged-address-abi
arm64: Enable KCSAN
kselftest/arm64: Add pidbench for floating point syscall cases
arm64/fp: Add comments documenting the usage of state restore functions
kselftest/arm64: Add a test program to exercise the syscall ABI
kselftest/arm64: Allow signal tests to trigger from a function
kselftest/arm64: Parameterise ptrace vector length information
arm64/sve: Minor clarification of ABI documentation
arm64/sve: Generalise vector length configuration prctl() for SME
arm64/sve: Make sysctl interface for SVE reusable by SME
...
Signal processor STORE STATUS requires a physical address where register
contents are supposed to be written to, however the kernel must read the
data via the corresponding virtual address.
Also the allocated save_area, where register contents are copied to,
resides in virtual address space.
Fix this by using proper __pa() conversion, or correct memblock_alloc()
invocation.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Updating of the pointer to machine check extended save area
on the IPL CPU needs the lowcore protection to be disabled.
Disable interrupts while the protection is off to avoid
unnoticed writes to the lowcore.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Sync with binutils: update opcode table to reflect the
instruction format update of the lpswey instruction, and
add the qpaci instruction.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
memblock_alloc_try_nid() returns a virtual address, however in error
case the allocated memory is incorrectly freed with memblock_phys_free().
Properly use memblock_free() instead, and pass a physical address to
uv_init() to fix this.
Note: this doesn't fix a bug currently, since virtual and physical
addresses are identical.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
memblock_phys_free() is used on a virtual address. Fix this by using
memblock_free().
Note: this doesn't fix a bug currently, since virtual and physical
addresses are identical.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
My s390 assembly is not particularly good so I have read the history
of the reference to do_exit copy_thread and have been able to
verify that do_exit is not used.
The general argument is that s390 has been changed to use the generic
kernel_thread and kernel_execve and the generic versions do not call
do_exit. So it is strange to see a do_exit reference sitting there.
The history of the do_exit reference in s390's version of copy_thread
seems conclusive that the do_exit reference is something that lingers
and should have been removed several years ago.
Up through 8d19f15a60be ("[PATCH] s390 update (1/27): arch.") the
s390 code made a call to the exit(2) system call when a kernel thread
finished. Then kernel_thread_starter was added which branched
directly to the value in register 11 when the kernel thread finshed.
The value in register 11 was set in kernel_thread to
"regs.gprs[11] = (unsigned long) do_exit"
In commit 37fe5d41f6 ("s390: fold kernel_thread_helper() into
ret_from_fork()") kernel_thread_starter was moved into entry.S and
entry64.S unchanged (except for the syntax differences between inline
assemly and in the assembly file).
In commit f9a7e025df ("s390: switch to generic kernel_thread()") the
assignment to "gprs[11]" was moved into copy_thread from the old
kernel_thread. The helper kernel_thread_starter was still being used
and was still branching to "%r11" at the end.
In commit 30dcb0996e ("s390: switch to saner kernel_execve()
semantics") kernel_thread_starter was changed to unconditionally
branch to sysc_tracenogo instead to %r11 which held the value of
do_exit. Unfortunately copy_thread was not updated to stop passing
do_exit in "gprs[11]".
In commit 56e62a7370 ("s390: convert to generic entry")
kernel_thread_starter was replaced by __ret_from_fork. And the code
still continued to pass do_exit in "gprs[11]" despite __ret_from_fork
not caring in the slightest.
Remove this dead reference to do_exit to make it clear that s390 is
not doing anything with do_exit in copy_thread.
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 30dcb0996e ("s390: switch to saner kernel_execve() semantics")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://lore.kernel.org/r/20211208202532.16409-1-ebiederm@xmission.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
My s390 assembly is not particularly good so I have read the history
of the reference to do_exit copy_thread and have been able to
verify that do_exit is not used.
The general argument is that s390 has been changed to use the generic
kernel_thread and kernel_execve and the generic versions do not call
do_exit. So it is strange to see a do_exit reference sitting there.
The history of the do_exit reference in s390's version of copy_thread
seems conclusive that the do_exit reference is something that lingers
and should have been removed several years ago.
Up through 8d19f15a60be ("[PATCH] s390 update (1/27): arch.") the
s390 code made a call to the exit(2) system call when a kernel thread
finished. Then kernel_thread_starter was added which branched
directly to the value in register 11 when the kernel thread finshed.
The value in register 11 was set in kernel_thread to
"regs.gprs[11] = (unsigned long) do_exit"
In commit 37fe5d41f6 ("s390: fold kernel_thread_helper() into
ret_from_fork()") kernel_thread_starter was moved into entry.S and
entry64.S unchanged (except for the syntax differences between inline
assemly and in the assembly file).
In commit f9a7e025df ("s390: switch to generic kernel_thread()") the
assignment to "gprs[11]" was moved into copy_thread from the old
kernel_thread. The helper kernel_thread_starter was still being used
and was still branching to "%r11" at the end.
In commit 30dcb0996e ("s390: switch to saner kernel_execve()
semantics") kernel_thread_starter was changed to unconditionally
branch to sysc_tracenogo instead to %r11 which held the value of
do_exit. Unfortunately copy_thread was not updated to stop passing
do_exit in "gprs[11]".
In commit 56e62a7370 ("s390: convert to generic entry")
kernel_thread_starter was replaced by __ret_from_fork. And the code
still continued to pass do_exit in "gprs[11]" despite __ret_from_fork
not caring in the slightest.
Remove this dead reference to do_exit to make it clear that s390 is
not doing anything with do_exit in copy_thread.
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 30dcb0996e ("s390: switch to saner kernel_execve() semantics")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In the current code, when exiting from idle, rcu_irq_enter() is
called twice during irq entry:
irq_entry_enter()-> rcu_irq_enter()
irq_enter() -> rcu_irq_enter()
This may lead to wrong results from rcu_is_cpu_rrupt_from_idle()
because of a wrong dynticks nmi nesting count. Fix this by only
calling irq_enter_rcu().
Cc: <stable@vger.kernel.org> # 5.12+
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 56e62a7370 ("s390: convert to generic entry")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Starting with gcc 11.3, the C compiler will generate PLT-relative function
calls even if they are local and do not require it. Later on during linking,
the linker will replace all PLT-relative calls to local functions with
PC-relative ones. Unfortunately, the purgatory code of kexec/kdump is
not being linked as a regular executable or shared library would have been,
and therefore, all PLT-relative addresses remain in the generated purgatory
object code unresolved. This leads to the situation where the purgatory
code is being executed during kdump with all PLT-relative addresses
unresolved. And this results in endless loops within the purgatory code.
Furthermore, the clang C compiler has always behaved like described above
and this commit should fix kdump for kernels built with the latter.
Because the purgatory code is no regular executable or shared library,
contains only calls to local functions and has no PLT, all R_390_PLT32DBL
relocation entries can be resolved just like a R_390_PC32DBL one.
* https://refspecs.linuxfoundation.org/ELF/zSeries/lzsabi0_zSeries/x1633.html#AEN1699
Relocation entries of purgatory code generated with gcc 11.3
------------------------------------------------------------
$ readelf -r linux/arch/s390/purgatory/purgatory.o
Relocation section '.rela.text' at offset 0x370 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000005c 000c00000013 R_390_PC32DBL 0000000000000000 purgatory_sha_regions + 2
00000000007a 000d00000014 R_390_PLT32DBL 0000000000000000 sha256_update + 2
00000000008c 000e00000014 R_390_PLT32DBL 0000000000000000 sha256_final + 2
000000000092 000800000013 R_390_PC32DBL 0000000000000000 .LC0 + 2
0000000000a0 000f00000014 R_390_PLT32DBL 0000000000000000 memcmp + 2
Relocation entries of purgatory code generated with gcc 11.2
------------------------------------------------------------
$ readelf -r linux/arch/s390/purgatory/purgatory.o
Relocation section '.rela.text' at offset 0x368 contains 5 entries:
Offset Info Type Sym. Value Sym. Name + Addend
00000000005c 000c00000013 R_390_PC32DBL 0000000000000000 purgatory_sha_regions + 2
00000000007a 000d00000013 R_390_PC32DBL 0000000000000000 sha256_update + 2
00000000008c 000e00000013 R_390_PC32DBL 0000000000000000 sha256_final + 2
000000000092 000800000013 R_390_PC32DBL 0000000000000000 .LC0 + 2
0000000000a0 000f00000013 R_390_PC32DBL 0000000000000000 memcmp + 2
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reported-by: Tao Liu <ltao@redhat.com>
Suggested-by: Philipp Rudo <prudo@redhat.com>
Reviewed-by: Philipp Rudo <prudo@redhat.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20211209073817.82196-1-egorenar@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
It looks like commit ce5e48036c ("ftrace: disable preemption
when recursion locked") missed a spot in kprobe_ftrace_handler() in
arch/s390/kernel/ftrace.c.
Remove the superfluous preempt_disable/enable_notrace() there too.
Fixes: ce5e48036c ("ftrace: disable preemption when recursion locked")
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Link: https://lore.kernel.org/r/20211208151503.1510381-1-jmarchan@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
arch_kexec_apply_relocations_add currently ignores all errors returned
by arch_kexec_do_relocs. This means that every unknown relocation is
silently skipped causing unpredictable behavior while the relocated code
runs. Fix this by checking for errors and fail kexec_file_load if an
unknown relocation type is encountered.
The problem was found after gcc changed its behavior and used
R_390_PLT32DBL relocations for brasl instruction and relied on ld to
resolve the relocations in the final link in case direct calls are
possible. As the purgatory code is only linked partially (option -r)
ld didn't resolve the relocations leaving them for arch_kexec_do_relocs.
But arch_kexec_do_relocs doesn't know how to handle R_390_PLT32DBL
relocations so they were silently skipped. This ultimately caused an
endless loop in the purgatory as the brasl instructions kept branching
to itself.
Fixes: 71406883fd ("s390/kexec_file: Add kexec_file_load system call")
Reported-by: Tao Liu <ltao@redhat.com>
Signed-off-by: Philipp Rudo <prudo@redhat.com>
Link: https://lore.kernel.org/r/20211208130741.5821-3-prudo@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Make arch_stack_walk() available for ARCH_STACKWALK architectures
without it being entangled in STACKTRACE.
Link: https://lore.kernel.org/lkml/20211022152104.356586621@infradead.org/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Mark: rebase, drop unnecessary arm change]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Add missing __pa/__va address conversion of machine check extended
save area designation, which is an absolute address.
Note: this currently doesn't fix a real bug, since virtual addresses
are indentical to physical ones.
Reported-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Tested-by: Vineeth Vijayan <vneethv@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Add missing Kconfig option for ftrace direct multi sample, so it can
be compiled again, and also add s390 support for this sample.
- Update Christian Borntraeger's email address.
- Various fixes for memory layout setup. Besides other this makes it
possible to load shared DCSS segments again.
- Fix copy to user space of swapped kdump oldmem.
- Remove -mstack-guard and -mstack-size compile options when building
vdso binaries. This can happen when CONFIG_VMAP_STACK is disabled
and results in broken vdso code which causes more or less random
exceptions. Also remove the not needed -nostdlib option.
- Fix memory leak on cpu hotplug and return code handling in kexec
code.
- Wire up futex_waitv system call.
- Replace snprintf with sysfs_emit where appropriate.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmGZBmIACgkQIg7DeRsp
bsLevQ//XfCEcvJ1sB4OEiN97xyy5me4FoOo5rWuzG/ZN/YmUH0CkzJHIhjDcCg3
2FslxH5doOA3zLEBCQKXtcW4uaLSgJcqDgFgpE0TZk/6VKB9RD5q2eSjd+akFMGh
HFge54pfgpR7pYYwWRvbqOJRyzkU5oHAjMmt2UweOoX3qwynhMhTrT/03Y9pGMgK
VBHhp+ocfdLGQk3nbehAWsh7AWItWwOtKblsTFoyJ6BW0pxb7Yc6+wrpyxLYCaRK
rCbyXDStvDqjeBSdx2GZDrA7HbVsrZTHA7sSStIW8yIss1/YJXTP0J2PMXmYNbeE
ou2WCg/iti1DNwN7AOR0OdPu1NfPQkyW6NmV8814Haa8Ub3GUc6RCo+U4wlCXAbo
ZcHWlb8sgWgfQMzho3WfgkeXuEohO+nOV/x/JFt+NFcwidNTQKO7FQ8GsyylUcYo
fBhElbn7p44eS1ivMFEwzptBbpH1JVbb30iV7tMWxyjJQ9TkzpsC3Ph14JimSChk
oZuUnmgMztss/ikEMFcDLhd3DNedXfz10Boq6FucD8x46cW5j7o0scwIomcNtxmx
C3Y9JCsDdiXAfS6Et6KGbsuWbigT3NjNKETK0+Be65GYNP/NPD5pXLeKywU++cHe
e+Lucqiej9polcGN3X97lORMDEx0dXpGkM6ZK2rtX66e7rBbB7M=
=n7BA
-----END PGP SIGNATURE-----
Merge tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Add missing Kconfig option for ftrace direct multi sample, so it can
be compiled again, and also add s390 support for this sample.
- Update Christian Borntraeger's email address.
- Various fixes for memory layout setup. Besides other this makes it
possible to load shared DCSS segments again.
- Fix copy to user space of swapped kdump oldmem.
- Remove -mstack-guard and -mstack-size compile options when building
vdso binaries. This can happen when CONFIG_VMAP_STACK is disabled and
results in broken vdso code which causes more or less random
exceptions. Also remove the not needed -nostdlib option.
- Fix memory leak on cpu hotplug and return code handling in kexec
code.
- Wire up futex_waitv system call.
- Replace snprintf with sysfs_emit where appropriate.
* tag 's390-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
ftrace/samples: add s390 support for ftrace direct multi sample
ftrace/samples: add missing Kconfig option for ftrace direct multi sample
MAINTAINERS: update email address of Christian Borntraeger
s390/kexec: fix memory leak of ipl report buffer
s390/kexec: fix return code handling
s390/dump: fix copying to user-space of swapped kdump oldmem
s390: wire up sys_futex_waitv system call
s390/vdso: filter out -mstack-guard and -mstack-size
s390/vdso: remove -nostdlib compiler flag
s390: replace snprintf in show functions with sysfs_emit
s390/boot: simplify and fix kernel memory layout setup
s390/setup: re-arrange memblock setup
s390/setup: avoid using memblock_enforce_memory_limit
s390/setup: avoid reserving memory above identity mapping
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
kexec_file_add_ipl_report ignores that ipl_report_finish may fail and
can return an error pointer instead of a valid pointer.
Fix this and simplify by returning NULL in case of an error and let
the only caller handle this case.
Fixes: 99feaa717e ("s390/kexec_file: Create ipl report and pass to next kernel")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This commit fixes a bug introduced by commit e9e7870f90 ("s390/dump:
introduce boot data 'oldmem_data'").
OLDMEM_BASE was mistakenly replaced by oldmem_data.size instead of
oldmem_data.start.
This bug caused the following error during kdump:
kdump.sh[878]: No program header covering vaddr 0x3434f5245found kexec bug?
Fixes: e9e7870f90 ("s390/dump: introduce boot data 'oldmem_data'")
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When CONFIG_VMAP_STACK is disabled, the user can enable CONFIG_STACK_CHECK,
which adds a stack overflow check to each C function in the kernel. This is
also done for functions in the vdso page. These functions are run in user
context and user stack sizes are usually different to what the kernel uses.
This might trigger the stack check although the stack size is valid.
Therefore filter the -mstack-guard and -mstack-size flags when compiling
vdso C files.
Cc: stable@kernel.org # 5.10+
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Reported-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The -nostdlib option requests the compiler to not use the standard
system startup files or libraries when linking. It is effective only
when $(CC) is used as a linker driver.
Since commit 2b2a25845d ("s390/vdso: Use $(LD) instead of $(CC) to
link vDSO"), $(LD) is directly used, hence -nostdlib is unneeded.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20211107162111.323701-1-masahiroy@kernel.org
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Avoid using ULONG_MAX in memblock_remove, it has no functional change
but makes memblock_dbg output a range which makes sense.
- Actually finish memblock memory setup before doing amode31/cr/uv
setup.
- Move memblock_dump_all() debug output after memblock memory setup is
complete. This gives us final "memory" regions if they were trimmed
due to addressing limits and still "physmem" regions as original info
which came from mem_detect.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is a difference in how architectures treat "mem=" option. For some
that is an amount of online memory, for s390 and x86 this is the limiting
max address. Some memblock api like memblock_enforce_memory_limit()
take limit argument and explicitly treat it as the size of online memory,
and use __find_max_addr to convert it to an actual max address. Current
s390 usage:
memblock_enforce_memory_limit(memblock_end_of_DRAM());
yields different results depending on presence of memory holes (offline
memory blocks in between online memory). If there are no memory holes
limit == max_addr in memblock_enforce_memory_limit() and it does trim
online memory and reserved memory regions. With memory holes present it
actually does nothing.
Since we already use memblock_remove() explicitly to trim online memory
regions to potential limit (think mem=, kdump, addressing limits, etc.)
drop the usage of memblock_enforce_memory_limit() altogether. Trimming
reserved regions should not be required, since we now use
memblock_set_current_limit() to limit allocations and any explicit memory
reservations above the limit is an actual problem we should not hide.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Such reserved memory region, if not cleaned up later causes problems when
memblock_free_all() is called to release free pages to the buddy allocator
and those reserved regions are carried over to reserve_bootmem_region()
which marks the pages as PageReserved.
Instead use memblock_set_current_limit() to make sure memblock allocations
do not go over identity mapping (which could happen when "mem=" option
is used or during kdump).
Cc: stable@vger.kernel.org
Fixes: 73045a08cf ("s390: unify identity mapping limits handling")
Reported-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Add PCI automatic error recovery.
- Fix tape driver timer initialization broken during timers api cleanup.
- Fix bogus CPU measurement counters values on CPUs offlining.
- Check the validity of subchanel before reading other fields in
the schib in cio code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmGPwDMACgkQjYWKoQLX
FBgCoQf/VBel0vDex9NNVo59OmGTNh9NPPT2cUU8vYEmwHfaBeInZVEx5WOxXijl
8MIbEgi6Wt3EwnIghjouC50nk8jCiNOJ8Z/wG+01zZpVpLk2GvKGjxoYxKg+5E6T
sOSr7TMeKOtOp23xKAXGVIzzkrDTSyr3qruTKg/m6TFhQ0XSm/ld2k6tR5AARbuB
UtsxBOtPWyHm1xPyuhjr+c6riK2vGQwJwYya4vGtIW8ix9uZoPabdqqzWsD3meBc
B6fe96YQGxA8Tt80FtyJ6pHEhNDr8CE656aJZNJCnd7q1RmLWC1R/aUme+9wqQtO
i9YmMvc+uzgQonpo+YgWqu9fIQqcXA==
=OjW1
-----END PGP SIGNATURE-----
Merge tag 's390-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Add PCI automatic error recovery.
- Fix tape driver timer initialization broken during timers api
cleanup.
- Fix bogus CPU measurement counters values on CPUs offlining.
- Check the validity of subchanel before reading other fields in the
schib in cio code.
* tag 's390-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cio: check the subchannel validity for dev_busid
s390/cpumf: cpum_cf PMU displays invalid value after hotplug remove
s390/tape: fix timer initialization in tape_std_assign()
s390/pci: implement minimal PCI error recovery
PCI: Export pci_dev_lock()
s390/pci: implement reset_slot for hotplug slot
s390/pci: refresh function handle in iomap
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
When a CPU is hotplugged while the perf stat -e cycles command is
running, a wrong (very large) value is displayed immediately after the
CPU removal:
Check the values, shouldn't be too high as in
time counts unit events
1.001101919 29261846 cycles
2.002454499 17523405 cycles
3.003659292 24361161 cycles
4.004816983 18446744073638406144 cycles
5.005671647 <not counted> cycles
...
The CPU hotplug off took place after 3 seconds.
The issue is the read of the event count value after 4 seconds when
the CPU is not available and the read of the counter returns an
error. This is treated as a counter value of zero. This results
in a very large value (0 - previous_value).
Fix this by detecting the hotplugged off CPU and report 0 instead
of a very large number.
Cc: stable@vger.kernel.org
Fixes: a029a4eab3 ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility")
Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Add support for ftrace with direct call and ftrace direct call samples.
- Add support for kernel command lines longer than current 896 bytes and
make its length configurable.
- Add support for BEAR enhancement facility to improve last breaking
event instruction tracking.
- Add kprobes sanity checks and testcases to prevent kprobe in the mid
of an instruction.
- Allow concurrent access to /dev/hwc for the CPUMF users.
- Various ftrace / jump label improvements.
- Convert unwinder tests to KUnit.
- Add s390_iommu_aperture kernel parameter to tweak the limits on
concurrently usable DMA mappings.
- Add ap.useirq AP module option which can be used to disable interrupt
use.
- Add add_disk() error handling support to block device drivers.
- Drop arch specific and use generic implementation of strlcpy and strrchr.
- Several __pa/__va usages fixes.
- Various cio, crypto, pci, kernel doc and other small fixes and
improvements all over the code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmGFW6EACgkQjYWKoQLX
FBg20Qf/UbohgnKnE6vxbbH3sNTlI2dk3Cw4z3IobcsZgqXAu6AFLgLQGLk/X07F
DIyUdrgSgCzLIEKLqrLrFXIOMIK44zAGaurIltNt7IrnWWlA+/YVD+YeL2gHwccq
wT7KXRcrVMZQ1z18djJQ45DpPUC8ErBdL6+P+ftHck90YGFZsfMA5S7jf8X1h08U
IlqdPTmY8t4unKHWVpHbxx9b+xrUuV6KTEXADsllpMV2jQoTLdDECd3vmefYR6tR
3lssgop1m/RzH5OCqvia5Sy2D5fOQObNWDMakwOkVMxOD43lmGCTHstzS2Uo2OFE
QcY79lfZ5NrzKnenUdE5Fd0XJ9kSwQ==
=k0Ab
-----END PGP SIGNATURE-----
Merge tag 's390-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Add support for ftrace with direct call and ftrace direct call
samples.
- Add support for kernel command lines longer than current 896 bytes
and make its length configurable.
- Add support for BEAR enhancement facility to improve last breaking
event instruction tracking.
- Add kprobes sanity checks and testcases to prevent kprobe in the mid
of an instruction.
- Allow concurrent access to /dev/hwc for the CPUMF users.
- Various ftrace / jump label improvements.
- Convert unwinder tests to KUnit.
- Add s390_iommu_aperture kernel parameter to tweak the limits on
concurrently usable DMA mappings.
- Add ap.useirq AP module option which can be used to disable interrupt
use.
- Add add_disk() error handling support to block device drivers.
- Drop arch specific and use generic implementation of strlcpy and
strrchr.
- Several __pa/__va usages fixes.
- Various cio, crypto, pci, kernel doc and other small fixes and
improvements all over the code.
[ Merge fixup as per https://lore.kernel.org/all/YXAqZ%2FEszRisunQw@osiris/ ]
* tag 's390-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (63 commits)
s390: make command line configurable
s390: support command lines longer than 896 bytes
s390/kexec_file: move kernel image size check
s390/pci: add s390_iommu_aperture kernel parameter
s390/spinlock: remove incorrect kernel doc indicator
s390/string: use generic strlcpy
s390/string: use generic strrchr
s390/ap: function rework based on compiler warning
s390/cio: make ccw_device_dma_* more robust
s390/vfio-ap: s390/crypto: fix all kernel-doc warnings
s390/hmcdrv: fix kernel doc comments
s390/ap: new module option ap.useirq
s390/cpumf: Allow multiple processes to access /dev/hwc
s390/bitops: return true/false (not 1/0) from bool functions
s390: add support for BEAR enhancement facility
s390: introduce nospec_uses_trampoline()
s390: rename last_break to pgm_last_break
s390/ptrace: add last_break member to pt_regs
s390/sclp: sort out physical vs virtual pointers usage
s390/setup: convert start and end initrd pointers to virtual
...
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
We want to specify flags when hotplugging memory. Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.
Note that when hotplugging memory the system is already up and running
and we might have concurrent memblock users: for example, while we're
hotplugging memory, kexec_file code might search for suitable memory
regions to place kexec images. It's important to add the memory
directly to memblock via a single call with the right flags, instead of
adding the memory first and apply flags later: otherwise, concurrent
memblock users might temporarily stumble over memblocks with wrong
flags, which will be important in a follow-up patch that introduces a
new flag to properly handle add_memory_driver_managed().
Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Shahab Vahedi <shahab@synopsys.com> [arch/arc]
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since memblock_free() operates on a physical range, make its name
reflect it and rename it to memblock_phys_free(), so it will be a
logical counterpart to memblock_phys_alloc().
The callers are updated with the below semantic patch:
@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);
Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memblock_free_early_nid() is unused and memblock_free_early() is an
alias for memblock_free().
Replace calls to memblock_free_early() with calls to memblock_free() and
remove memblock_free_early() and memblock_free_early_nid().
Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* More progress on the protected VM front, now with the full
fixed feature set as well as the limitation of some hypercalls
after initialisation.
* Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
* Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
* More memcg accounting of the memory allocated on behalf of a guest
* Timer and vgic selftests
* Workarounds for the Apple M1 broken vgic implementation
* KConfig cleanups
* New kvmarm.mode=none option, for those who really dislike us
RISC-V:
* New KVM port.
x86:
* New API to control TSC offset from userspace
* TSC scaling for nested hypervisors on SVM
* Switch masterclock protection from raw_spin_lock to seqcount
* Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
* Convey the exit reason to userspace on emulation failure
* Configure time between NX page recovery iterations
* Expose Predictive Store Forwarding Disable CPUID leaf
* Allocate page tracking data structures lazily (if the i915
KVM-GT functionality is not compiled in)
* Cleanups, fixes and optimizations for the shadow MMU code
s390:
* SIGP Fixes
* initial preparations for lazy destroy of secure VMs
* storage key improvements/fixes
* Log the guest CPNC
Starting from this release, KVM-PPC patches will come from
Michael Ellerman's PPC tree.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGBOiEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNowwf/axlx3g9sgCwQHr12/6UF/7hL/RwP
9z+pGiUzjl2YQE+RjSvLqyd6zXh+h4dOdOKbZDLSkSTbcral/8U70ojKnQsXM0XM
1LoymxBTJqkgQBLm9LjYreEbzrPV4irk4ygEmuk3CPOHZu8xX1ei6c5LdandtM/n
XVUkXsQY+STkmnGv4P3GcPoDththCr0tBTWrFWtxa0w9hYOxx0ay1AZFlgM4FFX0
QFuRc8VBLoDJpIUjbkhsIRIbrlHc/YDGjuYnAU7lV/CIME8vf2BW6uBwIZJdYcDj
0ejozLjodEnuKXQGnc8sXFioLX2gbMyQJEvwCgRvUu/EU7ncFm1lfs7THQ==
=UxKM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmGANdUUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOmihAAgKSTv4Jf0s4yopdcxfuLweiyqHX1
719QJzdLZohmllrJPq/83FZL9qodCzxy87nAm67Ht0baSKiEjtVgRaVCqJWEE+l6
oQL+wUsGLP7CmExOP503Uh6tW35AhETQA4Uwu6QtiUYLYG17kAgeR3cTFuekUsJS
iL4K65PXE2bBxMe7Ta1YIZqcxptbknMgpqYkdne7xs7RS+UiVj8TyRle6ACrfzEX
IVy4LTk+spHCy1a494g9pt/21xOnbiLHr/FpckALscnvJiUThxbfQHGSQeMpM4uM
BnwCqFrj860vMeh52M11/GAAXmdPh6AjoLhaSIW2I3M2GbV8ZP2hu1HYUz3osmrT
f+aeMPJ4feX1xVj6qAC+1G83XRO83tP/YIEuocGiwyepImB25NHPin21xepf6Ru0
wJX+aXC9O1eG6E2ghT6tBim/MpeNH5OT0hNO3uhGmEQ6xZpArRVVaBwlEdufJiCx
ZljqEFUT7wA9nGEQif6GdLnGezGr/aNL65caTkIAzHKamd79QIr7VZXYjYIfHSqE
p74Aro6E8qoQJjsTSkvZceM0u1LRzwS4wPRroE6eGz98oYDpiDm1RPb+9Gw5jyJf
JN7UjJKO9+iPGAi3KivGBqpBskw4cCp2y/nHrMYmpGUPELcr5kQtDfQ6yp59tVZ8
Dwo5GeSlG6khmiI=
=WrEw
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Add some additional audit logging to capture the openat2() syscall
open_how struct info.
Previous variations of the open()/openat() syscalls allowed audit
admins to inspect the syscall args to get the information contained in
the new open_how struct used in openat2()"
* tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: return early if the filter rule has a lower priority
audit: add OPENAT2 record to list "how" info
audit: add support for the openat2 syscall
audit: replace magic audit syscall class numbers with macros
lsm_audit: avoid overloading the "key" audit field
audit: Convert to SPDX identifier
audit: rename struct node to struct audit_node to prevent future name collisions
- kprobes: Restructured stack unwinder to show properly on x86 when a stack
dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only denying
others). There's been pressure to allow non root to tracefs in a
controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer
instead of having its own trampoline (this change will happen on an arch
by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations
against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent warnings
from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if
branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYBdxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qp1sAQD2oYFwaG3sx872gj/myBcHIBSKdiki
Hry5csd8zYDBpgD+Poylopt5JIbeDuoYw/BedgEXmscZ8Qr7VzjAXdnv/Q4=
=Loz8
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Now that force_fatal_sig exists it is unnecessary and a bit confusing
to use force_sigsegv in cases where the simpler force_fatal_sig is
wanted. So change every instance we can to make the code clearer.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reading the history it is unclear why default_trap_handler calls
do_exit. It is not even menthioned in the commit where the change
happened. My best guess is that because it is unknown why the
exception happened it was desired to guarantee the process never
returned to userspace.
Using do_exit(SIGSEGV) has the problem that it will only terminate one
thread of a process, leaving the process in an undefined state.
Use force_sigsegv(SIGSEGV) instead which effectively has the same
behavior except that is uses the ordinary signal mechanism and
terminates all threads of a process and is generally well defined.
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Fixes: ca2ab03237ec ("[PATCH] s390: core changes")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lkml.kernel.org/r/20211020174406.17889-11-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.
The PG_arch_1 flag is always allowed to overindicate; using the new
functions introduced here allows to reduce the extent of overindication
and thus improve performance.
These new functions can only be called on pages for which a reference
is already being held.
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210920132502.36111-7-imbrenda@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Currently s390 supports a fixed maximum command line length of 896
bytes. This isn't enough as some installers are trying to pass all
configuration data via kernel command line, and even with zfcp alone
it is easy to generate really long command lines. Therefore extend
the command line to 4 kbytes.
In the parm area where the command line is stored there is no indication
of the maximum allowed length, so a new field which contains the maximum
length is added.
The parm area has always been initialized to zero, so with old kernels
this field would read zero. This is important because tools like zipl
could read this field. If it contains a number larger than zero zipl
knows the maximum length that can be stored in the parm area, otherwise
it must assume that it is booting a legacy kernel and only 896 bytes are
available.
The removing of trailing whitespace in head.S is also removed because
code to do this is already present in setup_boot_command_line().
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In preparation of adding support for command lines with variable
sizes on s390, the check whether the new kernel image is at least HEAD_END
bytes long isn't correct. Move the check to kexec_file_add_components()
so we can get the size of the parm area and check the size there.
The '.org HEAD_END' directive can now also be removed from head.S. This
was used in the past to reserve space for the early sccb buffer, but with
commit 9a5131b87cac1 ("s390/boot: move sclp early buffer from fixed address
in asm to C") this is no longer required.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Commit a029a4eab3 ("s390/cpumf: Allow concurrent access for CPU Measurement Counter Facility")
added CPU Measurement counter facility access to multiple consumers.
It allows concurrent access to the CPU Measurement counter facility
via several perf_event_open() system call invocations and via ioctl()
system call of device /dev/hwc. However the access via device /dev/hwc
was exclusive, only one process was able to open this device.
The patch removes this restriction. Now multiple invocations of lshwc
can execute in parallel. They can access different CPUs and counter
sets or CPUs and counter set can overlap.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The Breaking-Event-Address-Register (BEAR) stores the address of the
last breaking event instruction. Breaking events are usually instructions
that change the program flow - for example branches, and instructions
that modify the address in the PSW like lpswe. This is useful for debugging
wild branches, because one could easily figure out where the wild branch
was originating from.
What is problematic is that lpswe is considered a breaking event, and
therefore overwrites BEAR on kernel exit. The BEAR enhancement facility
adds new instructions that allow to save/restore BEAR and also an lpswey
instruction that doesn't cause a breaking event. So we can save BEAR on
kernel entry and restore it on exit to user space.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
and replace all of the "__is_defined(CC_USING_EXPOLINE) && !nospec_disable"
occurrences.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With the upcoming BEAR enhancements last_break isn't really
unique, so rename it to pgm_last_break. This way it should
be more obvious that this is the last_break value that is
written by the hardware when a program check occurs.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Instead of using args[0] for the value of the last breaking event
address register, add a member to make things more obvious.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Provide physical addresses whenever the hardware interface
expects it or a 32-bit value used for tracking.
Variable sclp_early_sccb gets initialized in the decompressor
and points to an address in physcal memory. Yet, it is used
as virtual memory pointer and therefore should be converted.
Note, the other two __bootdata variables sclp_info_sccb and
sclp_info_sccb_valid contain plain data, but no pointers and
do need any special care.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Variables initrd_start and initrd_end are expected to hold
virtual memory pointers, not physical.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
memblock_reserve() function accepts physcal address of a memory
block to be reserved, but provided with virtual memory pointers.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Improve make_secure_pte to avoid stalls when the system is heavily
overcommitted. This was especially problematic in kvm_s390_pv_unpack,
because of the loop over all pages that needed unpacking.
Due to the locks being held, it was not possible to simply replace
uv_call with uv_call_sched. A more complex approach was
needed, in which uv_call is replaced with __uv_call, which does not
loop. When the UVC needs to be executed again, -EAGAIN is returned, and
the caller (or its caller) will try again.
When -EAGAIN is returned, the path is the same as when the page is in
writeback (and the writeback check is also performed, which is
harmless).
Fixes: 214d9bbcd3 ("s390/mm: provide memory management functions for protected KVM guests")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Link: https://lore.kernel.org/r/20210920132502.36111-5-imbrenda@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f260 ("mm: mmap: zap pages
with read mmap_sem in munmap").
find_vma() does not check if the address is >= the VMA start address;
use vma_lookup() instead.
Fixes: 214d9bbcd3 ("s390/mm: provide memory management functions for protected KVM guests")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/r/20210909162248.14969-6-david@redhat.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
On nds32, openrisc, s390, sh, and xtensa the function die never
returns. Mark die __noreturn so that no one expects die to return.
Remove the do_exit calls after die as they will never be reached.
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: openrisc@lists.librecores.org
Cc: Nick Hu <nickhu@andestech.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Fixes: 2.3.16
Fixes: 2.3.99-pre8
Fixes: 3f65ce4d14 ("[PATCH] xtensa: Architecture support for Tensilica Xtensa Part 5")
Fixes: 664eec400b ("nds32: MMU fault handling and page table management")
Fixes: 61e85e3675 ("OpenRISC: Memory management")
Link: https://lkml.kernel.org/r/20211020174406.17889-2-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Make STACK_FRAME_OVERHEAD available via asm-offsets.h. This allows to
add s390 specific asm code to e.g. ftrace samples, without requiring
to add random header files, which might cause all sort of problems on
other architectures. asm-offsets.h can be assumed to be non-problematic.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20211012133802.2460757-3-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Having a stable wchan means the process must be blocked and for it to
stay that way while performing stack unwinding.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
ftrace_regs_caller is an alias to ftrace_caller - making ftrace_caller
quite heavyweight. Split the function and provide an ftrace_caller
implementation which comes with fewer instructions. Especially getting
rid of 'stosm' on each function entry should help here, e.g. to
have less performance impact on live patched functions.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Trivial patch just to get rid of the leading underscores.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Specify HAVE_JUMP_LABEL_BATCH in header file. This allows to make use
of the arch_jump_label_transform_queue()/arch_jump_label_transform_apply()
mechanism.
However unlike on x86, which currently is the only user of this
mechanism, the to be patched instructions are still directly
modified. The only difference to before is that serialization is only
done after all instructions have been modified. This way the number of
serialization/synchronization events is reduced.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
CPUs must be serialized also when ftrace_graph_caller gets patched.
This is missing since ftrace function graph support was added on s390.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Introduce a text_poke_sync() similar to what x86 has. This can be
used to execute a serializing instruction on all CPUs (including
the current one).
Note: according to the Principles of Operation an IPI (= interrupt)
will already serialize a CPU, however it is better to be explicit. In
addition on_each_cpu() makes sure that also the current CPU get
serialized - just to make sure that possible preemption can prevent
some theoretical case where a CPU will not be serialized.
Therefore text_poke_sync() has to be used whenever code got modified,
just to avoid to rely on implicit serialization.
Also introduce text_poke_sync_lock() which will also disable CPU
hotplug, to prevent that any CPU is just going online with a
prefetched old version of a modified instruction.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
ftrace_dyn_arch_init() to cleanup them.
Link: https://lkml.kernel.org/r/20210909090216.1955240-1-o451686892@gmail.com
Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
Acked-by: Helge Deller <deller@gmx.de> (parisc)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The memory for amode31 section is allocated from the decompressed
kernel. Instead, allocate that memory from the decompressor. This
is a prerequisite to allow initialization of the virtual memory
before the decompressed kernel takes over.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Partially revert commit 4555b9f34296 ("s390/boot: move
dma sections from decompressor to decompressed kernel").
This is a prerequisite to allow initialization of virtual
memory in decompressor and avoid overwriting of ASCEs in
the decompressed kernel otherwise.
Since the control registers 2, 5 and 15 are reinitialized
in the decompressed kernel again, this change does not
prevent relocating of amode31 section in any way.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Check whether the specified address points to the start of an
instruction to prevent users from setting a kprobe in the mid of
an instruction which would crash the kernel.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
ftrace_shared_hotpatch_trampoline() never returns NULL,
therefore quite a bit of code can be removed.
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Replace audit syscall class magic numbers with macros.
This required putting the macros into new header file
include/linux/audit_arch.h since the syscall macros were
included for both 64 bit and 32 bit in any compat code, causing
redefinition warnings.
Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: renamed header to audit_arch.h after consulting with Richard]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Since now there is kretprobe_trampoline_addr() for referring the
address of kretprobe trampoline code, we don't need to access
kretprobe_trampoline directly.
Make it harder to refer by renaming it to __kretprobe_trampoline().
Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:
dereference_kernel_function_descriptor(kretprobe_trampoline)
In fact we had bugs in arch code where it wasn't replicated correctly.
So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.
Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This clean up the error/notification messages in kprobes related code.
Basically this defines 'pr_fmt()' macros for each files and update
the messages which describes
- what happened,
- what is the kernel going to do or not do,
- is the kernel fine,
- what can the user do about it.
Also, if the message is not needed (e.g. the function returns unique
error code, or other error message is already shown.) remove it,
and replace the message with WARN_*() macros if suitable.
Link: https://lkml.kernel.org/r/163163036568.489837.14085396178727185469.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Fix topology update on cpu hotplug, so notifiers see expected masks. This bug
was uncovered with SCHED_CORE support.
- Fix stack unwinding so that the correct number of entries are omitted like
expected by common code. This fixes KCSAN selftests.
- Add kmemleak annotation to stack_alloc to avoid false positive kmemleak
warnings.
- Avoid layering violation in common I/O code and don't unregister subchannel
from child-drivers.
- Remove xpram device driver for which no real use case exists since the kernel
is 64 bit only. Also all hypervisors got required support removed in the
meantime, which means the xpram device driver is dead code.
- Fix -ENODEV handling of clp_get_state in our PCI code.
- Enable KFENCE in debug defconfig.
- Cleanup hugetlbfs s390 specific Kconfig dependency.
- Quite a lot of trivial fixes to get rid of "W=1" warnings, and and other
simple cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmE56jEACgkQIg7DeRsp
bsI1sQ/+L91zvpjlWGEPjZhQmFJgDufuObLWJlhwOSPsOlezzJTujNscoisTe6Wm
hfS1I/GzGsgcY3695xgBLgkPS37nrDdDLAgM4CnajOOalEZjbHgH5gcPiCPHfPAD
QkvVFv2PjCQnaPx81kEIeK6tMFkvi6IRhfwhtGTf1fwoKDyw4IQT1couBsiuAy3n
28/7NqMidS4gbv5X/BLK1Ez4as9d3PoecNre1debRPOZcdxIjCVDy7OW5MotI3ol
ENsOHtNJe/orIDCc+QbsEP2xZJZdbZ0D0Zr/RQ4KEue42wKtGLzp/ZuG+UfTPyyx
vlEDgMRgPHAGnceEImcMwK0XQwOn05sm13jOkbmpIwhmiE46rksAPf3cGL4DjlBP
3rznDXoLYELX2OAHz2G4jfbrqFWDxbh5rp1NMr8tELvJV5xbdsMC11QFQY28swod
/sUE39fX+zynwHSSttq0PXtKX4gr/d5ZMDdlhjl7lxlOgwEwDodBL3/xL81+C0qx
jkQWDsJ6OpZ7iJpGvxaCUhFjlgihdi2InZ942inRGo/A/EaM6/7diExLiyqfaab5
WEQ2BOlITUey85Fiu2WxeeweRChUwu+XNQt+Nx4hDF454K51htU/GJCUBW5Z5qtN
Dm+/DolXkPY+joR7xBLHNzivob3ShcsoFiZjoBpTc/Hd18dhSQg=
=fpJz
-----END PGP SIGNATURE-----
Merge tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
"Except for the xpram device driver removal it is all about fixes and
cleanups.
- Fix topology update on cpu hotplug, so notifiers see expected
masks. This bug was uncovered with SCHED_CORE support.
- Fix stack unwinding so that the correct number of entries are
omitted like expected by common code. This fixes KCSAN selftests.
- Add kmemleak annotation to stack_alloc to avoid false positive
kmemleak warnings.
- Avoid layering violation in common I/O code and don't unregister
subchannel from child-drivers.
- Remove xpram device driver for which no real use case exists since
the kernel is 64 bit only. Also all hypervisors got required
support removed in the meantime, which means the xpram device
driver is dead code.
- Fix -ENODEV handling of clp_get_state in our PCI code.
- Enable KFENCE in debug defconfig.
- Cleanup hugetlbfs s390 specific Kconfig dependency.
- Quite a lot of trivial fixes to get rid of "W=1" warnings, and and
other simple cleanups"
* tag 's390-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
hugetlbfs: s390 is always 64bit
s390/ftrace: remove incorrect __va usage
s390/zcrypt: remove incorrect kernel doc indicators
scsi: zfcp: fix kernel doc comments
s390/sclp: add __nonstring annotation
s390/hmcdrv_ftp: fix kernel doc comment
s390: remove xpram device driver
s390/pci: read clp_list_pci_req only once
s390/pci: fix clp_get_state() handling of -ENODEV
s390/cio: fix kernel doc comment
s390/ctrlchar: fix kernel doc comment
s390/con3270: use proper type for tasklet function
s390/cpum_cf: move array from header to C file
s390/mm: fix kernel doc comments
s390/topology: fix topology information when calling cpu hotplug notifiers
s390/unwind: use current_frame_address() to unwind current task
s390/configs: enable CONFIG_KFENCE in debug_defconfig
s390/entry: make oklabel within CHKSTG macro local
s390: add kmemleak annotation in stack_alloc()
s390/cio: dont unregister subchannel from child-drivers
These are all handled correctly when calling the native system call entry
point, so remove the special cases.
Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The address of ftrace_graph_caller is already virtual.
Using __va() to translate the address is wrong.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move array from header to C file to avoid that it gets defined in
every C file where the header is included:
In file included from arch/s390/kernel/perf_cpum_cf_common.c:19:
./arch/s390/include/asm/cpu_mcf.h:27:18: warning: ‘cpumf_ctr_ctl’ defined but not used [-Wunused-const-variable=]
27 | static const u64 cpumf_ctr_ctl[CPUMF_CTR_SET_MAX] = {
| ^~~~~~~~~~~~~
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The cpu hotplug notifiers are called without updating the core/thread
masks when a new CPU is added. This causes problems with code setting
up data structures in a cpu hotplug notifier, and relying on that later
in normal code.
This caused a crash in the new core scheduling code (SCHED_CORE),
where rq->core was set up in a notifier depending on cpu masks.
To fix this, add a cpu_setup_mask which is used in update_cpu_masks()
instead of the cpu_online_mask to determine whether the cpu masks should
be set for a certain cpu. Also move update_cpu_masks() to update the
masks before calling notify_cpu_starting() so that the notifiers are
seeing the updated masks.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Cc: <stable@vger.kernel.org>
[hca@linux.ibm.com: get rid of cpu_online_mask handling]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.
memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.
Replace the calls to memblock_find_in_range() with an equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range() and make
memblock_find_in_range() private method of memblock.
This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().
Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI]
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv]
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull exit cleanups from Eric Biederman:
"In preparation of doing something about PTRACE_EVENT_EXIT I have
started cleaning up various pieces of code related to do_exit. Most of
that code I did not manage to get tested and reviewed before the merge
window opened but a handful of very useful cleanups are ready to be
merged.
The first change is simply the removal of the bdflush system call. The
code has now been disabled long enough that even the oldest userspace
working userspace setups anyone can find to test are fine with the
bdflush system call being removed.
Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
calling do_exit directly is interesting only in that it is nearly the
most difficult of the incorrect uses of do_exit to remove.
The change to the seccomp code to simply send a signal instead of
calling do_coredump directly is a very nice little cleanup made
possible by realizing the existing signal sending helpers were missing
a little bit of functionality that is easy to provide"
* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal/seccomp: Dump core when there is only one live thread
signal/seccomp: Refactor seccomp signal and coredump generation
signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
exit/bdflush: Remove the deprecated bdflush system call
Make the oklabel within the CHKSTG macro local. This makes sure that
tools like objdump and the crash debugging tool still disassemble full
functions where the macro has been used instead of stopping half way
where such a global label is used and one has to guess how to
disassemble the rest of such a function:
E.g.:
0000000000cb0270 <mcck_int_handler>:
cb0270: b2 05 03 20 stck 800
...
cb0354: a7 74 00 97 jne cb0482 <oklabel270+0xe2>
0000000000cb0358 <oklabel243>:
cb0358: c0 e0 00 22 4e 8f larl %r14,10fa076 <opcode+0x2558>
...
Fixes: d35925b349 ("s390/mcck: move storage error checks to assembler")
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
kmemleak with enabled auto scanning reports that our stack allocation is
lost. This is because we're saving the pointer + STACK_INIT_OFFSET to
lowcore. When kmemleak now scans the objects, it thinks that this one is
lost because it can't find a corresponding pointer.
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Improve ftrace code patching so that stop_machine is not required anymore.
This requires a small common code patch acked by Steven Rostedt:
https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/
- Enable KCSAN for s390. This comes with a small common code change to fix a
compile warning. Acked by Marco Elver:
https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
- Add KFENCE support for s390. This also comes with a minimal x86 patch from
Marco Elver who said also this can be carried via the s390 tree:
https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/
- More changes to prepare the decompressor for relocation.
- Enable DAT also for CPU restart path.
- Final set of register asm removal patches; leaving only three locations where
needed and sane.
- Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO support to
hwcaps flags.
- Cleanup hwcaps implementation.
- Add new instructions to in-kernel disassembler.
- Various QDIO cleanups.
- Add SCLP debug feature.
- Various other cleanups and improvements all over the place.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmEs1GsACgkQIg7DeRsp
bsJ9+A/9FApCECNPgu6jOX4Ee+no+LxpCPUF8rvt56TFTLv7+Dhm7fJl0xQ9utsZ
FyLMDAr1/FKdm2wBW23QZH4vEIt1bd6e/03DwwK+6IjHKZHRIfB8eGJMsLj/TDzm
K6/+FI7qXjvpNXxgkCqXf5yESi/y5Dgr+16kTBhPZj5awRiwe5puPamji3uiQ45V
r4MdGCCC9BnTZvtPpUrr8ImnUqHJ4/TMo1YYdykLbZFuAvvYUyZ5YG5kh0pMa8JZ
DGJpfLQfy7ZNscIzdVhZtfzzESVtS6/AOeBzDMO1pbM1CGXtvpJJP0Wjlr/PGwoW
fvuMHpqTlDi+TfNZiPP5lwsFC89xSd6gtZH7vAuI8kFCXgW3RMjABF6h/mzpH1WO
jXyaSEZROc/83gxPMYyOYiqrKyAFPbpZ/Rnav2bvGQGneqx7RvmpF3GgA9WEo1PW
rMDoEbLstJuHk0E2uEV+OnQd5F7MHNonzpYfp/7pyQ+PL8w2GExV9yngVc/f3TqG
HYLC9rc3K6DkxZappcJm0qTb7lDTMFI7xK3g9RiqPQBJE1v1MYE/rai48nW69ypE
bRNL76AjyXKo+zKR2wlhJVMY1I1+DarMopHhZj6fzQT5te1LLsv8OuTU2gkt6dIq
YoSYOYvModf3HbKnJul2tszQG9yl+vpE9MiCyBQSsxIYXCriq/c=
=WDRh
-----END PGP SIGNATURE-----
Merge tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Improve ftrace code patching so that stop_machine is not required
anymore. This requires a small common code patch acked by Steven
Rostedt:
https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/
- Enable KCSAN for s390. This comes with a small common code change to
fix a compile warning. Acked by Marco Elver:
https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
- Add KFENCE support for s390. This also comes with a minimal x86 patch
from Marco Elver who said also this can be carried via the s390 tree:
https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/
- More changes to prepare the decompressor for relocation.
- Enable DAT also for CPU restart path.
- Final set of register asm removal patches; leaving only three
locations where needed and sane.
- Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO
support to hwcaps flags.
- Cleanup hwcaps implementation.
- Add new instructions to in-kernel disassembler.
- Various QDIO cleanups.
- Add SCLP debug feature.
- Various other cleanups and improvements all over the place.
* tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (105 commits)
s390: remove SCHED_CORE from defconfigs
s390/smp: do not use nodat_stack for secondary CPU start
s390/smp: enable DAT before CPU restart callback is called
s390: update defconfigs
s390/ap: fix state machine hang after failure to enable irq
KVM: s390: generate kvm hypercall functions
s390/sclp: add tracing of SCLP interactions
s390/debug: add early tracing support
s390/debug: fix debug area life cycle
s390/debug: keep debug data on resize
s390/diag: make restart_part2 a local label
s390/mm,pageattr: fix walk_pte_level() early exit
s390: fix typo in linker script
s390: remove do_signal() prototype and do_notify_resume() function
s390/crypto: fix all kernel-doc warnings in vfio_ap_ops.c
s390/pci: improve DMA translation init and exit
s390/pci: simplify CLP List PCI handling
s390/pci: handle FH state mismatch only on disable
s390/pci: fix misleading rc in clp_set_pci_fn()
s390/boot: factor out offset_vmlinux_info() function
...
The secondary CPU start C routine uses nodat_stack as a
interim stack before finally switching to kernel_stack.
Such scheme is superfluous, since the assembler restart
interrupt handler (that secondary CPU starter is called
from) does not need to use any stack for switching into
DAT mode. Once DAT is on, any stack including virtually-
mapped one could be used.
Avoid the use of nodat_stack and smp_start_secondary()
helper. Instead, initiate kernel_stack directly from
the restart interrupt handler.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The restart interrupt is triggered whenever a secondary CPU is
brought online, a remote function call dispatched from another
CPU or a manual PSW restart is initiated and causes the system
to kdump. The handling routine is always called with DAT turned
off. It then initializes the stack frame and invokes a callback.
The existing callbacks handle DAT as follows:
* __do_restart() and __machine_kexec() turn in on upon entry;
* __ipl_run(), __reipl_run() and __dump_run() do not turn it
right away, but all of them call diag308() - which turns DAT
on, but only if kasan is enabled;
In addition to the described complexity all callbacks (and the
functions they call) should avoid kasan instrumentation while
DAT is off.
This update enables DAT in the assembler restart handler and
relieves any callbacks (which are mostly C functions) from
dealing with DAT altogether.
There are four types of CPU restart that initialize control
registers in different ways:
1. Start of secondary CPU on boot - control registers are
inherited from the IPL CPU;
2. Restart of online CPU - control registers of the CPU being
restarted are kept;
3. Hotplug of offline CPU - control registers are inherited
from the starting CPU;
4. Start of offline CPU triggered by manual PSW restart -
the control registers are read from the absolute lowcore
and contain the boot time IPL CPU values updated with all
follow-up calls of smp_ctl_set_bit() and smp_ctl_clear_bit()
routines;
In first three cases contents of the control registers is the
most recent. In the latter case control registers are good
enough to facilitate successful completion of kdump operation.
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add tracing of interactions between the SCLP base driver, firmware and
other drivers to support problem determination in case of SCLP-related
issues.
For that purpose this patch introduces two new s390dbf debug areas:
- sclp: An abbreviated log of all common interactions
- sclp_err: A full log of failed or abnormal interactions
Tracing of full SCCB contents can be enabled for the sclp area by
setting its debug level to maximum (6).
Overview of added trace events:
* Firmware interaction:
- SRV1: Service call about to be issued
- SRV2: Service call was issued
- INT: Interrupt received
* Driver interaction:
- RQAD: Request was added
- RQOK: Request success
- RQAB: Request aborted
- RQTM: Request timed out
- REG: Event listener registered
- UREG: Event listener unregistered
- EVNT: Event callback
- STCG: State-change callback
* Abnormal events:
- TMO: A timeout occurred
- UNEX: Unexpected SCCB completion
* Other (not traced at default level):
- SYN1: Synchronous wait start
- SYN2: Synchronous wait end
Since the SCLP interface is used by console drivers this patch also
moves s390dbf printks outside the critical section protected by debug
area locks to prevent a potential deadlock that would otherwise be
introduced between console_owner --> sclp_lock --> sclp_debug.lock.
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Debug areas can currently only be used after s390dbf initialization
which occurs as a postcore_initcall. This is too late for tracing
earlier code such as that related to console_init().
This patch introduces a macro for defining a statically initialized
debug area that can be used to trace very early code. The macro is made
available for built-in code only because modules are never running
during early boot.
Example usage:
1. Define static debug area:
DEFINE_STATIC_DEBUG_INFO(my_debug, "my_debug", 4, 1, 16,
&debug_hex_ascii_view);
2. Add trace entry:
debug_event(&my_debug, 0, "DATA", 4);
Note: The debug area is automatically registered in debugfs during boot.
A driver must not call any of the debug_register()/_unregister()
functions on a static debug_info_t!
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently allocation and registration of s390dbf debug areas are tied
together. As a result, a debug area cannot be unregistered and
re-registered while any process has an associated debugfs file open.
Fix this by splitting alloc/release from register/unregister.
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Any previously recorded s390dbf debug data is reset when a debug area
is resized using the 'pages' sysfs attribute. This can make
live-debugging unnecessarily complex.
Fix this by copying existing debug data to the newly allocated debug
area when resizing.
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Avoid that the "restart_part2" label, which is in the middle of a
function, appears in /proc/kallsyms.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Rename amod31 to amode31 like it was supposed to be.
Fixes: c78d0c7484 ("s390: rename dma section to amode31")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Both are no longer used since the conversion to generic
entry, therefore remove them.
Fixes: 56e62a7370 ("s390: convert to generic entry")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The memory block occupied by the SCLP early buffer that is allocated
by the decompressor and then handed over to the decompressed kernel,
must be reserved to prevent it from being reused for other purposes.
This is necessary because the SCLP early buffer is still in use
during kernel initialization.
Fixes: f1d3c53237 ("s390/boot: move sclp early buffer from fixed address in asm to C")
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reported-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The dma section name is confusing, since the code which resides within
that section has nothing to do with direct memory access. Instead the
limitation is that the code has to run in 31 bit addressing mode, and
therefore has to reside below 2GB. So the name was chosen since
ZONE_DMA is the same region.
To reduce confusion rename the section to amode31, which hopefully
describes better what this is about.
Note: this will also change vmcoreinfo strings
- SDMA=... gets renamed to SAMODE31=...
- EDMA=... gets renamed to EAMODE31=...
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-5-bigeasy@linutronix.de
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390 allows hotpatching the mask of a conditional jump instruction.
Make use of this feature in order to avoid the expensive stop_machine()
call.
The new trampolines are split in 3 stages:
- A first stage is a 6-byte relative conditional long branch located at
each function's entry point. Its offset always points to the second
stage for the corresponding function, and its mask is either all 0s
(ftrace off) or all 1s (ftrace on). The code for flipping the mask is
borrowed from ftrace_{enable,disable}_ftrace_graph_caller. After
flipping, ftrace_arch_code_modify_post_process() syncs with all the
other CPUs by sending SIGPs.
- Second stages for vmlinux are stored in a separate part of the .text
section reserved by the linker script, and in dynamically allocated
memory for modules. This prevents the icache pollution. The total
size of second stages is about 1.5% of that of the kernel image.
Putting second stages in the .bss section is possible and decreases
the size of the non-compressed vmlinux, but splits the kernel 1:1
mapping, which is a bad tradeoff.
Each second stage contains a call to the third stage, a pointer to
the part of the intercepted function right after the first stage, and
a pointer to an interceptor function (e.g. ftrace_caller).
Second stages are 8-byte aligned for the future direct calls
implementation.
- There are only two copies of the third stage: in the .text section
for vmlinux and in dynamically allocated memory for modules. It can be
an expoline, which is relatively large, so inlining it into each
second stage is prohibitively expensive.
As a result of this organization, phoronix-test-suite with ftrace off
does not show any performance degradation.
Suggested-by: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Vasily Gorbik <gor@linux.ibm.com>
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20210728212546.128248-3-iii@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 7f16d7e787 ("s390: show virtualization support in /proc/cpuinfo")
introduced special handling for sie capability, saying this should not be
exposed via hwcaps, without giving a reason.
However this leads to an inconsistent /proc/cpuinfo features line
where all features except the sie capability are also present in
hwcaps. I really don't see a reason to not add that to hwcaps - it
might be quite pointless, but at least this way it is possible to get
rid of some special handling.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove the not so obvious "(elf_hwcap & (1UL << 2)" which only checks
if stfle is available. This used to be required for old code before
test_facility() was introduced. test_facility() will do the right
thing, regardless if stfle is available or not.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove a leftover from the common 31/64 bit code. z/Architecture mode
is now always active, there is no need to check.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The first six hwcap bits are initialized in a rather odd way: an array
contains the stfl(e) bits which need to be set, so that the
corresponding bit position (= array index) within hwcaps are set.
Better open code it like it is done for all other bits, making it
obvious which bit is set when.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
setup_hwcaps() is a quite large function. Make it smaller by moving
the elf platform setup code into an independent setup function.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Move setup_hwcaps() to processor.c for two reasons:
- make setup.c a bit smaller
- have allmost all of the hwcap code in one file
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add BUILD_BUG_ON() sanity checks to make sure the hwcap string array
contains a string for each hwcap.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Remove s390 part of all HWCAP defines, just to make them shorter and
easier to handle. The namespace is anyway per architecture.
This is similar to what arm64 has.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In order to support the use of enhanced PCI instructions in both kernel-
and userspace we need both hardware support and proper setup in the
kernel. The latter can be toggled off with the pci=nomio command line
option.
Thus availability of this feature in userspace depends on all of kernel
configuration (CONFIG_PCI), hardware support and the current kernel
command line and can thus not rely solely on a facility bit. Instead
let's introduce a new ELF hardware capability bit HWCAP_S390_PCI_MIO to
tell userspace whether these PCI instructions can be used.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Kernel support for the newer PCI mio instructions can be toggled off
with the pci=nomio command line option which needs to integrate with
common code PCI option parsing. However this option then toggles static
branches which can't be toggled yet in an early_param() call.
Thus commit 9964f396f1 ("s390: fix setting of mio addressing control")
moved toggling the static branches to the PCI init routine.
With this setup however we can't check for mio support outside the PCI
code during early boot, i.e. before switching the static branches, which
we need to be able to export this as an ELF HWCAP.
Improve on this by turning mio availability into a machine flag that
gets initially set based on CONFIG_PCI and the facility bit and gets
toggled off if pci=nomio is found during PCI option parsing allowing
simple access to this machine flag after early init.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add hardware capability bits and feature tags to /proc/cpuinfo
for NNPA and Vector-Packed-Decimal-Enhancement Facility 2.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is no useful information within [STARTUP_NORMAL_OFFSET, HEAD_END] now.
But the memory region [0, STARTUP_NORMAL_OFFSET] is used by:
* lowcore
* kdump for swapping memory
* stand-alone zipl dumpers for code, data, stack and heap
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
This change simplifies the task of making the decompressor relocatable.
The decompressor's image contains special DMA sections between _sdma and
_edma. This DMA segment is loaded at boot as part of the decompressor and
then simply handed over to the decompressed kernel. The decompressor itself
never uses it in any way. The primary reason for this is the need to keep
the aforementioned DMA segment below 2GB which is required by architecture,
and because the decompressor is always loaded at a fixed low physical
address, it is guaranteed that the DMA region will not cross the 2GB
memory limit. If the DMA region had been placed in the decompressed kernel,
then KASLR would make this guarantee impossible to fulfill or it would
be restricted to the first 2GB of memory address space.
This commit moves all DMA sections between _sdma and _edma from
the decompressor's image to the decompressed kernel's image. The complete
DMA region is placed in the init section of the decompressed kernel and
immediately relocated below 2GB at start-up before it is needed by other
parts of the decompressed kernel. The relocation of the DMA region happens
even if the decompressed kernel is already located below 2GB in order
to keep the first implementation simple. The relocation should not have
any noticeable impact on boot time because the DMA segment is only a couple
of pages.
After relocating the DMA sections, the kernel has to fix all references
which point into it. In order to automate this, place all variables
pointing into the DMA sections in a special .dma.refs section. All such
variables must be defined using the new __dma_ref macro. Only variables
containing addresses within the DMA sections must be placed in the new
.dma.refs section.
Furthermore, move the initialization of control registers from
the decompressor to the decompressed kernel because some control registers
reference tables that must be placed in the DMA data section to
guarantee that their addresses are below 2G. Because the decompressed
kernel relocates the DMA sections at startup, the content of control
registers CR2, CR5 and CR15 must be updated with new addresses after
the relocation. The decompressed kernel initializes all control registers
early at boot and then updates the content of CR2, CR5 and CR15
as soon as the DMA relocation has occurred. This practically reverts
the commit a80313ff91 ("s390/kernel: introduce .dma sections").
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
To reduce duplication, replace error-prone and hard-coded parameter area
offsets with auto-generated ones.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The new boot data struct shall replace global variables OLDMEM_BASE and
OLDMEM_SIZE. It is initialized in the decompressor and passed
to the decompressed kernel. In comparison to the old solution, this one
doesn't access data at fixed physical addresses which will become important
when the decompressor becomes relocatable.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The new boot data struct shall replace global variables INITRD_START and
INITRD_SIZE. It is initialized in the decompressor and passed
to the decompressed kernel. In comparison to the old solution, this one
doesn't access data at fixed physical addresses which will become important
when the decompressor becomes relocatable.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
De-duplicate checks for Protected Host Virtualization in decompressor and
kernel.
Set prot_virt_host=0 in the decompressor in *any* of the following cases
and hand it over to the decompressed kernel:
* No explicit prot_virt=1 is given on the kernel command-line
* Protected Guest Virtualization is enabled
* Hardware support not present
* kdump or stand-alone dump
The decompressed kernel needs to use only is_prot_virt_host() instead of
performing again all checks done by the decompressor.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
In case of a jump label print the real address of the piece of code
where a mismatch was detected. This is right before the system panics,
so there is nothing revealed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
* The linux/spinlock.h header was included indirectly by the decompressor
and brought unnecessary build dependencies.
* Use proper includes in files which either directly or indirectly included
cio.h and were hidden until now by the included linux/spinlock.h, e.g.
linux/string.h for memcpy() or asm/page.h for PAGE_SIZE.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The bdflush system call has been deprecated for a very long time.
Recently Michael Schmitz tested[1] and found that the last known
caller of of the bdflush system call is unaffected by it's removal.
Since the code is not needed delete it.
[1] https://lkml.kernel.org/r/36123b5d-daa0-6c2b-f2d4-a942f069fd54@gmail.com
Link: https://lkml.kernel.org/r/87sg10quue.fsf_-_@disp2133
Tested-by: Michael Schmitz <schmitzmic@gmail.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This fixes a permanent rebuild of the 32 bit vdso. The RPM build process
was first calling 'make bzImage' and 'make modules' as a second step.
This caused a recompilation of vdso32.so, which in turn also changed
the build-id of vmlinux.
Fixes: 779df22487 ("s390/vdso: add minimal compat vdso")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
- Fix preempt_count initialization.
- Rework call_on_stack() macro to add proper type handling and avoid
possible register corruption.
- More error prone "register asm" removal and fixes.
- Fix syscall restarting when multiple signals are coming in. This adds
minimalistic trampolines to vdso so we can return from signal without
using the stack which requires pgm check handler hacks when NX is
enabled.
- Remove HAVE_IRQ_EXIT_ON_IRQ_STACK since this is no longer true after
switch to generic entry.
- Fix protected virtualization secure storage access exception handling.
- Make machine check C handler always enter with DAT enabled and move
register validation to C code.
- Fix tinyconfig boot problem by avoiding MONITOR CALL without CONFIG_BUG.
- Increase asm symbols alignment to 16 to make it consistent with
compilers.
- Enable concurrent access to the CPU Measurement Counter Facility.
- Add support for dynamic AP bus size limit and rework ap_dqap to deal
with messages greater than recv buffer.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmDpo78ACgkQjYWKoQLX
FBhapAf/UkwhYLTrEEOSmrWQaJHyOkdxcexksBPvh/GEQvDkHixHROeytRb5SBMD
JXGJsWN3g9DjiE9DLFGSK2eQfKYFoPaXWv3Tf+QIPvEwvODKZHRLVJU1VTXiF4EA
+Xu4Q0o1VtX8ashEN0IHjBycPY9l+v9Ncb2erKG+CzoRa1r5C+A+1E0sLtsNN9TW
AV+57ATCc0pgNkUfzVoG2S/QdSY+fpMWtjoPeCj/DNCpFq1eRTMRGolTsdnqGpDx
ueAozHW9Pc9TQXeWpOVM7ZgWRWJThCC4vj2W/B7ygCuyJh4N0y2rthFEg3tWUf0f
+X6KDXeijFs4Gf0ewYLZjZCpBT8/NA==
=EgkG
-----END PGP SIGNATURE-----
Merge tag 's390-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix preempt_count initialization.
- Rework call_on_stack() macro to add proper type handling and avoid
possible register corruption.
- More error prone "register asm" removal and fixes.
- Fix syscall restarting when multiple signals are coming in. This adds
minimalistic trampolines to vdso so we can return from signal without
using the stack which requires pgm check handler hacks when NX is
enabled.
- Remove HAVE_IRQ_EXIT_ON_IRQ_STACK since this is no longer true after
switch to generic entry.
- Fix protected virtualization secure storage access exception
handling.
- Make machine check C handler always enter with DAT enabled and move
register validation to C code.
- Fix tinyconfig boot problem by avoiding MONITOR CALL without
CONFIG_BUG.
- Increase asm symbols alignment to 16 to make it consistent with
compilers.
- Enable concurrent access to the CPU Measurement Counter Facility.
- Add support for dynamic AP bus size limit and rework ap_dqap to deal
with messages greater than recv buffer.
* tag 's390-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
s390: preempt: Fix preempt_count initialization
s390/linkage: increase asm symbols alignment to 16
s390: rename CALL_ON_STACK_NORETURN() to call_on_stack_noreturn()
s390: add type checking to CALL_ON_STACK_NORETURN() macro
s390: remove old CALL_ON_STACK() macro
s390/softirq: use call_on_stack() macro
s390/lib: use call_on_stack() macro
s390/smp: use call_on_stack() macro
s390/kexec: use call_on_stack() macro
s390/irq: use call_on_stack() macro
s390/mm: use call_on_stack() macro
s390: introduce proper type handling call_on_stack() macro
s390/irq: simplify on_async_stack()
s390/irq: inline do_softirq_own_stack()
s390/irq: simplify do_softirq_own_stack()
s390/ap: get rid of register asm in ap_dqap()
s390: rename PIF_SYSCALL_RESTART to PIF_EXECVE_PGSTE_RESTART
s390: move restart of execve() syscall
s390/signal: remove sigreturn on stack
s390/signal: switch to using vdso for sigreturn and syscall restart
...
S390's init_idle_preempt_count(p, cpu) doesn't actually let us initialize the
preempt_count of the requested CPU's idle task: it unconditionally writes
to the current CPU's. This clearly conflicts with idle_threads_init(),
which intends to initialize *all* the idle tasks, including their
preempt_count (or their CPU's, if the arch uses a per-CPU preempt_count).
Unfortunately, it seems the way s390 does things doesn't let us initialize
every possible CPU's preempt_count early on, as the pages where this
resides are only allocated when a CPU is brought up and are freed when it
is brought down.
Let the arch-specific code set a CPU's preempt_count when its lowcore is
allocated, and turn init_idle_preempt_count() into an empty stub.
Fixes: f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://lore.kernel.org/r/20210707163338.1623014-1-valentin.schneider@arm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Lower case matches the call_on_stack() macro and is easier to read.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make on_async_stack() a bit more readable, even though as usual it
depends if one considers "!!!" readable or not.
At least the new construct to check if the async stack is in use or
not is a bit shorter and generates slightly better code.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move do_softirq_own_stack() to proper header file so it can be
inlined; saving a few cycles.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
do_softirq_own_stack() is always called from task context and
therefore it is not necessary to check if the async stack is
currently used.
Remove the check and directly switch to async stack.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
PIF_SYSCALL_RESTART is now only used to restart execve when loading
PGSTE binaries. Rename the flag to reflect that, and avoid people
thinking that this bit has anything to do with generic syscall
restarting.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
On s390, execve might have to be restarted for PGSTE binaries
like kvm. In the past this was done via the PIF_SYSCALL_RESTART
bit. However, with the recent changes, syscalls are now restarted
differently. Now that execve() is the only call that might get
restarted via PIF_SYSCALL_RESTART, move the loop to do_syscall().
This also has the advantage that the restart is no longer visible
to userspace.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
{rt_}sigreturn is now called from the vdso, so we no longer
need the svc on the stack, and therefore no hack to support that
mechanism on machines with non-executable stack.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
with generic entry, there's a bug when it comes to restarting of signals.
The failing sequence is:
a) a signal is coming in, and no handler is registered, so the lower
part of arch_do_signal_or_restart() in arch/s390/kernel/signal.c
sets PIF_SYSCALL_RESTART.
b) a second signal gets pending while the kernel is still in the exit
loop, and for that one, a handler exists.
c) The first part of arch_do_signal_or_restart() is called. That part
calls handle_signal(), which sets up stack + registers for handling
the signal.
d) __do_syscall() in arch/s390/kernel/syscall.c checks for
PIF_SYSCALL_RESTART right before leaving to userspace. If it is set,
it restart's the syscall. However, the registers are already setup
for handling a signal from c). The syscall is now restarted with the
wrong arguments.
Change the code to:
- use vdso for syscall_restart() instead of PIF_SYSCALL_RESTART because
we cannot rewind and go back to userspace on s390 because the system call
number might be encoded in the svc instruction.
- for all other syscalls we rewind the PSW and return to userspace.
Cc: <stable@kernel.org> # v5.12+ d57778feb9: s390/vdso: always enable vdso
Cc: <stable@kernel.org> # v5.12+ 686341f254: s390/vdso64: add sigreturn,rt_sigreturn and restart_syscall
Cc: <stable@kernel.org> # v5.12+ 43e1f76b0b: s390/vdso: rename VDSO64_LBASE to VDSO_LBASE
Cc: <stable@kernel.org> # v5.12+ 779df22487: s390/vdso: add minimal compat vdso
Cc: <stable@kernel.org> # v5.12+
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add a small vdso for 31 bit compat application that provides
trampolines for calls to sigreturn,rt_sigreturn,syscall_restart.
This is requird for moving these syscalls away from the signal
frame to the vdso. Note that this patch effectively disables
CONFIG_COMPAT when using clang to compile the kernel. clang
doesn't support 31 bit mode.
We want to redirect sigreturn and restart_syscall to the vdso. However,
the kernel cannot parse the ELF vdso file, so we need to generate header
files which contain the offsets of the syscall instructions in the vdso
page.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Will be used by both vdso32 and vdso64, so change the name.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add minimalistic trampolines to vdso64 so we can return from signal
without using the stack which requires pgm check handler hacks when
NX is enabled.
restart_syscall will be called from vdso to work around the architectural
limitation that the syscall number might be encoded in the svc instruction,
and therefore can not be changed.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With the upcoming move of the svc sigreturn instruction from
the signal frame to vdso we need to have vdso always enabled.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
tinyconfig fails to boot, because without CONFIG_BUG report_bug()
always returns BUG_TRAP_TYPE_BUG, which causes mc 0,0 in
test_monitor_call() to panic. Fix by skipping the test without
CONFIG_BUG.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Commit cf6acb8bdb ("s390/cpumf: Add support for complete counter set extraction")
allows access to the CPU Measurement Counter Facility via character
device /dev/hwctr. The access was exclusive via this device or
via perf_event_open() system call. Only one path at a time was
permitted. The CPU Measurement Counter Facility device driver blocked
access to other processes.
This patch removes this restriction and allows concurrent access to
the CPU Measurement Counter Facility from multiple processes at the same
time via perf_event_open() SVC and via /dev/hwctr device. The access
via /dev/hwctr device is still exclusive, only one process is allowed to
access this device.
This patch
- moves the /dev/hwctr device access from file perf_cpum_cf_diag.c.
to file perf_cpum_cf.c.
- use only one trace buffer .../s390dbf/cpum_cf.
- remove cfset_csd structure and includes its members it into the
structure cpu_cf_events. This results in one data structure and
simplifies the access.
- rework function familiy ctr_set_enable, ctr_set_disable, ctr_set_start
and ctr_set_stop which operate on a counter set number.
Now they operate on a counter set bit mask.
- move CF_DIAG event functionality to file perf_cpum_cf.c. It now
contains the complete functionality of the CPU Measurement Counter
Facility:
- Performance measurement support for counters using perf stat.
- Support for complete counter set extraction with device /dev/hwctr.
- Support for counter set extraction event CF_DIAG attached to
samples using perf record.
- removes file perf_cpum_cf_diag.c
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This update partially reverts commit 3037a52f98 ("s390/nmi:
do register validation as early as possible").
Storage error checks and control registers validation are left
in the assembler code, since correct ASCEs and page tables are
required to enable DAT - which is done before the C handler is
entered.
System damage, kernel instruction address and PSW MWP checks
are left in the assembler code as well, since there is no way
to proceed if one of these checks is failed.
The getcpu vdso syscall reads CPU number from the programmable
field of the TOD clock. Disregard the TOD programmable register
validity bit and load the CPU number into the TOD programmable
field unconditionally.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The magic string "S390EP" at offset 0x10008 indicated to the decompressed
kernel that it was booted by the decompressor. Introduce a new bootdata
flag instead which conveys the same information in an explicit and
a cleaner way. But keep the magic string because it is a kernel ABI.
Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current storage errors tackling is wrong - the DAT is
enabled in assembler code before the actual storage checks
in C half are executed. In case the page tables themselves
are damaged such approach is not going to work.
With this update unrecoverable storage errors are not
passed to C code for handling, but rather the machine
is stopped right away. The only exception to this flow
is when a machine check occurred in KVM guest - in this
case the errors are reinjected by the handler.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The machine check handler must be entered with DAT disabled
in case control registers are corrupted or a storage error
happened and we can not tell if such error corresponds to a
page table.
Both of described conditions end up in stopping all CPUs and
entering the disabled wait in C half of the handler. However,
the storage errors are still checked after the DAT is enabled
and C code is entered. In case a page table is damaged such
flow is not expected to work.
This update paves the way for moving the storage error checks
from C to assembler half. All fatal errors that can only be
handled with DAT disabled are handled in assembler half also.
As result, the C half is only entered if the DAT is secured.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In case of the !CONFIG_KVM use "jz" instead of "jnz" when
detecting user mode and get rid of unnecessary jump as result.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Christia Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Factor out SIEEXIT macro and use it instead of cleanup_sie
routine. As a side effect %r13 and %r14 are spared.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Christia Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Turns out that the bit 61 in the TEID is not always 1 and if that's
the case the address space ID and the address are
unpredictable. Without an address and its address space ID we can't
export memory and hence we can only send a SIGSEGV to the process or
panic the kernel depending on who caused the exception.
Unfortunately bit 61 is only reliable if we have the "misc" UV feature
bit.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Fixes: 084ea4d611 ("s390/mm: add (non)secure page access exceptions handlers")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
- Rework inline asm to get rid of error prone "register asm" constructs,
which are problematic especially when code instrumentation is enabled. In
particular introduce and use register pair union to allocate even/odd
register pairs. Unfortunately this breaks compatibility with older
clang compilers and minimum clang version for s390 has been raised to 13.
https://lore.kernel.org/linux-next/CAK7LNARuSmPCEy-ak0erPrPTgZdGVypBROFhtw+=3spoGoYsyw@mail.gmail.com/
- Fix gcc 11 warnings, which triggered various minor reworks all over
the code.
- Add zstd kernel image compression support.
- Rework boot CPU lowcore handling.
- De-duplicate and move kernel memory layout setup logic earlier.
- Few fixes in preparation for FORTIFY_SOURCE performing compile-time
and run-time field bounds checking for mem functions.
- Remove broken and unused power management support leftovers in s390
drivers.
- Disable stack-protector for decompressor and purgatory to fix buildroot
build.
- Fix vt220 sclp console name to match the char device name.
- Enable HAVE_IOREMAP_PROT and add zpci_set_irq()/zpci_clear_irq() in
zPCI code.
- Remove some implausible WARN_ON_ONCEs and remove arch specific counter
transaction call backs in favour of default transaction handling in
perf code.
- Extend/add new uevents for online/config/mode state changes of
AP card / queue device in zcrypt.
- Minor entry and ccwgroup code improvements.
- Other small various fixes and improvements all over the code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmDhuTEACgkQjYWKoQLX
FBjVlggAgDFBkDjlyfvrm4xzmHi7BJMmhrTJIONsSz+3tcA4/u5kE+Hrdrqxm0Uh
ZH4MXBxn4q4Fmoomhu5w5ZDe8o2ip0aN9fFNdsBoP8hurmQbL/IbdTnBETKMrKpV
XpogU2G7p+2nQ0+9+o6PS/vWlZhI88NVh8dWyRd2+5/XdMycgLv2Qm7NpQoACVw1
CbUvxP2PlpZ0wltLvNBKPg1xXMZa3GS0wbVUsS2jiWcr/3VzCqfTHenZJ/RadoE6
axG99QXCbLDMsJgVQcXtlI8K6Z461fAwbNtWZWC+Uq7o5pYuUFW1dovMg9WWF+7T
lFNqXyyNy5wwITRkvuzjlVTE8yzYYg==
=ADZ4
-----END PGP SIGNATURE-----
Merge tag 's390-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Rework inline asm to get rid of error prone "register asm"
constructs, which are problematic especially when code
instrumentation is enabled.
In particular introduce and use register pair union to allocate
even/odd register pairs. Unfortunately this breaks compatibility with
older clang compilers and minimum clang version for s390 has been
raised to 13.
https://lore.kernel.org/linux-next/CAK7LNARuSmPCEy-ak0erPrPTgZdGVypBROFhtw+=3spoGoYsyw@mail.gmail.com/
- Fix gcc 11 warnings, which triggered various minor reworks all over
the code.
- Add zstd kernel image compression support.
- Rework boot CPU lowcore handling.
- De-duplicate and move kernel memory layout setup logic earlier.
- Few fixes in preparation for FORTIFY_SOURCE performing compile-time
and run-time field bounds checking for mem functions.
- Remove broken and unused power management support leftovers in s390
drivers.
- Disable stack-protector for decompressor and purgatory to fix
buildroot build.
- Fix vt220 sclp console name to match the char device name.
- Enable HAVE_IOREMAP_PROT and add zpci_set_irq()/zpci_clear_irq() in
zPCI code.
- Remove some implausible WARN_ON_ONCEs and remove arch specific
counter transaction call backs in favour of default transaction
handling in perf code.
- Extend/add new uevents for online/config/mode state changes of AP
card / queue device in zcrypt.
- Minor entry and ccwgroup code improvements.
- Other small various fixes and improvements all over the code.
* tag 's390-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (91 commits)
s390/dasd: use register pair instead of register asm
s390/qdio: get rid of register asm
s390/ioasm: use symbolic names for asm operands
s390/ioasm: get rid of register asm
s390/cmf: get rid of register asm
s390/lib,string: get rid of register asm
s390/lib,uaccess: get rid of register asm
s390/string: get rid of register asm
s390/cmpxchg: use register pair instead of register asm
s390/mm,pages-states: get rid of register asm
s390/lib,xor: get rid of register asm
s390/timex: get rid of register asm
s390/hypfs: use register pair instead of register asm
s390/zcrypt: Switch to flexible array member
s390/speculation: Use statically initialized const for instructions
virtio/s390: get rid of open-coded kvm hypercall
s390/pci: add zpci_set_irq()/zpci_clear_irq()
scripts/min-tool-version.sh: Raise minimum clang version to 13.0.0 for s390
s390/ipl: use register pair instead of register asm
s390/mem_detect: fix tprot() program check new psw handling
...
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k
QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2
Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ
P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru
62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq
SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu
+u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw==
=tn0h
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull misc fs updates from Jan Kara:
"The new quotactl_fd() syscall (remake of quotactl_path() syscall that
got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
isofs, and writeback fixes and cleanups"
* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
writeback: fix obtain a reference to a freeing memcg css
quota: remove unnecessary oom message
isofs: remove redundant continue statement
quota: Wire up quotactl_fd syscall
quota: Change quotactl_path() systcall to an fd-based one
reiserfs: Remove unneed check in reiserfs_write_full_page()
udf: Fix NULL pointer dereference in udf_symlink function
reiserfs: add check for invalid 1st journal block
free_insn_page() in x86 and s390 is same with the common weak function in
kernel/kprobes.c. Plus, the comment "Recover page to RW mode before
releasing it" in x86 seems insensible to be there since resetting mapping
is done by common code in vfree() of module_memfree(). So drop these two
duplicated strong functions and related comment, then mark the common one
in kernel/kprobes.c strong.
Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Qi Liu <liuqi115@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow
the flexible utilization of SMT siblings, without exposing
untrusted domains to information leaks & side channels, plus
to ensure more deterministic computing performance on SMT
systems used by heterogenous workloads.
There's new prctls to set core scheduling groups, which
allows more flexible management of workloads that can share
siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve
'memcache'-like workloads.
- "Age" (decay) average idle time, to better track & improve workloads
such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked
via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable
it at runtime if tooling needs it. Use static keys and
other optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
+U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
UmG7bt94Trk=
=3VDr
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler udpates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.
There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.
- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogenous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaxMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hPgw//f9SnGzFoP1uR5TBqM8j/QHulMewew/iD
dM5lh2emdmqHWYPBeRxUHgag38K2Golr3Y+NxLA3R+RMx+OZQe8Mz/wYvPQcBvsV
k1HHImU3GRMn4GM7GwxH3vPIottDUx3mNS2J6pzlw3kwRUVqrxUdj/0/pSY/4eJ7
ZT4uq4yLV83Jd3qioU7o7e/u6MrdNIIcAXRpVDdE9Mm1+kWXSVN7/h3Vsiz4tj5E
iS+UXEtSc1a2mnmekv63pYkJHHNUb6guD8jgI/wrm1KIFGjDRifM+3TV6R/kB96/
TfD2LhCcTShfSp8KI191pgV7/NQbB/PmLdSYmff3rTBiii4cqXuCygJCHInZ09z0
4fTSSqM6aHg7kfTQyOCp+DUQ+9vNVXWo8mxt9c6B8xA0GyCI3zhjQ4UIiSUWRpjs
Be5ZyF0kNNuPxYrKFnGnBf8+51DURpCz3sDdYRuK4KNkj1+4ZvJo/KzGTMUUIE4B
IDQG6wDP5Kb388eRDtKrG5X7IXg+L5F/kezin60j0QF5MwDgxirT217teN8H1lNn
YgWMjRK8Tw0flUJsbCxa51/nl93UtByB+fIRIc88MSeLxcI6/ORW+TxBBEqkYm5Z
6BLFtmHSuAqAXUuyZXSGLcW7XLJvIaDoHgvbDn6l4g7FMWHqPOIq6nJQY3L8ben2
e+fQrGh4noI=
=20Vc
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogenous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix task context PMU for Hetero
perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
perf/x86/intel: Fix fixed counter check warning for some Alder Lake
perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
kprobes: Do not increment probe miss count in the fault handler
x86,kprobes: WARN if kprobes tries to handle a fault
kprobes: Remove kprobe::fault_handler
uprobes: Update uprobe_write_opcode() kernel-doc comment
perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
perf/core: Fix DocBook warnings
perf/core: Make local function perf_pmu_snapshot_aux() static
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
confusing the checks when using a static const source.
Move the static const array into a variable so the compiler can perform
appropriate bounds checking.
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210616201823.1245603-1-keescook@chromium.org
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current code doesn't clear the thread/group maps for offline
CPUs. This may cause kernel crashes like the one bewlow in common
code that assumes if a CPU has sibblings it is online.
Unable to handle kernel pointer dereference in virtual kernel address space
Call Trace:
[<000000013a4b8c3c>] blk_mq_map_swqueue+0x10c/0x388
([<000000013a4b8bcc>] blk_mq_map_swqueue+0x9c/0x388)
[<000000013a4b9300>] blk_mq_init_allocated_queue+0x448/0x478
[<000000013a4b9416>] blk_mq_init_queue+0x4e/0x90
[<000003ff8019d3e6>] loop_add+0x106/0x278 [loop]
[<000003ff801b8148>] loop_init+0x148/0x1000 [loop]
[<0000000139de4924>] do_one_initcall+0x3c/0x1e0
[<0000000139ef449a>] do_init_module+0x6a/0x2a0
[<0000000139ef61bc>] __do_sys_finit_module+0xa4/0xc0
[<0000000139de9e6e>] do_syscall+0x7e/0xd0
[<000000013a8e0aec>] __do_syscall+0xbc/0x110
[<000000013a8ee2e8>] system_call+0x78/0xa0
Fixes: 52aeda7acc ("s390/topology: remove offline CPUs from CPU topology masks")
Cc: <stable@kernel.org> # 5.7+
Reported-by: Marius Hillenbrand <mhillen@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The current irq entry code doesn't initialize pt_regs::flags. On exit to
user mode arch_do_signal_or_restart() tests whether PIF_SYSCALL is set,
which might yield wrong results.
Fix this by clearing pt_regs::flags in the entry.S irq handler
code.
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Fixes: 56e62a7370 ("s390: convert to generic entry")
Cc: <stable@vger.kernel.org> # 5.12
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
glibc complained with "The futex facility returned an unexpected error
code.". It turned out that the futex syscall returned -ERESTARTSYS because
a signal is pending. arch_do_signal_or_restart() restored the syscall
parameters (nameley regs->gprs[2]) and set PIF_SYSCALL_RESTART. When
another signal is made pending later in the exit loop
arch_do_signal_or_restart() is called again. This function clears
PIF_SYSCALL_RESTART and checks the return code which is set in
regs->gprs[2]. However, regs->gprs[2] was restored in the previous run
and no longer contains -ERESTARTSYS, so PIF_SYSCALL_RESTART isn't set
again and the syscall is skipped.
Fix this by not clearing PIF_SYSCALL_RESTART - it is already cleared in
__do_syscall() when the syscall is restarted.
Reported-by: Bjoern Walk <bwalk@linux.ibm.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Fixes: 56e62a7370 ("s390: convert to generic entry")
Cc: <stable@vger.kernel.org> # 5.12
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove register asm usage from diag8_noresponse() since it wasn't
needed at all. There is no requirement for even/odd register pairs for
diag 0x8.
For diag_response() use register pairs to fulfill the rx+1 and ry+1
requirements as required if a response buffer is specified. Also
change the inline asm to return the condition code of the diagnose
instruction and do the conditional handling of response length
calculation in C.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When read via debugfs, s390dbf debug-views print the kernel address of
the call-site that created a trace entry. The kernel's %p pointer
hashing feature obfuscates this address, and commit 860ec7c6e2
("s390/debug: use pK for kernel pointers") made this obfuscation
configurable via the kptr_restrict sysctl.
Obfuscation of kernel address data printed via s390dbf debug-views does
not add any additional protection since the associated debugfs files are
only accessible to the root user that typically has enough other means
to obtain kernel address data.
Also trace payload data may contain binary representations of kernel
addresses as part of logged data structues. Requiring such payload data
to be obfuscated as well would be impractical and greatly diminish the
use of s390dbf.
Therefore completely remove pointer obfuscation from s390dbf
debug-views.
Reviewed-by: Steffen Maier <maier@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since OLDMEM_BASE/OLDMEM_SIZE is already taken into consideration and is
reflected in ident_map_size. reserve/remove_oldmem() is no longer needed
and could be removed.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently there are two separate places where kernel memory layout has
to be known and adjusted:
1. early kasan setup.
2. paging setup later.
Those 2 places had to be kept in sync and adjusted to reflect peculiar
technical details of one another. With additional factors which influence
kernel memory layout like ultravisor secure storage limit, complexity
of keeping two things in sync grew up even more.
Besides that if we look forward towards creating identity mapping and
enabling DAT before jumping into uncompressed kernel - that would also
require full knowledge of and control over kernel memory layout.
So, de-duplicate and move kernel memory layout setup logic into
the decompressor.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
Introduce OUTSIDE macro that checks whether an instruction
address is inside or outside of a block of instructions.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
cleanup_sie_mcck label is called from a single location only
and thus does not need to be a subroutine. Move the labelled
code to the caller - by doing that the SIE critical section
checks appear next to each other and the SIE cleanup becomes
bit more readable.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Per-CPU pointer to lowcore is stored in global lowcore_ptr[]
array and duplicated in struct pcpu::lowcore member. This
update removes the redundancy.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Once the kernel is running the boot CPU lowcore becomes
freeable and does not differ from the secondary CPU ones
in any way. Make use of it and do not preserve the boot
CPU lowcore on unplugging. That allows returning unused
memory when the boot CPU is offline and makes the code
more clear.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The lowcore for IPL CPU is special. It is allocated early
in the boot process using memblock and never freed since.
The reason is pcpu_alloc_lowcore() and pcpu_free_lowcore()
routines use page allocator which is not available when
the IPL CPU is getting initialized.
Similar problem is already addressed for stacks - once the
virtual memory is available the early boot stacks get re-
allocated. Doing the same for lowcore will allow freeing
the IPL CPU lowcore and make no difference between the
boot and secondary CPUs.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since commit 9a965ea95135 ("s390/kexec_file: Simplify parmarea
access") we have struct parmarea which describes the layout of the
kernel parameter area.
Make the kernel parameter area available as global variable parmarea
of type struct parmarea, which allows to easily access its members.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Console name reported in /proc/consoles:
ttyS1 -W- (EC p ) 4:65
does not match the char device name:
crw--w---- 1 root root 4, 65 May 17 12:18 /dev/ttysclp0
so debian-installer inside a QEMU s390x instance gets confused and fails
to start with the following error:
steal-ctty: No such file or directory
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Link: https://lore.kernel.org/r/20210427194010.9330-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
gcc-11 warns:
arch/s390/kernel/traps.c: In function __do_pgm_check:
arch/s390/kernel/traps.c:319:17: warning: memcpy reading 256 bytes from a region of size 0 [-Wstringop-overread]
319 | memcpy(¤t->thread.trap_tdb, &S390_lowcore.pgm_tdb, 256);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by adding a struct pgm_tdb to struct lowcore and copy that.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
gcc-11 warns:
arch/s390/kernel/irq.c: In function do_ext_irq:
arch/s390/kernel/irq.c:175:9: warning: memcpy reading 4 bytes from a region of size 0 [-Wstringop-overread]
175 | memcpy(®s->int_code, &S390_lowcore.ext_cpu_addr, 4);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by adding a struct for int_code to struct lowcore.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
With gcc-11, there are a lot of warnings because the facility functions
are accessing lowcore through a null pointer. Fix this by moving the
facility arrays away from lowcore.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
arch/s390/kernel/syscall.c: In function __do_syscall:
arch/s390/kernel/syscall.c:147:9: warning: memcpy reading 64 bytes from a region of size 0 [-Wstringop-overread]
147 | memcpy(®s->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/s390/kernel/syscall.c:148:9: warning: memcpy reading 4 bytes from a region of size 0 [-Wstringop-overread]
148 | memcpy(®s->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by moving the gprs restore from C to assembly, and use a assignment
for int_code instead of memcpy.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove some WARN_ON_ONCE() warnings when a counter is started. Each
counter is installed function calls
event_sched_in() --> cpumf_pmu_add(..., PERF_EF_START).
This is done after the event has been created using
perf_pmu_event_init() which verifies the counter is valid.
Member hwc->config must be valid at this point.
Function cpumf_pmu_start(..., PERF_EF_RELOAD) is called from
function cpumf_pmu_add() for counter events. All other invocations of
cpumf_pmu_start(..., PERF_EF_RELOAD) are from the performance subsystem
for sampling events.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The command 'perf stat -e cycles ...' triggers the following function
sequence in the CPU Measurement Facility counter device driver:
perf_pmu_event_init()
__hw_perf_event_init()
validate_ctr_auth()
validate_ctr_version()
During event creation, the counter number is checked in functions
validate_ctr_auth() and validate_ctr_version() to verify it is a valid
counter and supported by the hardware. If this is not the case, both
functions return an error and the event is not created. System call
perf_event_open() returns an error in this case.
Later on the event is installed in the kernel event subsystem and the
driver functions cpumf_pmu_add() and cpumf_pmu_commit_txn() are called
to install the counter event by the hardware.
Since both events have been verified at event creation, there is no need
to re-evaluate the authorization state. This can not change since on
* LPARs the authorization change requires a restart of the LPAR (and
thus a reboot of the kernel)
* DPMs can not take resources away, just add them.
Also the sequence of CPU Measurement facility counter device driver
calls is
cpumf_pmu_start_txn
cpumf_pmu_add
cpumf_pmu_start
cpumf_pmu_commit_txn
for every single event. Which means the condition in cpumf_pmu_add()
is never met and validate_ctr_auth() is never called.
This leaves the counter device driver transaction functions with
just one task:
start_txn: Verify a transaction is not in flight and call
perf_pmu_disable()
cancel_txn, commit_txn: Verify a transaction is in flight and call
perf_pmu_enable()
The same functionality is provided by the default transaction handling
functions in kernel/events/core.c. Use those by removing the
counter device driver private call back functions.
Suggested-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Wrong condition check is used to decide if a machine check hit
while in KVM guest. As result of this check the instruction
following the SIE critical section might be considered as still
in KVM guest and _CIF_MCCK_GUEST CPU flag mistakenly set as
result.
Fixes: c929500d7a ("s390/nmi: s390: New low level handling for machine check happening in guest")
Cc: <stable@vger.kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The size of SIE critical section is calculated wrongly
as result of a missed subtraction in commit 0b0ed657fe
("s390: remove critical section cleanup from entry.S")
Fixes: 0b0ed657fe ("s390: remove critical section cleanup from entry.S")
Cc: <stable@vger.kernel.org>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Kprobes has a counter 'nmissed', that is used to count the number of
times a probe handler was not called. This generally happens when we hit
a kprobe while handling another kprobe.
However, if one of the probe handlers causes a fault, we are currently
incrementing 'nmissed'. The comment in fault handler indicates that this
can be used to account faults taken by the probe handlers. But, this has
never been the intention as is evident from the comment above 'nmissed'
in 'struct kprobe':
/*count the number of times this probe was temporarily disarmed */
unsigned long nmissed;
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lkml.kernel.org/r/20210601120150.672652-1-naveen.n.rao@linux.vnet.ibm.com
The reason for kprobe::fault_handler(), as given by their comment:
* We come here because instructions in the pre/post
* handler caused the page_fault, this could happen
* if handler tries to access user space by
* copy_from_user(), get_user() etc. Let the
* user-specified handler try to fix it first.
Is just plain bad. Those other handlers are ran from non-preemptible
context and had better use _nofault() functions. Also, there is no
upstream usage of this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210525073213.561116662@infradead.org
In commit fa8b90070a ("quota: wire up quotactl_path") we have wired up
new quotactl_path syscall. However some people in LWN discussion have
objected that the path based syscall is missing dirfd and flags argument
which is mostly standard for contemporary path based syscalls. Indeed
they have a point and after a discussion with Christian Brauner and
Sascha Hauer I've decided to disable the syscall for now and update its
API. Since there is no userspace currently using that syscall and it
hasn't been released in any major release, we should be fine.
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Sascha Hauer <s.hauer@pengutronix.de>
Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein
Signed-off-by: Jan Kara <jack@suse.cz>
As pointed out by commit
de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")
init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.
As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().
Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().
Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().
Secondary startups were patched via coccinelle:
@begone@
@@
-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
- add support for system call stack randomization.
- handle stale PCI deconfiguration events.
- couple of defconfig updates.
- some fixes and cleanups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmCURRgACgkQIg7DeRsp
bsLiVw//ThqXjgP7koJtawL0MFvSo1V69KTw1QNoMmUvrCynZ8nJlt4sHj1LIuEN
m7kHoWUsvNcg8r8QbxL1eZ2f/Qf43qrFjXIKi5iTdOtO/LF9NNNQYFnA3cT3h9oE
7hfycj8o5yi+KYY3Ca2HjlQ0i7zKYfPul1+Yms5h0nAgcvOXuPltVAlyYrrtddrM
cfpolZZd1IB/lMHSa8/qLviRB5ADlrNx4N6Y1ROeCPCWDbO8flrnDOPTDG8a8sCN
llQ0/vBTmenkGyT7UjG5bx9P/gX1FsMShBtyZMa8t8leIJfruDiwdo87wvSDf5IT
I612xdbLpMfGy6i/LnJHhnw61FkpwBKJZ3UrVVkrmjY8IVN8tVdAjy5s4Fplhgjj
BUbk9Ep03YCqfO6fpqh5DkBxCF0dnj4dZrcHA881/DnZuUkxpMJhyNjJDIx1OLup
PC+y9eILFAnDveFvhZJZeMpH7wAheyrW/WgKZsZNLYZ+61pKPyGn9RrE5UBgI7ra
CSIi9Km/lAuNCd9o4n5/5wCd3a9dW47kCrRe3S20oF57v5RU3AVtbC2YSUqPYahf
NR4ZgL+zDhByLPRVij5FJ1LLeaJJftKM9uUO9egHOk1JnSxDc9EyU41x4838SKYv
CfhQJw1ISTTRNCGeflE7+CBfEYCKX7h+DySVCtxTsg6PelcQHLs=
=C1HZ
-----END PGP SIGNATURE-----
Merge tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- add support for system call stack randomization
- handle stale PCI deconfiguration events
- couple of defconfig updates
- some fixes and cleanups
* tag 's390-5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: fix detection of vector enhancements facility 1 vs. vector packed decimal facility
s390/entry: add support for syscall stack randomization
s390/configs: change CONFIG_VIRTIO_CONSOLE to "m"
s390/cio: remove invalid condition on IO_SCH_UNREG
s390/cpumf: remove call to perf_event_update_userpage
s390/cpumf: move counter set size calculation to common place
s390/cpumf: beautify if-then-else indentation
s390/configs: enable CONFIG_PCI_IOV
s390/pci: handle stale deconfiguration events
s390/pci: rename zpci_configure_device()
The PoP documents:
134: The vector packed decimal facility is installed in the
z/Architecture architectural mode. When bit 134 is
one, bit 129 is also one.
135: The vector enhancements facility 1 is installed in
the z/Architecture architectural mode. When bit 135
is one, bit 129 is also one.
Looks like we confuse the vector enhancements facility 1 ("EXT") with the
Vector packed decimal facility ("BCD"). Let's fix the facility checks.
Detected while working on QEMU/tcg z14 support and only unlocking
the vector enhancements facility 1, but not the vector packed decimal
facility.
Fixes: 2583b848ca ("s390: report new vector facilities")
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20210503121244.25232-1-david@redhat.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEgycj0O+d1G2aycA8rZhLv9lQBTwFAmCInP4ACgkQrZhLv9lQ
BTza0g//dTeb9woC9H7qlEhK4l9yk62lTss60Q8X7m7ZSNfdL4tiEbi64SgK+iOW
OOegbrOEb8Kzh4KJJYmVlVZ5YUWyH4szgmee1wnylBdsWiWaPLPF3Cflz77apy6T
TiiBsJd7rRE29FKheaMt34B41BMh8QHESN+DzjzJWsFoi/uNxjgSs2W16XuSupKu
bpRmB1pYNXMlrkzz7taL05jndZYE5arVriqlxgAsuLOFOp/ER7zecrjImdCM/4kL
W6ej0R1fz2Geh6CsLBJVE+bKWSQ82q5a4xZEkSYuQHXgZV5eywE5UKu8ssQcRgQA
VmGUY5k73rfY9Ofupf2gCaf/JSJNXKO/8Xjg0zAdklKtmgFjtna5Tyg9I90j7zn+
5swSpKuRpilN8MQH+6GWAnfqQlNoviTOpFeq3LwBtNVVOh08cOg6lko/bmebBC+R
TeQPACKS0Q0gCDPm9RYoU1pMUuYgfOwVfVRZK1prgi2Co7ZBUMOvYbNoKYoPIydr
ENBYljlU1OYwbzgR2nE+24fvhU8xdNOVG1xXYPAEHShu+p7dLIWRLhl8UCtRQpSR
1ofeVaJjgjrp29O+1OIQjB2kwCaRdfv/Gq1mztE/VlMU/r++E62OEzcH0aS+mnrg
yzfyUdI8IFv1q6FGT9yNSifWUWxQPmOKuC8kXsKYfqfJsFwKmHM=
=uCN4
-----END PGP SIGNATURE-----
Merge tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull Landlock LSM from James Morris:
"Add Landlock, a new LSM from Mickaël Salaün.
Briefly, Landlock provides for unprivileged application sandboxing.
From Mickaël's cover letter:
"The goal of Landlock is to enable to restrict ambient rights (e.g.
global filesystem access) for a set of processes. Because Landlock
is a stackable LSM [1], it makes possible to create safe security
sandboxes as new security layers in addition to the existing
system-wide access-controls. This kind of sandbox is expected to
help mitigate the security impact of bugs or unexpected/malicious
behaviors in user-space applications. Landlock empowers any
process, including unprivileged ones, to securely restrict
themselves.
Landlock is inspired by seccomp-bpf but instead of filtering
syscalls and their raw arguments, a Landlock rule can restrict the
use of kernel objects like file hierarchies, according to the
kernel semantic. Landlock also takes inspiration from other OS
sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD
Pledge/Unveil.
In this current form, Landlock misses some access-control features.
This enables to minimize this patch series and ease review. This
series still addresses multiple use cases, especially with the
combined use of seccomp-bpf: applications with built-in sandboxing,
init systems, security sandbox tools and security-oriented APIs [2]"
The cover letter and v34 posting is here:
https://lore.kernel.org/linux-security-module/20210422154123.13086-1-mic@digikod.net/
See also:
https://landlock.io/
This code has had extensive design discussion and review over several
years"
Link: https://lore.kernel.org/lkml/50db058a-7dde-441b-a7f9-f6837fe8b69f@schaufler-ca.com/ [1]
Link: https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad0468cd@digikod.net/ [2]
* tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
landlock: Enable user space to infer supported features
landlock: Add user and kernel documentation
samples/landlock: Add a sandbox manager example
selftests/landlock: Add user space tests
landlock: Add syscall implementations
arch: Wire up Landlock syscalls
fs,security: Add sb_delete hook
landlock: Support filesystem access-control
LSM: Infrastructure management of the superblock
landlock: Add ptrace restrictions
landlock: Set up the security framework and manage credentials
landlock: Add ruleset and domain management
landlock: Add object management
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
This adds support for adding a random offset to the stack while handling
syscalls. The patch uses get_tod_clock_fast() as this is considered good
enough and has much less performance penalty compared to using
get_random_int(). The patch also adds randomization in pgm_check_handler()
as the sigreturn/rt_sigreturn system calls might be called from there.
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210429091451.1062594-1-svens@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The function cpumf_pmu_add and cpumf_pmu_del call function
perf_event_update_userpage(). This calls is obsolete, the calls add and
delete a counter event. Counter events do not sample data and the
event->rb member to access the sampling ring buffer is always NULL.
The function perf_event_update_userpage() simply returns in this case.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The function to calculate the size of counter sets is renamed from
cf_diag_ctrset_size() to cpum_cf_ctrset_size() and moved to the file
containing common functions for the CPU Measurement Counter Facility.
No functional change.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Beautify if-then-else indentation to match coding guideline.
Also use shorter pointer notation hwc instead of event->hw.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by : Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmCJU1UACgkQnJ2qBz9k
QNk62AgAgp05OIXU/AgObb7DvSyI3ycwCV8PeWBpwD8yoDAh5x0tmT7vnJu974p6
yHdnF7rr69ZzvbNCHLJ5kRykRlUao9W7cO5fdOW1uTpL7Ic60QuJMks/NfgVTHp1
2zIQmBDerfn1/LTK8r2pPGcvtcjRcr7Ep4beN0Duw57lfVMJhjsNRPnBbXGBcp0r
QzKk4/8V3DCZvOw+XNC3nto7avjvf+nU9sJmuh83546eqh0atjWivvO5aAlDOe6W
rhBiLlmP0in5u2n1fYqzI1OQvtgtleyEZT2G0CrbAZn0xjmV/if9wl+3K6TOwDvR
778xDEX7sZCaO/xkB+WK3hrd15ftKg==
=0kYE
-----END PGP SIGNATURE-----
Merge tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, ext2, reiserfs updates from Jan Kara:
- support for path (instead of device) based quotactl syscall
(quotactl_path(2))
- ext2 conversion to kmap_local()
- other minor cleanups & fixes
* tag 'for_v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fs/reiserfs/journal.c: delete useless variables
fs/ext2: Replace kmap() with kmap_local_page()
ext2: Match up ext2_put_page() with ext2_dotdot() and ext2_find_entry()
fs/ext2/: fix misspellings using codespell tool
quota: report warning limits for realtime space quotas
quota: wire up quotactl_path
quota: Add mountpath based quota support
- fix buffer size for in-kernel disassembler for ebpf programs.
- fix two memory leaks in zcrypt driver.
- expose PCI device UID as index, including an indicator if the uid is
unique.
- remove some oprofile leftovers.
- improve stack unwinder tests.
- don't use gcc atomic builtins anymore, just like all other
architectures. Even though I'm sure the current code is ok, I
totally dislike that s390 is the only architecture being special
here; especially considering that there was a lengthly discussion
about this topic and the outcome was not to use the builtins.
Therefore open-code atomic ops again with inline assembly and switch
to gcc builtins as soon as other architectures are doing.
- couple of other changes to atomic and cmpxchg, and use
atomic-instrumented.h for KASAN.
- separate zbus creation, registration, and scanning in our PCI code
which allows for cleaner and easier handling.
- a rather large change to the vfio-ap code to fix circular locking
dependencies when updating crypto masks.
- move QAOB handling from qdio layer down to drivers.
- add CRW inject facility to common I/O layer. This adds debugs files
which allow to generate artificial events from user space for
testing purposes.
- increase SCLP console line length from 80 to 320 characters to avoid
odd wrapped lines.
- add protected virtualization guest and host indication files, which
indicate either that a guest is running in pv mode or if the
hypervisor is capable of starting pv guests.
- various other small fixes and improvements all over the place.
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmCICNwACgkQIg7DeRsp
bsJgSQ/0Cn3KTymF8SJ7tXLpYBNHmZBL1sQ284pPNYmqluwephUaX643f/48RR5y
rlbaewyHCbYyR+gwkUykuUQm/d+iwihip/uFmyEktr9JutOOS1RQKd8ujeyE2BUb
aaDyE0J5VFYdd/ZA92n9FhkuNDOMZRFAK1SOfifT9jNWCl8iYz+pXV1Gx7LbJYVY
KWfJ9D/zgzLoOTWhj4jWu8LutfLEqK+hq5nqxBII8APCV/QDYnjkwpwW01LoMtOv
eHhtSz0JboFRk0FYf8oyR7AXQBz76+Ru3aivJNL7sr1S3N2yMSzNQbk/ATVBLER9
VMQX2TfGGT/Ln3P4rYEoP2vGITRn765wg4KWNB2u3pY2try12G39fmjzOwVfbxQw
BDAcLwxU7Tw0vJY+yI6ZWkPDXcs+uAWQwNiYoMtfUPfMYLEpLFffbWGwdZPKZRrH
fy4e5ZFuavZsZr8Zeu4WYILJZoDnvhbs59gPzaLBPKosR0ZGNi8q6bnztnqnrYhi
Oirt6aPOOyEoN/IT2bO1sDhIzIpCorIwCMZTRIQzerFRcjJgy0xHM9MRYRLMj6iW
xltgWNt01SbTm6pbimwMUnjN5AdMTbHlTSzD8G34eWO21cVgHGpndbcT/M3uemyy
Wf034Er2eFqlhXyhiAYTnNhVGbd+YMido7eo/CbGvqCxNvmKbA==
=e+mh
-----END PGP SIGNATURE-----
Merge tag 's390-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- fix buffer size for in-kernel disassembler for ebpf programs.
- fix two memory leaks in zcrypt driver.
- expose PCI device UID as index, including an indicator if the uid is
unique.
- remove some oprofile leftovers.
- improve stack unwinder tests.
- don't use gcc atomic builtins anymore, just like all other
architectures. Even though I'm sure the current code is ok, I totally
dislike that s390 is the only architecture being special here;
especially considering that there was a lengthly discussion about
this topic and the outcome was not to use the builtins. Therefore
open-code atomic ops again with inline assembly and switch to gcc
builtins as soon as other architectures are doing.
- couple of other changes to atomic and cmpxchg, and use
atomic-instrumented.h for KASAN.
- separate zbus creation, registration, and scanning in our PCI code
which allows for cleaner and easier handling.
- a rather large change to the vfio-ap code to fix circular locking
dependencies when updating crypto masks.
- move QAOB handling from qdio layer down to drivers.
- add CRW inject facility to common I/O layer. This adds debugs files
which allow to generate artificial events from user space for testing
purposes.
- increase SCLP console line length from 80 to 320 characters to avoid
odd wrapped lines.
- add protected virtualization guest and host indication files, which
indicate either that a guest is running in pv mode or if the
hypervisor is capable of starting pv guests.
- various other small fixes and improvements all over the place.
* tag 's390-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (53 commits)
s390/disassembler: increase ebpf disasm buffer size
s390/archrandom: add parameter check for s390_arch_random_generate
s390/zcrypt: fix zcard and zqueue hot-unplug memleak
s390/pci: expose a PCI device's UID as its index
s390/atomic,cmpxchg: always inline __xchg/__cmpxchg
s390/smp: fix do_restart() prototype
s390: get rid of oprofile leftovers
s390/atomic,cmpxchg: make constraints work with old compilers
s390/test_unwind: print test suite start/end info
s390/cmpxchg: use unsigned long values instead of void pointers
s390/test_unwind: add WARN if tests failed
s390/test_unwind: unify error handling paths
s390: update defconfigs
s390/spinlock: use R constraint in inline assembly
s390/atomic,cmpxchg: switch to use atomic-instrumented.h
s390/cmpxchg: get rid of gcc atomic builtins
s390/atomic: get rid of gcc atomic builtins
s390/atomic: use proper constraints
s390/atomic: move remaining inline assemblies to atomic_ops.h
s390/bitops: make bitops only work on longs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
+HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
/uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
=bO1i
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Borislav Petkov:
"Trivial cleanups and fixes all over the place"
* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove me from IDE/ATAPI section
x86/pat: Do not compile stubbed functions when X86_PAT is off
x86/asm: Ensure asm/proto.h can be included stand-alone
x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
x86/msr: Make locally used functions static
x86/cacheinfo: Remove unneeded dead-store initialization
x86/process/64: Move cpu_current_top_of_stack out of TSS
tools/turbostat: Unmark non-kernel-doc comment
x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
x86/fpu/math-emu: Fix function cast warning
x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
x86: Fix various typos in comments, take #2
x86: Remove unusual Unicode characters from comments
x86/kaslr: Return boolean values from a function returning bool
x86: Fix various typos in comments
x86/setup: Remove unused RESERVE_BRK_ARRAY()
stacktrace: Move documentation for arch_stack_walk_reliable() to header
x86: Remove duplicate TSC DEADLINE MSR definitions
New features:
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
- Alexandru is now a reviewer (not really a new feature...)
Fixes:
- Proper emulation of the GICR_TYPER register
- Handle the complete set of relocation in the nVHE EL2 object
- Get rid of the oprofile dependency in the PMU code (and of the
oprofile body parts at the same time)
- Debug and SPE fixes
- Fix vcpu reset
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCCpuAPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD2G8QALWQYeBggKnNmAJfuihzZ2WariBmgcENs2R2
qNZ/Py6dIF+b69P68nmgrEV1x2Kp35cPJbBwXnnrS4FCB5tk0b8YMaj00QbiRIYV
UXbPxQTmYO1KbevpoEcw8NmR4bZJ/hRYPuzcQG7CCMKIZw0zj2cMcBofzQpTOAp/
CgItdcv7at3iwamQatfU9vUmC0nDdnjdIwSxTAJOYMVV1ENwtnYSNgZVo4XLTg7n
xR/5Qx27PKBJw7GyTRAIIxKAzNXG2tDL+GVIHe4AnRp3z3La8sr6PJf7nz9MCmco
ISgeY7EGQINzmm4LahpnV+2xwwxOWo8QotxRFGNuRTOBazfARyAbp97yJ6eXJUpa
j0qlg3xK9neyIIn9BQKkKx4sY9V45yqkuVDsK6odmqPq3EE01IMTRh1N/XQi+sTF
iGrlM3ZW4AjlT5zgtT9US/FRXeDKoYuqVCObJeXZdm3sJSwEqTAs0JScnc0YTsh7
m30CODnomfR2y5X6GoaubbQ0wcZ2I20K1qtIm+2F6yzD5P1/3Yi8HbXMxsSWyYWZ
1ldoSa+ZUQlzV9Ot0S3iJ4PkphLKmmO96VlxE2+B5gQG50PZkLzsr8bVyYOuJC8p
T83xT9xd07cy+FcGgF9veZL99Y6BLHMa6ZwFUolYNbzJxqrmqyR1aiJMEBIcX+aP
ACeKW1w5
=fpey
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.13
New features:
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
- Alexandru is now a reviewer (not really a new feature...)
Fixes:
- Proper emulation of the GICR_TYPER register
- Handle the complete set of relocation in the nVHE EL2 object
- Get rid of the oprofile dependency in the PMU code (and of the
oprofile body parts at the same time)
- Debug and SPE fixes
- Fix vcpu reset
Funciton do_restart() is a callback invoked from the
restart CPU routine and passed a single parameter.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add conditional trap handlers similar to conditional system calls
(COND_SYSCALL), to reduce the number of ifdefs.
Trap handlers which may or may not exist depending on config options
are supposed to have a COND_TRAP entry, which redirects to
default_trap_handler() for non-existent trap handlers during link
time.
This allows to get rid of the secure execution trap handlers for the
!PGSTE case.
Reviewed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Convert the program check table to C. Which allows to get rid of yet
another assembler file, and also enables proper type checking for the
table.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Fixes: 37564ed834 ("s390/uv: add prot virt guest/host indication files")
Link: https://lore.kernel.org/r/2f7d62a4-3e75-b2b4-951b-75ef8ef59d16@huawei.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
* fixes:
s390/entry: save the caller of psw_idle
s390/entry: avoid setting up backchain in ext|io handlers
s390/setup: use memblock_free_late() to free old stack
s390/irq: fix reading of ext_params2 field from lowcore
s390/unwind: add machine check handler stack
s390/cpcmd: fix inline assembly register clobbering
MAINTAINERS: add backups for s390 vfio drivers
s390/vdso: fix initializing and updating of vdso_data
s390/vdso: fix tod_steering_delta type
s390/vdso: copy tod_steering_delta value to vdso_data page
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently psw_idle does not allocate a stack frame and does not
save its r14 and r15 into the save area. Even though this is valid from
call ABI point of view, because psw_idle does not make any calls
explicitly, in reality psw_idle is an entry point for controlled
transition into serving interrupts. So, in practice, psw_idle stack
frame is analyzed during stack unwinding. Depending on build options
that r14 slot in the save area of psw_idle might either contain a value
saved by previous sibling call or complete garbage.
[task 0000038000003c28] do_ext_irq+0xd6/0x160
[task 0000038000003c78] ext_int_handler+0xba/0xe8
[task *0000038000003dd8] psw_idle_exit+0x0/0x8 <-- pt_regs
([task 0000038000003dd8] 0x0)
[task 0000038000003e10] default_idle_call+0x42/0x148
[task 0000038000003e30] do_idle+0xce/0x160
[task 0000038000003e70] cpu_startup_entry+0x36/0x40
[task 0000038000003ea0] arch_call_rest_init+0x76/0x80
So, to make a stacktrace nicer and actually point for the real caller of
psw_idle in this frequently occurring case, make psw_idle save its r14.
[task 0000038000003c28] do_ext_irq+0xd6/0x160
[task 0000038000003c78] ext_int_handler+0xba/0xe8
[task *0000038000003dd8] psw_idle_exit+0x0/0x6 <-- pt_regs
([task 0000038000003dd8] arch_cpu_idle+0x3c/0xd0)
[task 0000038000003e10] default_idle_call+0x42/0x148
[task 0000038000003e30] do_idle+0xce/0x160
[task 0000038000003e70] cpu_startup_entry+0x36/0x40
[task 0000038000003ea0] arch_call_rest_init+0x76/0x80
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently when interrupt arrives to cpu while in kernel context
INT_HANDLER macro (used for ext_int_handler and io_int_handler)
allocates new stack frame and pt_regs on the kernel stack and
sets up the backchain to jump over the pt_regs to the frame which has
been interrupted. This is not ideal to two reasons:
1. This hides the fact that kernel stack contains interrupt frame in it
and hence breaks arch_stack_walk_reliable(), which needs to know that to
guarantee "reliability" and checks that there are no pt_regs on the way.
2. It breaks the backchain unwinder logic, which assumes that the next
stack frame after an interrupt frame is reliable, while it is not.
In some cases (when r14 contains garbage) this leads to early unwinding
termination with an error, instead of marking frame as unreliable
and continuing.
To address that, only set backchain to 0.
Fixes: 56e62a7370 ("s390: convert to generic entry")
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use memblock_free_late() to free the old machine check stack to the
buddy allocator instead of leaking it.
Fixes: b61b159512 ("s390: add stack for machine check handler")
Cc: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Li Wang reported that clock_gettime(CLOCK_MONOTONIC_RAW, ...) returns
incorrect values when time is provided via vdso instead of system call:
vdso_ts_nsec = 4484351380985507, vdso_ts.tv_sec = 4484351, vdso_ts.tv_nsec = 380985507
sys_ts_nsec = 1446923235377, sys_ts.tv_sec = 1446, sys_ts.tv_nsec = 923235377
Within the s390 specific vdso function __arch_get_hw_counter() reads
tod clock steering values from the arch_data member of the passed in
vdso_data structure.
Problem is that only for the CS_HRES_COARSE vdso_data arch_data is
initialized and gets updated. The CS_RAW specific vdso_data does not
contain any valid tod_clock_steering information, which explains the
different values.
Fix this by initializing and updating all vdso_datas.
Reported-by: Li Wang <liwang@redhat.com>
Tested-by: Li Wang <liwang@redhat.com>
Fixes: 1ba2d6c0fd ("s390/vdso: simplify __arch_get_hw_counter()")
Link: https://lore.kernel.org/linux-s390/YFnxr1ZlMIOIqjfq@osiris
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
When converting the vdso assembler code to C it was forgotten to
actually copy the tod_steering_delta value to vdso_data page.
Which in turn means that tod clock steering will not work correctly.
Fix this by simply copying the value whenever it is updated.
Fixes: 4bff8cb545 ("s390: convert to GENERIC_VDSO")
Cc: <stable@vger.kernel.org> # 5.10
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
prot_virt_host is only available if CONFIG_KVM is enabled. So lets use
a variable initialized to zero and overwrite it when that config
option is set with prot_virt_host.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Fixes: 37564ed834 ("s390/uv: add prot virt guest/host indication files")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Let's export the prot_virt_guest and prot_virt_host variables into the
UV sysfs firmware interface to make them easily consumable by
administrators.
prot_virt_host being 1 indicates that we did the UV
initialization (opt-in)
prot_virt_guest being 1 indicates that the UV indicates the share and
unshare ultravisor calls which is an indication that we are running as
a protected guest.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Commit 152e9b8676 ("s390/vtime: steal time exponential moving average")
inadvertently changed the input value for account_steal_time() from
"cputime_to_nsecs(steal)" to just "steal", resulting in broken increased
steal time accounting.
Fix this by changing it back to "cputime_to_nsecs(steal)".
Fixes: 152e9b8676 ("s390/vtime: steal time exponential moving average")
Cc: <stable@vger.kernel.org> # 5.1
Reported-by: Sabine Forkel <sabine.forkel@de.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The following BUG message was triggered repeatedly when complete counter
sets are extracted from the CPUMF:
BUG: using smp_processor_id() in preemptible [00000000]
code: psvc-readsets/7759
caller is cf_diag_needspace+0x2c/0x100
CPU: 7 PID: 7759 Comm: psvc-readsets Not tainted 5.12.0
Hardware name: IBM 3906 M03 703 (LPAR)
Call Trace:
[<00000000c7043f78>] show_stack+0x90/0xf8
[<00000000c705776a>] dump_stack+0xba/0x108
[<00000000c705d91c>] check_preemption_disabled+0xec/0xf0
[<00000000c63eb1c4>] cf_diag_needspace+0x2c/0x100
[<00000000c63ecbcc>] cf_diag_ioctl_start+0x10c/0x240
[<00000000c63ece9a>] cf_diag_ioctl+0x19a/0x238
[<00000000c675f3f4>] __s390x_sys_ioctl+0xc4/0x100
[<00000000c63ca762>] do_syscall+0x82/0xd0
[<00000000c705bdd8>] __do_syscall+0xc0/0xd8
[<00000000c706d532>] system_call+0x72/0x98
2 locks held by psvc-readsets/7759:
#0: 00000000c75a57c0 (cpu_hotplug_lock){++++}-{0:0},
at: cf_diag_ioctl+0x44/0x238
#1: 00000000c75a3078 (cf_diag_ctrset_mutex){+.+.}-{3:3},
at: cf_diag_ioctl+0x54/0x238
This issue is a missing get_cpu_ptr/put_cpu_ptr pair in function
cf_diag_needspace. Add it.
Fixes: cf6acb8bdb ("s390/cpumf: Add support for complete counter set extraction")
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Currently arch_stack_walk_reliable() is documented with an identical
comment in both x86 and S/390 implementations which is a bit redundant.
Move this to the header and convert to kerneldoc while we're at it.
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lkml.kernel.org/r/20210309194125.652-1-broonie@kernel.org
When we intercept a DIAG_9C from the guest we verify that the
target real CPU associated with the virtual CPU designated by
the guest is running and if not we forward the DIAG_9C to the
target real CPU.
To avoid a diag9c storm we allow a maximal rate of diag9c forwarding.
The rate is calculated as a count per second defined as a new
parameter of the s390 kvm module: diag9c_forwarding_hz .
The default value of 0 is to not forward diag9c.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Link: https://lore.kernel.org/r/1613997661-22525-2-git-send-email-pmorel@linux.ibm.com
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Remove the 60 seconds read interval limit. Do not impose any limit
at all and allow read of complete counter sets.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The cpumask being checked in cpu_group_map() must have at least one
cpu set; therefore remove the check.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Get rid of unsigned long long, and use unsigned long instead
everywhere. The usage of unsigned long long is a leftover from
31 bit kernel support.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA4JRkQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoWqD/9dbbqe8L701U6May1A/4hRsqL4THTA2flx
vNCNRBl6XV3l/wBCtL6waKy6tyO4lyM8XdUdEvo3Kxl2kGPb8eVfpyYL/+77HqyH
ctT4RMrs+84Mxn+5N6cM97hS1qVI2moTxxyvOEl/JTB7BYrutz9gvAoeY3/Dto47
J66oSaPeuqJ32TyihxfQHVxQopJcqFzDjyoYHGDu6ATio1PXfaIdTu8ywVYSECAh
pWI4rwnqdurGuHMNpxyL1bA6CT/jC7s+sqU7bUYUCgtYI3eG0u3V0bp5gAQQIgl9
5sxxE3DidYGAkYZsosrelshBtzGddLdz4Qrt2ungMYv8RsGNpFQ095jDPKDwFaZj
bSvSsfplCo7iFsJByb1TtpNEOW8eAwi81PmBDVQ9Oq5P5ygTYno9GBDc/20ql0Fk
q6wcX28coE3IBw44ne0hIwvBOtXV4WJyluG/gqOxfbTH+kOy3pDsN8lWcY/P4X0U
yzdU2MLHe8BNMyYlUiBF47Amzt4ltr85P4XD3WZ4bX71iwri6HvrdGWLuuKwX+Ie
66QiIDDQIYZQ6NMMJWS9DGW3y3DBizpSXGxONbOw1J2bQdNmtToR0D2UnK/9UnKp
msnvkUNk8fkYGS4aptpJ6HxbmjMEG5YtbiGlPj6fz5/7MTvhRjPxt7A0LWrUIdqR
f88+sHUMqg==
=oc8u
-----END PGP SIGNATURE-----
Merge tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block
Pull io_uring thread rewrite from Jens Axboe:
"This converts the io-wq workers to be forked off the tasks in question
instead of being kernel threads that assume various bits of the
original task identity.
This kills > 400 lines of code from io_uring/io-wq, and it's the worst
part of the code. We've had several bugs in this area, and the worry
is always that we could be missing some pieces for file types doing
unusual things (recent /dev/tty example comes to mind, userfaultfd
reads installing file descriptors is another fun one... - both of
which need special handling, and I bet it's not the last weird oddity
we'll find).
With these identical workers, we can have full confidence that we're
never missing anything. That, in itself, is a huge win. Outside of
that, it's also more efficient since we're not wasting space and code
on tracking state, or switching between different states.
I'm sure we're going to find little things to patch up after this
series, but testing has been pretty thorough, from the usual
regression suite to production. Any issue that may crop up should be
manageable.
There's also a nice series of further reductions we can do on top of
this, but I wanted to get the meat of it out sooner rather than later.
The general worry here isn't that it's fundamentally broken. Most of
the little issues we've found over the last week have been related to
just changes in how thread startup/exit is done, since that's the main
difference between using kthreads and these kinds of threads. In fact,
if all goes according to plan, I want to get this into the 5.10 and
5.11 stable branches as well.
That said, the changes outside of io_uring/io-wq are:
- arch setup, simple one-liner to each arch copy_thread()
implementation.
- Removal of net and proc restrictions for io_uring, they are no
longer needed or useful"
* tag 'io_uring-worker.v3-2021-02-25' of git://git.kernel.dk/linux-block: (30 commits)
io-wq: remove now unused IO_WQ_BIT_ERROR
io_uring: fix SQPOLL thread handling over exec
io-wq: improve manager/worker handling over exec
io_uring: ensure SQPOLL startup is triggered before error shutdown
io-wq: make buffered file write hashed work map per-ctx
io-wq: fix race around io_worker grabbing
io-wq: fix races around manager/worker creation and task exit
io_uring: ensure io-wq context is always destroyed for tasks
arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
io_uring: cleanup ->user usage
io-wq: remove nr_process accounting
io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
net: remove cmsg restriction from io_uring based send/recvmsg calls
Revert "proc: don't allow async path resolution of /proc/self components"
Revert "proc: don't allow async path resolution of /proc/thread-self components"
io_uring: move SQPOLL thread io-wq forked worker
io-wq: make io_wq_fork_thread() available to other users
io-wq: only remove worker from free_list, if it was there
io_uring: remove io_identity
io_uring: remove any grabbing of context
...
- Fix physical vs virtual confusion in some basic mm macros and
routines. Caused by __pa == __va on s390 currently.
- Get rid of on-stack cpu masks.
- Add support for complete CPU counter set extraction.
- Add arch_irq_work_raise implementation.
- virtio-ccw revision and opcode fixes.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmA5UHMACgkQjYWKoQLX
FBh/Vgf/ezaC/qx8cPAJKemWTST5tK/cc3g63L5SlvIsiKTTcO08+PYpGEo5ajU8
0DDpDZdP+DYeDgLatrFFj+MlQyIL4wd602uKiRqD1LwjTR1oA+6HDDQtE41v2Z2a
t2Kuv32dGVT5361RythnerTdMx18XG6k77JprP6b7zCFa2qCpc8DaKk7FvcqnQt+
gfebknC/hdL15IJhtpPrIoqqerDXxB+HU+dSouipQwiamtENGFOuJ5csgl9XIJuR
ILzq3Ad/6u8yKf77c+q9WRvu3DTIsXXSY8xwl3zJjNqXcAnN2sa4apc7+noNJ0YZ
UkbokSshQomfOYMhIxyLs9AKEEcltg==
=YVkj
-----END PGP SIGNATURE-----
Merge tag 's390-5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Vasily Gorbik:
- Fix physical vs virtual confusion in some basic mm macros and
routines. Caused by __pa == __va on s390 currently.
- Get rid of on-stack cpu masks.
- Add support for complete CPU counter set extraction.
- Add arch_irq_work_raise implementation.
- virtio-ccw revision and opcode fixes.
* tag 's390-5.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/cpumf: Add support for complete counter set extraction
virtio/s390: implement virtio-ccw revision 2 correctly
s390/smp: implement arch_irq_work_raise()
s390/topology: move cpumasks away from stack
s390/smp: smp_emergency_stop() - move cpumask away from stack
s390/smp: __smp_rescan_cpus() - move cpumask away from stack
s390/smp: consolidate locking for smp_rescan()
s390/mm: fix phys vs virt confusion in vmem_*() functions family
s390/mm: fix phys vs virt confusion in pgtable allocation routines
s390/mm: fix invalid __pa() usage in pfn_pXd() macros
s390/mm: make pXd_deref() macros return a pointer
s390/opcodes: rename selhhhr to selfhr
The irq stack switching was moved out of the ASM entry code in course of
the entry code consolidation. It ended up being suboptimal in various
ways.
- Make the stack switching inline so the stackpointer manipulation is not
longer at an easy to find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- A objtool fix for CONFIG_FRAME_POINTER=y builds where it got confused
about the stack pointer manipulation.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmA21OcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaX0D/9S0ud6oqbsIvI8LwhvYub63a2cjKP9
liHAJ7xwMYYVwzf0skwsPb/QE6+onCzdq0upJkgG/gEYm2KbiaMWZ4GgHdj0O7ER
qXKJONDd36AGxSEdaVzLY5kPuD/mkomGk5QdaZaTmjruthkNzg4y/N2wXUBIMZR0
FdpSpp5fGspSZCn/DXDx6FjClwpLI53VclvDs6DcZ2DIBA0K+F/cSLb1UQoDLE1U
hxGeuNa+GhKeeZ5C+q5giho1+ukbwtjMW9WnKHAVNiStjm0uzdqq7ERGi/REvkcB
LY62u5uOSW1zIBMmzUjDDQEqvypB0iFxFCpN8g9sieZjA0zkaUioRTQyR+YIQ8Cp
l8LLir0dVQivR1bHghHDKQJUpdw/4zvDj4mMH10XHqbcOtIxJDOJHC5D00ridsAz
OK0RlbAJBl9FTdLNfdVReBCoehYAO8oefeyMAG12nZeSh5XVUWl238rvzmzIYNhG
cEtkSx2wIUNEA+uSuI+xvfmwpxL7voTGvqmiRDCAFxyO7Bl/GBu9OEBFA1eOvHB+
+wTmPDMswRetQNh4QCRXzk1JzP1Wk5CobUL9iinCWFoTJmnsPPSOWlosN6ewaNXt
kYFpRLy5xt9EP7dlfgBSjiRlthDhTdMrFjD5bsy1vdm1w7HKUo82lHa4O8Hq3PHS
tinKICUqRsbjig==
=Sqr1
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 irq entry updates from Thomas Gleixner:
"The irq stack switching was moved out of the ASM entry code in course
of the entry code consolidation. It ended up being suboptimal in
various ways.
This reworks the X86 irq stack handling:
- Make the stack switching inline so the stackpointer manipulation is
not longer at an easy to find place.
- Get rid of the unnecessary indirect call.
- Avoid the double stack switching in interrupt return and reuse the
interrupt stack for softirq handling.
- A objtool fix for CONFIG_FRAME_POINTER=y builds where it got
confused about the stack pointer manipulation"
* tag 'x86-entry-2021-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix stack-swizzle for FRAME_POINTER=y
um: Enforce the usage of asm-generic/softirq_stack.h
x86/softirq/64: Inline do_softirq_own_stack()
softirq: Move do_softirq_own_stack() to generic asm header
softirq: Move __ARCH_HAS_DO_SOFTIRQ to Kconfig
x86: Select CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
x86/softirq: Remove indirection in do_softirq_own_stack()
x86/entry: Use run_sysvec_on_irqstack_cond() for XEN upcall
x86/entry: Convert device interrupts to inline stack switching
x86/entry: Convert system vectors to irq stack macro
x86/irq: Provide macro for inlining irq stack switching
x86/apic: Split out spurious handling code
x86/irq/64: Adjust the per CPU irq stack pointer by 8
x86/irq: Sanitize irq stack tracking
x86/entry: Fix instrumentation annotation
Add support to the CPU Measurement counter facility device driver
to extract complete counter sets per CPU and per counter set from user
space. This includes a new device named /dev/hwctr and support
for the device driver functions open, close and ioctl. Other
functions are not supported.
The ioctl command supports 3 subcommands:
S390_HWCTR_START: enables counter sets on a list of CPUs.
S390_HWCTR_STOP: disables counter sets on a list of CPUs.
S390_HWCTR_READ: reads counter sets on a list of CPUs.
The ioctl(..., S390_HWCTR_READ, ...) is the only subcommand which
returns data. It requires member data_bytes to be positive and
indicates the maximum amount of data available to store counter set
data. The other ioctl() subcommands do not use this member and it
should be set to zero.
The S390_HWCTR_READ subcommand returns the following data:
The cpuset data is flattened using the following scheme, stored in member
data:
0x0 0x8 0xc 0x10 0x10 0x18 0x20 0x28 0xU-1
+---------+-----+---------+-----+---------+-----+-----+------+------+
| no_cpus | cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
+---------+-----+---------+-----+---------+-----+-----+------+------+
0xU 0xU+4 0xU+8 0xU+10 0xV-1
+-----+---------+-----+-----+------+------+
| set | no_cnts | cv1 | cv2 | .... | cv_n |
+-----+---------+-----+-----+------+------+
0xV 0xV+4 0xV+8 0xV+c
+-----+---------+-----+---------+-----+-----+------+------+
| cpu | no_sets | set | no_cnts | cv1 | cv2 | .... | cv_n |
+-----+---------+-----+---------+-----+-----+------+------+
U and V denote arbitrary hexadezimal addresses.
The first integer represents the number of CPUs data was extracted
from. This is followed by CPU number and number of counter sets extracted.
Both are two integer values. This is followed by the set identifer
and number of counters extracted. Both are two integer values. This is
followed by the counter values, each element is eight bytes in size.
The S390_HWCTR_READ ioctl subcommand is also limited to one call per
minute. This ensures that an application does not read out the
counter sets too often and reduces the overall CPU performance.
The complete counter set extraction is an expensive operation.
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The immediate need to have this is to have bpf_send_signal() send the
signal ASAP instead of during the next hrtimer interrupt. However, it
should also improve irq_work_queue() latencies in general, as well as
get s390 out of the lame architectures list [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/irq_work.c?h=v5.11#n45
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make cpumasks static variables to avoid potential large stack
frames. There shouldn't be any concurrent callers since all current
callers are serialized with the cpu hotplug lock.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make "cpumask_t cpumask" a static variable to avoid a potential large
stack frame. Also protect against potential concurrent callers by
introducing a local lock.
Note: smp_emergency_stop() gets only called with irqs and machine
checks disabled, therefore a cpu local deadlock is not possible. For
concurrent callers the first cpu which enters the critical section
wins and will stop all other cpus.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Avoid a potentially large stack frame and overflow by making
"cpumask_t avail" a static variable. There is no concurrent
access due to the existing locking.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Move locking to __smp_rescan() instead of duplicating it to all call sites.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdfhttps://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
1d7b902e28
In order to support idmapped mounts, filesystems need to be changed
and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
patches to convert individual filesystem are not very large or
complicated overall as can be seen from the included fat, ext4, and
xfs ports. Patches for other filesystems are actively worked on and
will be sent out separately. The xfstestsuite can be used to verify
that port has been done correctly.
The mount_setattr() syscall is motivated independent of the idmapped
mounts patches and it's been around since July 2019. One of the most
valuable features of the new mount api is the ability to perform
mounts based on file descriptors only.
Together with the lookup restrictions available in the openat2()
RESOLVE_* flag namespace which we added in v5.6 this is the first time
we are close to hardened and race-free (e.g. symlinks) mounting and
path resolution.
While userspace has started porting to the new mount api to mount
proper filesystems and create new bind-mounts it is currently not
possible to change mount options of an already existing bind mount in
the new mount api since the mount_setattr() syscall is missing.
With the addition of the mount_setattr() syscall we remove this last
restriction and userspace can now fully port to the new mount api,
covering every use-case the old mount api could. We also add the
crucial ability to recursively change mount options for a whole mount
tree, both removing and adding mount options at the same time. This
syscall has been requested multiple times by various people and
projects.
There is a simple tool available at
https://github.com/brauner/mount-idmapped
that allows to create idmapped mounts so people can play with this
patch series. I'll add support for the regular mount binary should you
decide to pull this in the following weeks:
Here's an example to a simple idmapped mount of another user's home
directory:
u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt
u1001@f2-vm:/$ ls -al /home/ubuntu/
total 28
drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ ls -al /mnt/
total 28
drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
-rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
-rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
-rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo
u1001@f2-vm:/$ touch /mnt/my-file
u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file
u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file
u1001@f2-vm:/$ ls -al /mnt/my-file
-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file
u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file
u1001@f2-vm:/$ getfacl /mnt/my-file
getfacl: Removing leading '/' from absolute path names
# file: mnt/my-file
# owner: u1001
# group: u1001
user::rw-
user:u1001:rwx
group::rw-
mask::rwx
other::r--
u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
getfacl: Removing leading '/' from absolute path names
# file: home/ubuntu/my-file
# owner: ubuntu
# group: ubuntu
user::rw-
user:ubuntu:rwx
group::rw-
mask::rwx
other::r--"
* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
xfs: support idmapped mounts
ext4: support idmapped mounts
fat: handle idmapped mounts
tests: add mount_setattr() selftests
fs: introduce MOUNT_ATTR_IDMAP
fs: add mount_setattr()
fs: add attr_flags_to_mnt_flags helper
fs: split out functions to hold writers
namespace: only take read lock in do_reconfigure_mnt()
mount: make {lock,unlock}_mount_hash() static
namespace: take lock_mount_hash() directly when changing flags
nfs: do not export idmapped mounts
overlayfs: do not mount on top of idmapped mounts
ecryptfs: do not mount on top of idmapped mounts
ima: handle idmapped mounts
apparmor: handle idmapped mounts
fs: make helpers idmap mount aware
exec: handle idmapped mounts
would_dump: handle idmapped mounts
...
PF_IO_WORKER are kernel threads too, but they aren't PF_KTHREAD in the
sense that we don't assign ->set_child_tid with our own structure. Just
ensure that every arch sets up the PF_IO_WORKER threads like kthreads
in the arch implementation of copy_thread().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Convert to using the generic entry infrastructure.
- Add vdso time namespace support.
- Switch s390 and alpha to 64-bit ino_t. As discussed here
lkml.kernel.org/r/YCV7QiyoweJwvN+m@osiris
- Get rid of expensive stck (store clock) usages where possible. Utilize
cpu alternatives to patch stckf when supported.
- Make tod_clock usage less error prone by converting it to a union and
rework code which is using it.
- Machine check handler fixes and cleanups.
- Drop couple of minor inline asm optimizations to fix clang build.
- Default configs changes notably to make libvirt happy.
- Various changes to rework and improve qdio code.
- Other small various fixes and improvements all over the code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmAyzcwACgkQjYWKoQLX
FBjjMwgAmeY3oMkj93bnUF/OnbYTJQ0ZHmlyeboKt7SnFyvNpOVGyRfl7+fPHsNu
+t9QZQk0f7fSxbcC04gz0ZMw1YbTjWihgZJsN6s+qtrRsv/kVqKr7kvhFrcs8uSZ
rLiwIRWGVAbprnJZWCNqaGpKkOM0wPYZ5W3Mtnoxe4nTM2LwSu2RWI8ibTGYLQPy
FybKos2hYOFBTGQdrxmg1zAvpE8DJg4qQNLhYvnmHd8Bw/FNBmoyhx8rS8z06NmS
dWMk7pfvQaslIIaFC3Yo7/sJVa/JJH33FlBonc+MSO8OZz5O6vG4bk9ZHq6DfHUH
V1I38xiBdYdSXDq8QqT3N9d+CtjeMQ==
=Lt/v
-----END PGP SIGNATURE-----
Merge tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Convert to using the generic entry infrastructure.
- Add vdso time namespace support.
- Switch s390 and alpha to 64-bit ino_t. As discussed at
https://lore.kernel.org/linux-mm/YCV7QiyoweJwvN+m@osiris/
- Get rid of expensive stck (store clock) usages where possible.
Utilize cpu alternatives to patch stckf when supported.
- Make tod_clock usage less error prone by converting it to a union and
rework code which is using it.
- Machine check handler fixes and cleanups.
- Drop couple of minor inline asm optimizations to fix clang build.
- Default configs changes notably to make libvirt happy.
- Various changes to rework and improve qdio code.
- Other small various fixes and improvements all over the code.
* tag 's390-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (68 commits)
s390/qdio: remove 'merge_pending' mechanism
s390/qdio: improve handling of PENDING buffers for QEBSM devices
s390/qdio: rework q->qdio_error indication
s390/qdio: inline qdio_kick_handler()
s390/time: remove get_tod_clock_ext()
s390/crypto: use store_tod_clock_ext()
s390/hypfs: use store_tod_clock_ext()
s390/debug: use union tod_clock
s390/kvm: use union tod_clock
s390/vdso: use union tod_clock
s390/time: convert tod_clock_base to union
s390/time: introduce new store_tod_clock_ext()
s390/time: rename store_tod_clock_ext() and use union tod_clock
s390/time: introduce union tod_clock
s390,alpha: switch to 64-bit ino_t
s390: split cleanup_sie
s390: use r13 in cleanup_sie as temp register
s390: fix kernel asce loading when sie is interrupted
s390: add stack for machine check handler
s390: use WRITE_ONCE when re-allocating async stack
...
Pull ELF compat updates from Al Viro:
"Sanitizing ELF compat support, especially for triarch architectures:
- X32 handling cleaned up
- MIPS64 uses compat_binfmt_elf.c both for O32 and N32 now
- Kconfig side of things regularized
Eventually I hope to have compat_binfmt_elf.c killed, with both native
and compat built from fs/binfmt_elf.c, with -DELF_BITS={64,32} passed
by kbuild, but that's a separate story - not included here"
* 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
get rid of COMPAT_ELF_EXEC_PAGESIZE
compat_binfmt_elf: don't bother with undef of ELF_ARCH
Kconfig: regularize selection of CONFIG_BINFMT_ELF
mips compat: switch to compat_binfmt_elf.c
mips: don't bother with ELF_CORE_EFLAGS
mips compat: don't bother with ELF_ET_DYN_BASE
mips: KVM_GUEST makes no sense for 64bit builds...
mips: kill unused definitions in binfmt_elf[on]32.c
mips binfmt_elf*32.c: use elfcore-compat.h
x32: make X32, !IA32_EMULATION setups able to execute x32 binaries
[amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly
elf_prstatus: collect the common part (everything before pr_reg) into a struct
binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID