'struct ppc64_opd_entry' is somehow redundant with 'struct func_desc',
the later is more correct/complete as it includes the third
field which is unused.
So use 'struct func_desc' instead of 'struct ppc64_opd_entry'
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/34e76bac6cbe95a63ecd37df69fb7feb93b0ea7c.1644928018.git.christophe.leroy@csgroup.eu
There are three architectures with function descriptors, try to
have common names for the address they contain in order to
refactor some functions into generic functions later.
powerpc has 'entry'
ia64 has 'ip'
parisc has 'addr'
Vote for 'addr' and update 'func_descr_t' accordingly.
Move it in asm/elf.h to have it at the same place on all
three architectures, remove the typedef which hides its real
type, and change it to a smoother name 'struct func_desc'.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/529b2ba1d001e8f628ef0d30e8044c9b3d0a4921.1644928018.git.christophe.leroy@csgroup.eu
Presently PAPR doesn't support injecting smart errors on an
NVDIMM. This makes testing the NVDIMM health reporting functionality
difficult as simulating NVDIMM health related events need a hacked up
qemu version.
To solve this problem this patch proposes simulating certain set of
NVDIMM health related events in papr_scm. Specifically 'fatal' health
state and 'dirty' shutdown state. These error can be injected via the
user-space 'ndctl-inject-smart(1)' command. With the proposed patch and
corresponding ndctl patches following command flow is expected:
$ sudo ndctl list -DH -d nmem0
...
"health_state":"ok",
"shutdown_state":"clean",
...
# inject unsafe shutdown and fatal health error
$ sudo ndctl inject-smart nmem0 -Uf
...
"health_state":"fatal",
"shutdown_state":"dirty",
...
# uninject all errors
$ sudo ndctl inject-smart nmem0 -N
...
"health_state":"ok",
"shutdown_state":"clean",
...
The patch adds a new member 'health_bitmap_inject_mask' inside struct
papr_scm_priv which is then bitwise ANDed to the health bitmap fetched from the
hypervisor. The value for 'health_bitmap_inject_mask' is accessible from sysfs
at nmemX/papr/health_bitmap_inject.
A new PDSM named 'SMART_INJECT' is proposed that accepts newly
introduced 'struct nd_papr_pdsm_smart_inject' as payload thats
exchanged between libndctl and papr_scm to indicate the requested
smart-error states.
When the processing the PDSM 'SMART_INJECT', papr_pdsm_smart_inject()
constructs a pair or 'inject_mask' and 'clear_mask' bitmaps from the payload
and bit-blt it to the 'health_bitmap_inject_mask'. This ensures the after being
fetched from the hypervisor, the health_bitmap reflects requested smart-error
states.
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220124202204.1488346-1-vaibhav@linux.ibm.com
When FL_SAVE_REGS is not set we get here via ftrace_caller()
which doesn't save all registers.
ftrace_caller() explicitely clears regs.msr, so we can rely
on it to know where we come from. We don't expect MSR register
to be 0 at all when involving ftrace.
Fixes: 40b035efe2 ("powerpc/ftrace: Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS")
Reported-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2f9a7e898c93cc7438ef5ccd47cb9c3a9c5b53ef.1644949750.git.christophe.leroy@csgroup.eu
Warnings in assembly must use EMIT_WARN_ENTRY in order to generate
the necessary entry in exception table.
Check in EMIT_BUG_ENTRY that flags don't include BUGFLAG_WARNING.
This change avoids problems like the one fixed by
commit fd1eaaaaa6 ("powerpc/64s: Use EMIT_WARN_ENTRY for SRR debug
warnings").
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ddcb422102a37eb45f57694c7ef0ec6187964dff.1644742951.git.christophe.leroy@csgroup.eu
Our skiroot_defconfig doesn't enable FTRACE, and so doesn't get
STACKTRACE enabled either. That leads to a build failure since commit
1614b2b11f ("arch: Make ARCH_STACKWALK independent of STACKTRACE")
made stacktrace.c build even when STACKTRACE=n.
arch/powerpc/kernel/stacktrace.c: In function ‘handle_backtrace_ipi’:
arch/powerpc/kernel/stacktrace.c:171:2: error: implicit declaration of function ‘nmi_cpu_backtrace’
171 | nmi_cpu_backtrace(regs);
| ^~~~~~~~~~~~~~~~~
arch/powerpc/kernel/stacktrace.c: In function ‘arch_trigger_cpumask_backtrace’:
arch/powerpc/kernel/stacktrace.c:226:2: error: implicit declaration of function ‘nmi_trigger_cpumask_backtrace’
226 | nmi_trigger_cpumask_backtrace(mask, exclude_self, raise_backtrace_ipi);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This happens because our headers haven't defined
arch_trigger_cpumask_backtrace, which causes lib/nmi_backtrace.c not to
build nmi_cpu_backtrace().
The code in question doesn't actually depend on STACKTRACE=y, that was
just added because arch_trigger_cpumask_backtrace() lived in
stacktrace.c for convenience. So drop the dependency on
CONFIG_STACKTRACE, that causes lib/nmi_backtrace.c to build
nmi_cpu_backtrace() etc. and fixes the build.
Fixes: 1614b2b11f ("arch: Make ARCH_STACKWALK independent of STACKTRACE")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220212111349.2806972-1-mpe@ellerman.id.au
commit: d9c2340052 ("Do not depend on MAX_ORDER when grouping pages by mobility")
introduced pageblock_order which will be used to group pages better.
The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
should be set before we call set_pageblock_order.
set_pageblock_order happens early in the boot and default hugetlb page size
should be initialized before that to compute the right pageblock_order value.
Currently, default hugetlbe page size is set via arch_initcalls which happens
late in the boot as shown via the below callstack:
[c000000007383b10] [c000000001289328] hugetlbpage_init+0x2b8/0x2f8
[c000000007383bc0] [c0000000012749e4] do_one_initcall+0x14c/0x320
[c000000007383c90] [c00000000127505c] kernel_init_freeable+0x410/0x4e8
[c000000007383da0] [c000000000012664] kernel_init+0x30/0x15c
[c000000007383e10] [c00000000000cf14] ret_from_kernel_thread+0x5c/0x64
and the pageblock_order initialization is done early during the boot.
[c0000000018bfc80] [c0000000012ae120] set_pageblock_order+0x50/0x64
[c0000000018bfca0] [c0000000012b3d94] sparse_init+0x188/0x268
[c0000000018bfd60] [c000000001288bfc] initmem_init+0x28c/0x328
[c0000000018bfe50] [c00000000127b370] setup_arch+0x410/0x480
[c0000000018bfed0] [c00000000127401c] start_kernel+0xb8/0x934
[c0000000018bff90] [c00000000000d984] start_here_common+0x1c/0x98
delaying default hugetlb page size initialization implies the kernel will
initialize pageblock_order to (MAX_ORDER - 1) which is not an optimal
value for mobility grouping. IIUC we always had this issue. But it was not
a problem for hash translation mode because (MAX_ORDER - 1) is the same as
HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
HUGETLB_PAGE_ORDER will be 5 (2M size) and hence pageblock_order should be
5 instead of 8.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220211065215.101767-1-aneesh.kumar@linux.ibm.com
pseries_devicetree_update() has only one call site, in the same file in
which it is defined. Make it static.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220207221247.354454-1-nathanl@linux.ibm.com
cvdso_call_time macro is very similar to cvdso_call macro.
Add a call_time argument to cvdso_call which is 0 by default
and set to 1 when using cvdso_call to call __c_kernel_time().
Return returned value as is with CR[SO] cleared when it is used
for time().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/837a260ad86fc1ce297a562c2117fd69be5f7b5c.1642782130.git.christophe.leroy@csgroup.eu
VDSO64 cacheflush.S datapage.S gettimeofday.S and vgettimeofday.c
are very similar to their VDSO32 counterpart.
VDSO32 counterpart is already more complete than the VDSO64 version
as it supports both PPC32 vdso and 32 bits VDSO for PPC64.
Use compat macros wherever necessary in PPC32 files
so that they can also be used to build VDSO64.
vdso64/note.S is already a link to vdso32/note.S so
no change is required.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c2cbb8f046b7efc251053521dc39b752795e26b7.1642782130.git.christophe.leroy@csgroup.eu
set_memory_attr() was implemented by commit 4d1755b6a7 ("powerpc/mm:
implement set_memory_attr()") because the set_memory_xx() couldn't
be used at that time to modify memory "on the fly" as explained it
the commit.
But set_memory_attr() uses set_pte_at() which leads to warnings when
CONFIG_DEBUG_VM is selected, because set_pte_at() is unexpected for
updating existing page table entries.
The check could be bypassed by using __set_pte_at() instead,
as it was the case before commit c988cfd38e ("powerpc/32:
use set_memory_attr()") but since commit 9f7853d760 ("powerpc/mm:
Fix set_memory_*() against concurrent accesses") it is now possible
to use set_memory_xx() functions to update page table entries
"on the fly" because the update is now atomic.
For DEBUG_PAGEALLOC we need to clear and set back _PAGE_PRESENT.
Add set_memory_np() and set_memory_p() for that.
Replace all uses of set_memory_attr() by the relevant set_memory_xx()
and remove set_memory_attr().
Fixes: c988cfd38e ("powerpc/32: use set_memory_attr()")
Cc: stable@vger.kernel.org
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Maxime Bizon <mbizon@freebox.fr>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Depends-on: 9f7853d760 ("powerpc/mm: Fix set_memory_*() against concurrent accesses")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/cda2b44b55c96f9ac69fa92e68c01084ec9495c5.1640344012.git.christophe.leroy@csgroup.eu
Modern compilers are perfectly capable of extracting parallelism from
the XOR routines, provided that the prototypes reflect the nature of the
input accurately, in particular, the fact that the input vectors are
expected not to overlap. This is not documented explicitly, but is
implied by the interchangeability of the various C routines, some of
which use temporary variables while others don't: this means that these
routines only behave identically for non-overlapping inputs.
So let's decorate these input vectors with the __restrict modifier,
which informs the compiler that there is no overlap. While at it, make
the input-only vectors pointer-to-const as well.
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/563
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Modify function graph tracer to be handled directly by the standard
ftrace caller.
This is made possible as powerpc now supports
CONFIG_DYNAMIC_FTRACE_WITH_ARGS.
This change simplifies the call of function graph ftrace.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/04d196585ff81bde06a000bd9c633a33a5b21130.1640017960.git.christophe.leroy@csgroup.eu
Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS. It accelerates the call
of livepatching.
Also note that powerpc being the last one to convert to
CONFIG_DYNAMIC_FTRACE_WITH_ARGS, it will now be possible to remove
klp_arch_set_pc() on all architectures.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5831f711a778fcd6eb51eb5898f1faae4378b35b.1640017960.git.christophe.leroy@csgroup.eu
PPC64 needs some special logic to properly set up the TOC.
See commit 85baa09549 ("powerpc/livepatch: Add live patching support
on ppc64le") for details.
PPC32 doesn't have TOC so it doesn't need that logic, so adding
LIVEPATCH support is straight forward.
Add CONFIG_LIVEPATCH_64 and move livepatch stack logic into that item.
Livepatch sample modules all work.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/63cb094125b6a6038c65eeac2abaabbabe63addd.1640017960.git.christophe.leroy@csgroup.eu
arch/powerpc/include/asm/nohash/{32/64}/pgtable.h has
#define __HAVE_ARCH_PTE_SAME
#define pte_same(A,B) ((pte_val(A) ^ pte_val(B)) == 0)
include/linux/pgtable.h has
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
return pte_val(pte_a) == pte_val(pte_b);
}
#endif
Remove the powerpc version which is similar to the generic one.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/83c97bd58a3596ef1b0ff28b1e41fd492d005520.1643616989.git.christophe.leroy@csgroup.eu
At the moment KVM on PPC creates 4 types of entries under the kvm debugfs:
1) "%pid-%fd" per a KVM instance (for all platforms);
2) "vm%pid" (for PPC Book3s HV KVM);
3) "vm%u_vcpu%u_timing" (for PPC Book3e KVM);
4) "kvm-xive-%p" (for XIVE PPC Book3s KVM, the same for XICS);
The problem with this is that multiple VMs per process is not allowed for
2) and 3) which makes it possible for userspace to trigger errors when
creating duplicated debugfs entries.
This merges all these into 1).
This defines kvm_arch_create_kvm_debugfs() similar to
kvm_arch_create_vcpu_debugfs().
This defines 2 hooks in kvmppc_ops that allow specific KVM implementations
add necessary entries, this adds the _e500 suffix to
kvmppc_create_vcpu_debugfs_e500() to make it clear what platform it is for.
This makes use of already existing kvm_arch_create_vcpu_debugfs() on PPC.
This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
This stops removing vcpu entries as once created vcpus stay around
for the entire life of a VM and removed when the KVM instance is closed,
see commit d56f5136b0 ("KVM: let kvm_destroy_vm_debugfs clean up vCPU
debugfs directories").
Suggested-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220111005404.162219-1-aik@ozlabs.ru
Some of them used to be used by libbfd for a.out coredump handling.
Seeing that
* libbfd has their copies anyway
* we don't export them into userland headers
* we don't support a.out coredumps anymore
let's bury the definitions. They never had in-kernel
users anyway...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The empty unmap_page_from_agp() macro causes a warning when
building with 'make W=1' on a couple of architectures:
drivers/char/agp/generic.c: In function 'agp_generic_destroy_page':
drivers/char/agp/generic.c:1265:28: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
1265 | unmap_page_from_agp(page);
Change the definitions to a 'do { } while (0)' construct to
make these more reliable.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Helge Deller <deller@gmx.de>
The L0 is storing HFSCR requested by the L1 for the L2 in struct
kvm_nested_guest when the L1 requests a vCPU enter L2. kvm_nested_guest
is not a per-vCPU structure. Hilarity ensues.
Fix it by moving the nested hfscr into the vCPU structure together with
the other per-vCPU nested fields.
Fixes: 8b210a880b ("KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible")
Cc: stable@vger.kernel.org # v5.15+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220122105530.3477250-1-npiggin@gmail.com
- A series of bpf fixes, including an oops fix and some codegen fixes.
- Fix a regression in syscall_get_arch() for compat processes.
- Fix boot failure on some 32-bit systems with KASAN enabled.
- A couple of other build/minor fixes.
Thanks to: Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa, Johan Almbladh,
Maxime Bizon, Naveen N. Rao, Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmHtOJgTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLMeD/9eBHqMy+yhXfTw9mjAf8rVlsdY6Xjq
2LYgHCJpeP0VjxBm8GcyFtAfJc778LyykU18xsi0WVsuwPmVdVwDfZpzxWJrYqeq
6AuK9uRZAbEx6gZVCsWlk3Em7GyUEqHm5PTxKsDfbzh+hBTNQ8CcKKbCVmEZ+loA
nWdaWJOZMpE4hQOx0Ph5JRAos/ZUEt2I57qZkRgvIEV+hroFrmdP1rnncdalnLHN
pXsEn380Q+rpH+48HIpaTpl5+R8CvgFnnlI6/luZC6HoA8K6jph8FdOkkI9eU1v6
TgdNMiNVIe1wRFzVp+nH7SGqL3l9X/UPS57hL52X+ccpHoioBW9CLPha06g/+UKT
kJU3mYZg3jvQGrM9H/2R3P97UH9SwHE5SH0xcXj3ItH1NpOKpMO0h5UuWx+Ze8Wo
zO80xbYLbALtEfx8JMCf95rnUVFL0n55xVJN+7+fs2hU3Be/ZpIyHfDujlHlbd1m
o3KRhkg1a0RyGsun6Gd2vwLGaBbwYNWvgHM9IDeiAFdJ7HFpp1YrVGpQRkAIUbJS
bzYiRczjbSof7c9WqLyKNatgNFO/JPI5vi1iNQ3u9Oh9fzjpv5Zok4vK534mOnqU
QnhiqLJbY+yBpoAvSePGQTCCzDAnJq0r53n0d1O/2T1GcpX9MKM4YlL+Wu4m0yLn
RcbOCTkmoqWZEw==
=KjWT
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- A series of bpf fixes, including an oops fix and some codegen fixes.
- Fix a regression in syscall_get_arch() for compat processes.
- Fix boot failure on some 32-bit systems with KASAN enabled.
- A couple of other build/minor fixes.
Thanks to Athira Rajeev, Christophe Leroy, Dmitry V. Levin, Jiri Olsa,
Johan Almbladh, Maxime Bizon, Naveen N. Rao, and Nicholas Piggin.
* tag 'powerpc-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Mask SRR0 before checking against the masked NIP
powerpc/perf: Only define power_pmu_wants_prompt_pmi() for CONFIG_PPC64
powerpc/32s: Fix kasan_init_region() for KASAN
powerpc/time: Fix build failure due to do_hard_irq_enable() on PPC32
powerpc/audit: Fix syscall_get_arch()
powerpc64/bpf: Limit 'ldbrx' to processors compliant with ISA v2.06
tools/bpf: Rename 'struct event' to avoid naming conflict
powerpc/bpf: Update ldimm64 instructions during extra pass
powerpc32/bpf: Fix codegen for bpf-to-bpf calls
bpf: Guard against accessing NULL pt_regs in bpf_get_task_stack()
-----BEGIN PGP SIGNATURE-----
iQHJBAABCgAzFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmHi+xgVHHl1cnkubm9y
b3ZAZ21haWwuY29tAAoJELFEgP06H77IxdoMAMf3E+L51Ys/4iAiyJQNVoT3aIBC
A8ZVOB9he1OA3o3wBNIRKmICHk+ovnfCWcXTr9fG/Ade2wJz88NAsGPQ1Phywb+s
iGlpySllFN72RT9ZqtJhLEzgoHHOL0CzTW07TN9GJy4gQA2h2G9CTP+OmsQdnVqE
m9Fn3PSlJ5lhzePlKfnln8rGZFgrriJakfEFPC79n/7an4+2Hvkb5rWigo7KQc4Z
9YNqYUcHWZFUgq80adxEb9LlbMXdD+Z/8fCjOrAatuwVkD4RDt6iKD0mFGjHXGL7
MZ9KRS8AfZXawmetk3jjtsV+/QkeS+Deuu7k0FoO0Th2QV7BGSDhsLXAS5By/MOC
nfSyHhnXHzCsBMyVNrJHmNhEZoN29+tRwI84JX9lWcf/OLANcCofnP6f2UIX7tZY
CAZAgVELp+0YQXdybrfzTQ8BT3TinjS/aZtCrYijRendI1GwUXcyl69vdOKqAHuk
5jy8k/xHyp+ZWu6v+PyAAAEGowY++qhL0fmszA==
=RKW4
-----END PGP SIGNATURE-----
Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux
Pull bitmap updates from Yury Norov:
- introduce for_each_set_bitrange()
- use find_first_*_bit() instead of find_next_*_bit() where possible
- unify for_each_bit() macros
* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
vsprintf: rework bitmap_list_string
lib: bitmap: add performance test for bitmap_print_to_pagebuf
bitmap: unify find_bit operations
mm/percpu: micro-optimize pcpu_is_populated()
Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
find: micro-optimize for_each_{set,clear}_bit()
include/linux: move for_each_bit() macros from bitops.h to find.h
cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
tools: sync tools/bitmap with mother linux
all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
cpumask: use find_first_and_bit()
lib: add find_first_and_bit()
arch: remove GENERIC_FIND_FIRST_BIT entirely
include: move find.h from asm_generic to linux
bitops: move find_bit_*_le functions from le.h to find.h
bitops: protect find_first_{,zero}_bit properly
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into
KVM's 'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to
a simpler state and less shared data between EL1 and EL2 in
the nVHE case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be
unmapped from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once
the vcpu xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and
page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmHhxvsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPZkAf+Nz92UL/5nNGcdHtE4m7AToMmitE9
bYkesf9BMQvAe5wjkABLuoHGi6ay4jabo4fiGzbdkiK7lO5YgfsWiMB3/MT5fl4E
jRPzaVQabp3YZLM8UYCBmfUVuRj524S967SfSRe0AvYjDEH8y7klPf4+7sCsFT0/
Px9Vf2KGuOlf0eM78yKg4rGaF0jS22eLgXm6FfNMY8/e29ZAo/jyUmqBY+Z2xxZG
aWhceDtSheW1jwLHLj3nOlQJvHTn8LVGXBE/R8Gda3ZjrBV2rKaDi4Fh+HD+dz86
2zVXwzQ7uck2CMW73GMoXMTWoKSHMyvlBOs1BdvBm4UsnGcXR+q8IFCeuQ==
=s73m
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"RISCV:
- Use common KVM implementation of MMU memory caches
- SBI v0.2 support for Guest
- Initial KVM selftests support
- Fix to avoid spurious virtual interrupts after clearing hideleg CSR
- Update email address for Anup and Atish
ARM:
- Simplification of the 'vcpu first run' by integrating it into KVM's
'pid change' flow
- Refactoring of the FP and SVE state tracking, also leading to a
simpler state and less shared data between EL1 and EL2 in the nVHE
case
- Tidy up the header file usage for the nvhe hyp object
- New HYP unsharing mechanism, finally allowing pages to be unmapped
from the Stage-1 EL2 page-tables
- Various pKVM cleanups around refcounting and sharing
- A couple of vgic fixes for bugs that would trigger once the vcpu
xarray rework is merged, but not sooner
- Add minimal support for ARMv8.7's PMU extension
- Rework kvm_pgtable initialisation ahead of the NV work
- New selftest for IRQ injection
- Teach selftests about the lack of default IPA space and page sizes
- Expand sysreg selftest to deal with Pointer Authentication
- The usual bunch of cleanups and doc update
s390:
- fix sigp sense/start/stop/inconsistency
- cleanups
x86:
- Clean up some function prototypes more
- improved gfn_to_pfn_cache with proper invalidation, used by Xen
emulation
- add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
- completely remove potential TOC/TOU races in nested SVM consistency
checks
- update some PMCs on emulated instructions
- Intel AMX support (joint work between Thomas and Intel)
- large MMU cleanups
- module parameter to disable PMU virtualization
- cleanup register cache
- first part of halt handling cleanups
- Hyper-V enlightened MSR bitmap support for nested hypervisors
Generic:
- clean up Makefiles
- introduce CONFIG_HAVE_KVM_DIRTY_RING
- optimize memslot lookup using a tree
- optimize vCPU array usage by converting to xarray"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (268 commits)
x86/fpu: Fix inline prefix warnings
selftest: kvm: Add amx selftest
selftest: kvm: Move struct kvm_x86_state to header
selftest: kvm: Reorder vcpu_load_state steps for AMX
kvm: x86: Disable interception for IA32_XFD on demand
x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state()
kvm: selftests: Add support for KVM_CAP_XSAVE2
kvm: x86: Add support for getting/setting expanded xstate buffer
x86/fpu: Add uabi_size to guest_fpu
kvm: x86: Add CPUID support for Intel AMX
kvm: x86: Add XCR0 support for Intel AMX
kvm: x86: Disable RDMSR interception of IA32_XFD_ERR
kvm: x86: Emulate IA32_XFD_ERR for guest
kvm: x86: Intercept #NM for saving IA32_XFD_ERR
x86/fpu: Prepare xfd_err in struct fpu_guest
kvm: x86: Add emulation for IA32_XFD
x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation
kvm: x86: Enable dynamic xfeatures at KVM_SET_CPUID2
x86/fpu: Provide fpu_enable_guest_xfd_features() for KVM
x86/fpu: Add guest support to xfd_enable_feature()
...
It has been reported some configuration where the kernel doesn't
boot with KASAN enabled.
This is due to wrong BAT allocation for the KASAN area:
---[ Data Block Address Translation ]---
0: 0xc0000000-0xcfffffff 0x00000000 256M Kernel rw m
1: 0xd0000000-0xdfffffff 0x10000000 256M Kernel rw m
2: 0xe0000000-0xefffffff 0x20000000 256M Kernel rw m
3: 0xf8000000-0xf9ffffff 0x2a000000 32M Kernel rw m
4: 0xfa000000-0xfdffffff 0x2c000000 64M Kernel rw m
A BAT must have both virtual and physical addresses alignment matching
the size of the BAT. This is not the case for BAT 4 above.
Fix kasan_init_region() by using block_size() function that is in
book3s32/mmu.c. To be able to reuse it here, make it non static and
change its name to bat_block_size() in order to avoid name conflict
with block_size() defined in <linux/blkdev.h>
Also reuse find_free_bat() to avoid an error message from setbat()
when no BAT is available.
And allocate memory outside of linear memory mapping to avoid
wasting that precious space.
With this change we get correct alignment for BATs and KASAN shadow
memory is allocated outside the linear memory space.
---[ Data Block Address Translation ]---
0: 0xc0000000-0xcfffffff 0x00000000 256M Kernel rw
1: 0xd0000000-0xdfffffff 0x10000000 256M Kernel rw
2: 0xe0000000-0xefffffff 0x20000000 256M Kernel rw
3: 0xf8000000-0xfbffffff 0x7c000000 64M Kernel rw
4: 0xfc000000-0xfdffffff 0x7a000000 32M Kernel rw
Fixes: 7974c47326 ("powerpc/32s: Implement dedicated kasan_init_region()")
Cc: stable@vger.kernel.org
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7a50ef902494d1325227d47d33dada01e52e5518.1641818726.git.christophe.leroy@csgroup.eu
find_bit API and bitmap API are closely related, but inclusion paths
are different - include/asm-generic and include/linux, correspondingly.
In the past it made a lot of troubles due to circular dependencies
and/or undefined symbols. Fix this by moving find.h under include/linux.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Commit 770cec16cd ("powerpc/audit: Simplify syscall_get_arch()")
and commit 898a1ef06a ("powerpc/audit: Avoid unneccessary #ifdef
in syscall_get_arguments()")
replaced test_tsk_thread_flag(task, TIF_32BIT)) by is_32bit_task().
But is_32bit_task() applies on current task while be want the test
done on task 'task'
So add a new macro is_tsk_32bit_task() to check any task.
Fixes: 770cec16cd ("powerpc/audit: Simplify syscall_get_arch()")
Fixes: 898a1ef06a ("powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()")
Cc: stable@vger.kernel.org
Reported-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c55cddb8f65713bf5859ed675d75a50cb37d5995.1642159570.git.christophe.leroy@csgroup.eu
Johan reported the below crash with test_bpf on ppc64 e5500:
test_bpf: #296 ALU_END_FROM_LE 64: 0x0123456789abcdef -> 0x67452301 jited:1
Oops: Exception in kernel mode, sig: 4 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500
Modules linked in: test_bpf(+)
CPU: 0 PID: 76 Comm: insmod Not tainted 5.14.0-03771-g98c2059e008a-dirty #1
NIP: 8000000000061c3c LR: 80000000006dea64 CTR: 8000000000061c18
REGS: c0000000032d3420 TRAP: 0700 Not tainted (5.14.0-03771-g98c2059e008a-dirty)
MSR: 0000000080089000 <EE,ME> CR: 88002822 XER: 20000000 IRQMASK: 0
<...>
NIP [8000000000061c3c] 0x8000000000061c3c
LR [80000000006dea64] .__run_one+0x104/0x17c [test_bpf]
Call Trace:
.__run_one+0x60/0x17c [test_bpf] (unreliable)
.test_bpf_init+0x6a8/0xdc8 [test_bpf]
.do_one_initcall+0x6c/0x28c
.do_init_module+0x68/0x28c
.load_module+0x2460/0x2abc
.__do_sys_init_module+0x120/0x18c
.system_call_exception+0x110/0x1b8
system_call_common+0xf0/0x210
--- interrupt: c00 at 0x101d0acc
<...>
---[ end trace 47b2bf19090bb3d0 ]---
Illegal instruction
The illegal instruction turned out to be 'ldbrx' emitted for
BPF_FROM_[L|B]E, which was only introduced in ISA v2.06. Guard use of
the same and implement an alternative approach for older processors.
Fixes: 156d0e290e ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Reported-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d1e51c6fdf572062cf3009a751c3406bda01b832.1641468127.git.naveen.n.rao@linux.vnet.ibm.com
- Optimise radix KVM guest entry/exit by 2x on Power9/Power10.
- Allow firmware to tell us whether to disable the entry and uaccess flushes on Power10
or later CPUs.
- Add BPF_PROBE_MEM support for 32 and 64-bit BPF jits.
- Several fixes and improvements to our hard lockup watchdog.
- Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on 32-bit.
- Allow building the 64-bit Book3S kernel without hash MMU support, ie. Radix only.
- Add KUAP (SMAP) support for 40x, 44x, 8xx, Book3E (64-bit).
- Add new encodings for perf_mem_data_src.mem_hops field, and use them on Power10.
- A series of small performance improvements to 64-bit interrupt entry.
- Several commits fixing issues when building with the clang integrated assembler.
- Many other small features and fixes.
Thanks to: Alan Modra, Alexey Kardashevskiy, Ammar Faizi, Anders Roxell, Arnd Bergmann,
Athira Rajeev, Cédric Le Goater, Christophe JAILLET, Christophe Leroy, Christoph Hellwig,
Daniel Axtens, David Yang, Erhard Furtner, Fabiano Rosas, Greg Kroah-Hartman, Guo Ren,
Hari Bathini, Jason Wang, Joel Stanley, Julia Lawall, Kajol Jain, Kees Cook, Laurent
Dufour, Madhavan Srinivasan, Mark Brown, Minghao Chi, Nageswara R Sastry, Naresh Kamboju,
Nathan Chancellor, Nathan Lynch, Nicholas Piggin, Nick Child, Oliver O'Halloran, Peiwei
Hu, Randy Dunlap, Ravi Bangoria, Rob Herring, Russell Currey, Sachin Sant, Sean
Christopherson, Segher Boessenkool, Thadeu Lima de Souza Cascardo, Tyrel Datwyler, Xiang
wangx, Yang Guang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmHhVFMTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgKwzD/9UUEZzWyzMVRJvP9FPZByN2M8czxHJ
tuqEuVqnfks8ad8tfm2ebng5t8ZuVASBQU2fpPA1+lpdvprgZN5RFGMRh729vskn
2aHQPmFvFObNbXOgCoXzk+C5xYi3zoRMVM968neSPBneYo+xDicn/zN5CHAgsjhX
+baemJQ7/xzwLiZgTHe8fWw3nTk3IbPBpha59SdTvR8Moy6I4O8CDPIYEm3U3/J3
x14ZRETqjksL7YOzEBk0avm1dDZRw/johz29oRYSmCj7dyy5OqrkPwokJiRY90eA
1lVdofDc0zElaSWkVGzKdSWRUIXjKIVdtejvDeEvl6H/mI6q4TVZE8rFmn+3Rvgf
9q0iKtmw5Kn11cqgY/pgEGmxnQtIdAodNfI/t939E7+O5LbcznuYUiy0J/kTD/vl
Xduotg2dsCI+5ukf1wrk2wt9LhqZL+ziOeaBhyDM4orV8T3HBYL6zWBptun//IGO
lK6TvvCHSYnGqY4bnrAmiOnbbEtnP6nN3zbcXgSvPM0wCRHPIEqd0NRXtfISo32d
vBPq1neXWo4wrRJj9X3yOuP+5fEA4I+hB3yrCJOkcEcz+8NhlboQXU7raVsJL+bd
kze75H8hwX7kE71oJFFl13LbSNABgiLFARTBXKfvdQA2iLdR0Snvm+OouvwWRPo/
Po7Nm3zqdLc/1A==
=BxhQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Optimise radix KVM guest entry/exit by 2x on Power9/Power10.
- Allow firmware to tell us whether to disable the entry and uaccess
flushes on Power10 or later CPUs.
- Add BPF_PROBE_MEM support for 32 and 64-bit BPF jits.
- Several fixes and improvements to our hard lockup watchdog.
- Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on 32-bit.
- Allow building the 64-bit Book3S kernel without hash MMU support, ie.
Radix only.
- Add KUAP (SMAP) support for 40x, 44x, 8xx, Book3E (64-bit).
- Add new encodings for perf_mem_data_src.mem_hops field, and use them
on Power10.
- A series of small performance improvements to 64-bit interrupt entry.
- Several commits fixing issues when building with the clang integrated
assembler.
- Many other small features and fixes.
Thanks to Alan Modra, Alexey Kardashevskiy, Ammar Faizi, Anders Roxell,
Arnd Bergmann, Athira Rajeev, Cédric Le Goater, Christophe JAILLET,
Christophe Leroy, Christoph Hellwig, Daniel Axtens, David Yang, Erhard
Furtner, Fabiano Rosas, Greg Kroah-Hartman, Guo Ren, Hari Bathini, Jason
Wang, Joel Stanley, Julia Lawall, Kajol Jain, Kees Cook, Laurent Dufour,
Madhavan Srinivasan, Mark Brown, Minghao Chi, Nageswara R Sastry, Naresh
Kamboju, Nathan Chancellor, Nathan Lynch, Nicholas Piggin, Nick Child,
Oliver O'Halloran, Peiwei Hu, Randy Dunlap, Ravi Bangoria, Rob Herring,
Russell Currey, Sachin Sant, Sean Christopherson, Segher Boessenkool,
Thadeu Lima de Souza Cascardo, Tyrel Datwyler, Xiang wangx, and Yang
Guang.
* tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (240 commits)
powerpc/xmon: Dump XIVE information for online-only processors.
powerpc/opal: use default_groups in kobj_type
powerpc/cacheinfo: use default_groups in kobj_type
powerpc/sched: Remove unused TASK_SIZE_OF
powerpc/xive: Add missing null check after calling kmalloc
powerpc/floppy: Remove usage of the deprecated "pci-dma-compat.h" API
selftests/powerpc: Add a test of sigreturning to an unaligned address
powerpc/64s: Use EMIT_WARN_ENTRY for SRR debug warnings
powerpc/64s: Mask NIP before checking against SRR0
powerpc/perf: Fix spelling of "its"
powerpc/32: Fix boot failure with GCC latent entropy plugin
powerpc/code-patching: Replace patch_instruction() by ppc_inst_write() in selftests
powerpc/code-patching: Move code patching selftests in its own file
powerpc/code-patching: Move instr_is_branch_{i/b}form() in code-patching.h
powerpc/code-patching: Move patch_exception() outside code-patching.c
powerpc/code-patching: Use test_trampoline for prefixed patch test
powerpc/code-patching: Fix patch_branch() return on out-of-range failure
powerpc/code-patching: Reorganise do_patch_instruction() to ease error handling
powerpc/code-patching: Fix unmap_patch_area() error handling
powerpc/code-patching: Fix error handling in do_patch_instruction()
...
This macro isn't used in Linux sched, now. Delete in
include/linux/sched.h and arch's include/asm.
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211228064730.2882351-5-guoren@kernel.org
The purpose of selftests is to check that instructions are
properly formed. Not to check that they properly run.
For that test it uses normal memory, not special test
memory.
In preparation of a future patch enforcing patch_instruction()
to be used only on valid text areas, implement a ppc_inst_write()
instruction which is the complement of ppc_inst_read(). This
new function writes the formated instruction in valid kernel
memory and doesn't bother about icache.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7cf5335cc07ca9b6f8cdaa20ca9887fce4df3bea.1638446239.git.christophe.leroy@csgroup.eu
The llvm integrated assembler does not recognise the ISA 2.05 tlbiel
version. Work around it by switching to .long when an old arch level
detected.
Signed-off-by: Daniel Axtens <dja@axtens.net>
[aik: did "Eventually do this more smartly"]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-7-aik@ozlabs.ru
The dssall ("Data Stream Stop All") instruction is obsolete altogether
with other Data Cache Instructions since ISA 2.03 (year 2006).
LLVM IAS does not support it but PPC970 seems to be using it.
This switches dssall to .long as there is no much point in fixing LLVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-6-aik@ozlabs.ru
The LLVM integrated assembler really does not like us reassigning things
to the same label:
<instantiation>:7:9: error: invalid reassignment of non-absolute variable 'fs_label'
This happens across a bunch of platforms:
https://github.com/ClangBuiltLinux/linux/issues/1043https://github.com/ClangBuiltLinux/linux/issues/1008https://github.com/ClangBuiltLinux/linux/issues/920https://github.com/ClangBuiltLinux/linux/issues/1050
There is no hope of getting this fixed in LLVM (see
https://github.com/ClangBuiltLinux/linux/issues/1043#issuecomment-641571200
and https://bugs.llvm.org/show_bug.cgi?id=47798#c1 )
so if we want to build with LLVM_IAS, we need to hack
around it ourselves.
For us the big problem comes from this:
\#define USE_FIXED_SECTION(sname) \
fs_label = start_##sname; \
fs_start = sname##_start; \
use_ftsec sname;
\#define USE_TEXT_SECTION()
fs_label = start_text; \
fs_start = text_start; \
.text
and in particular fs_label.
This works around it by not setting those 'variables' and requiring
that users of the variables instead track for themselves what section
they are in. This isn't amazing, by any stretch, but it gets us further
in the compilation.
Note that even though users have to keep track of the section, using
a wrong one produces an error with both binutils and llvm which prevents
from using wrong section at the compile time:
llvm error example:
AS arch/powerpc/kernel/head_64.o
<unknown>:0: error: Cannot represent a difference across sections
make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1
binutils error example:
/home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
/home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S:1974: Error: can't resolve `system_call_common' {.text section} - `start_r
eal_vectors' {.head.text.real_vectors section}
make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-5-aik@ozlabs.ru
This patch future-proofs the kernel against linker changes that might
put the toc pointer at some location other than .got+0x8000, by
replacing __toc_start+0x8000 with .TOC. throughout. If the kernel's
idea of the toc pointer doesn't agree with the linker, bad things
happen.
prom_init.c code relocating its toc is also changed so that a symbolic
__prom_init_toc_start toc-pointer relative address is calculated
rather than assuming that it is always at toc-pointer - 0x8000. The
length calculations loading values from the toc are also avoided.
It's a little incestuous to do that with unreloc_toc picking up
adjusted values (which is fine in practice, they both adjust by the
same amount if all goes well).
I've also changed the way .got is aligned in vmlinux.lds and
zImage.lds, mostly so that dumping out section info by objdump or
readelf plainly shows the alignment is 256. This linker script
feature was added 2005-09-27, available in FSF binutils releases from
2.17 onwards. Should be safe to use in the kernel, I think.
Finally, put *(.got) before the prom_init.o entry which only needs
*(.toc), so that the GOT header goes in the correct place. I don't
believe this makes any difference for the kernel as it would for
dynamic objects being loaded by ld.so. That change is just to stop
lusers who blindly copy kernel scripts being led astray. Of course,
this change needs the prom_init.c changes.
Some notes on .toc and .got.
.toc is a compiler generated section of addresses. .got is a linker
generated section of addresses, generally built when the linker sees
R_*_*GOT* relocations. In the case of powerpc64 ld.bfd, there are
multiple generated .got sections, one per input object file. So you
can somewhat reasonably write in a linker script an input section
statement like *prom_init.o(.got .toc) to mean "the .got and .toc
section for files matching *prom_init.o". On other architectures that
doesn't make sense, because the linker generally has just one .got
section. Even on powerpc64, note well that the GOT entries for
prom_init.o may be merged with GOT entries from other objects. That
means that if prom_init.o references, say, _end via some GOT
relocation, and some other object also references _end via a GOT
relocation, the GOT entry for _end may be in the range
__prom_init_toc_start to __prom_init_toc_end and if the kernel does
something special to GOT/TOC entries in that range then the value of
_end as seen by objects other than prom_init.o will be affected. On
the other hand the GOT entry for _end may not be in the range
__prom_init_toc_start to __prom_init_toc_end. Which way it turns out
is deterministic but a detail of linker operation that should not be
relied on.
A feature of ld.bfd is that input .toc (and .got) sections matching
one linker input section statement may be sorted, to put entries used
by small-model code first, near the toc base. This is why scripts for
powerpc64 normally use *(.got .toc) rather than *(.got) *(.toc), since
the first form allows more freedom to sort.
Another feature of ld.bfd is that indirect addressing sequences using
the GOT/TOC may be edited by the linker to relative addressing. In
many cases relative addressing would be emitted by gcc for
-mcmodel=medium if you appropriately decorate variable declarations
with non-default visibility.
The original patch is here:
https://lore.kernel.org/linuxppc-dev/20210310034813.GM6042@bubble.grove.modra.org/
Signed-off-by: Alan Modra <amodra@au1.ibm.com>
[aik: removed non-relocatable which is gone in 24d33ac5b8]
[aik: added <=2.24 check]
[aik: because of llvm-as, kernel_toc_addr() uses "mr" instead of global register variable]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-2-aik@ozlabs.ru
Some functions defined in 'arch/powerpc/platforms/pseries' are
deserving of an `__init` macro attribute. These functions are only
called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-13-nick.child@ibm.com
Some functions defined in 'arch/powerpc/platforms/powernv' are
deserving of an `__init` macro attribute. These functions are only
called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-12-nick.child@ibm.com
Some functions defined in 'arch/powerpc/platforms/powermac` are only
called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-11-nick.child@ibm.com
`xmon_register_spus` defined in 'arch/powerpc/xmon' is deserving of an
`__init` macro attribute. This functions is only called by other
initialization functions and therefore should inherit the attribute.
Also, change the function declaration in the header file to include
`__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-7-nick.child@ibm.com
Some files functions in 'arch/powerpc/sysdev' are deserving of an `__init`
macro attribute. These functions are only called by other initialization
functions and therefore should inherit the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-6-nick.child@ibm.com
Some functions defined in 'arch/powerpc/perf' are deserving of an
`__init` macro attribute. These functions are only called by other
initialization functions and therefore should inherit the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-5-nick.child@ibm.com
Some functions defined in 'arch/powerpc/mm' are deserving of an
`__init` macro attribute. These functions are only called by other
initialization functions and therefore should inherit the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-4-nick.child@ibm.com
Some functions defined in 'arch/powerpc/lib' are deserving of an `__init`
macro attribute. These functions are only called by other initialization
functions and therefore should inherit the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-3-nick.child@ibm.com
Some functions defined in `arch/powerpc/kernel` (and one in `arch/powerpc/
kexec`) are deserving of an `__init` macro attribute. These functions are
only called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-2-nick.child@ibm.com
Fix conflicts between memslot overhaul and commit 511d25d6b7 ("KVM:
PPC: Book3S: Suppress warnings when allocating too big memory slots")
from the powerpc tree.
Move the assertions requiring restart table searches under
CONFIG_PPC_IRQ_SOFT_MASK_DEBUG.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-6-npiggin@gmail.com
Enabling MSR[EE] in interrupt handlers while interrupts are still soft
masked allows PMIs to profile interrupt handlers to some degree, beyond
what SIAR latching allows.
When perf is not being used, this is almost useless work. It requires an
extra mtmsrd in the irq handler, and it also opens the door to masked
interrupts hitting and requiring replay, which is more expensive than
just taking them directly. This effect can be noticable in high IRQ
workloads.
Avoid enabling MSR[EE] unless perf is currently in use. This saves about
60 cycles (or 8%) on a simple decrementer interrupt microbenchmark.
Replayed interrupts drop from 1.4% of all interrupts taken, to 0.003%.
This does prevent the soft-nmi interrupt being taken in these handlers,
but that's not too reliable anyway. The SMP watchdog will continue to be
the reliable way to catch lockups.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-5-npiggin@gmail.com
Interrupt code enables MSR[EE] in some irq handlers while keeping local
irqs disabled via soft-mask, allowing PMI interrupts to be taken as
soft-NMI to improve profiling of irq handlers.
When perf is not enabled, there is no point to doing this, it's
additional overhead. So provide a function that can say if PMIs should
be taken promptly if possible.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-4-npiggin@gmail.com
The mtmsrd to enable MSR[RI] can be combined with the mtmsrd to enable
MSR[EE] in interrupt entry code, for those interrupts which enable EE.
This helps performance of important synchronous interrupts (e.g., page
faults).
This is similar to what commit dd152f70bd ("powerpc/64s: system call
avoid setting MSR[RI] until we set MSR[EE]") does for system calls.
Do this by enabling EE and RI together at the beginning of the entry
wrapper if PACA_IRQ_HARD_DIS is clear, and only enabling RI if it is
set.
Asynchronous interrupts set PACA_IRQ_HARD_DIS, but synchronous ones
leave it unchanged, so by default they always get EE=1 unless they have
interrupted a caller that is hard disabled. When the sync interrupt
later calls interrupt_cond_local_irq_enable(), it will not require
another mtmsrd because MSR[EE] was already enabled here.
This avoids one mtmsrd L=1 for synchronous interrupts on 64s, which
saves about 20 cycles on POWER9. And for kernel-mode interrupts, both
synchronous and asynchronous, this saves an additional 40 cycles due to
the mtmsrd being moved ahead of mfspr SPRN_AMR, which prevents a SPR
scoreboard stall.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-3-npiggin@gmail.com
Make synchronous interrupt handler entry wrappers enable MSR[EE] if
MSR[EE] was enabled in the interrupted context. IRQs are soft-disabled
at this point so there is no change to high level code, but it's a
masked interrupt could fire.
This is a performance disadvantage for interrupts which do not later
call interrupt_cond_local_irq_enable(), because an an additional mtmsrd
or wrtee instruction is executed. However the important synchronous
interrupts (e.g., page fault) do enable interrupts, so the performance
disadvantage is mostly avoided.
In the next patch, MSR[RI] enabling can be combined with MSR[EE]
enabling, which mitigates the performance drop for the former and gives
a performance advanage for the latter interrupts, on 64s machines. 64e
is coming along for the ride for now to avoid divergences with 64s in
this tricky code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-2-npiggin@gmail.com
copy_inst_from_kernel_nofault() uses copy_from_kernel_nofault() to
copy one or two 32bits words. This means calling an out-of-line
function which itself calls back copy_from_kernel_nofault_allowed()
then performs a generic copy with loops.
Rewrite copy_inst_from_kernel_nofault() to do everything at a
single place and use __get_kernel_nofault() directly to perform
single accesses without loops.
Allthough the generic function uses pagefault_disable(), it is not
required on powerpc because do_page_fault() bails earlier when a
kernel mode fault happens on a kernel address.
As the function has now become very small, inline it.
With this change, on an 8xx the time spent in the loop in
ftrace_replace_code() is reduced by 23% at function tracer activation
and 27% at nop tracer activation.
The overall time to activate function tracer (measured with shell
command 'time') is 570ms before the patch and 470ms after the patch.
Even vmlinux size is reduced (by 152 instruction).
Before the patch:
00000018 <copy_inst_from_kernel_nofault>:
18: 94 21 ff e0 stwu r1,-32(r1)
1c: 7c 08 02 a6 mflr r0
20: 38 a0 00 04 li r5,4
24: 93 e1 00 1c stw r31,28(r1)
28: 7c 7f 1b 78 mr r31,r3
2c: 38 61 00 08 addi r3,r1,8
30: 90 01 00 24 stw r0,36(r1)
34: 48 00 00 01 bl 34 <copy_inst_from_kernel_nofault+0x1c>
34: R_PPC_REL24 copy_from_kernel_nofault
38: 2c 03 00 00 cmpwi r3,0
3c: 40 82 00 0c bne 48 <copy_inst_from_kernel_nofault+0x30>
40: 81 21 00 08 lwz r9,8(r1)
44: 91 3f 00 00 stw r9,0(r31)
48: 80 01 00 24 lwz r0,36(r1)
4c: 83 e1 00 1c lwz r31,28(r1)
50: 38 21 00 20 addi r1,r1,32
54: 7c 08 03 a6 mtlr r0
58: 4e 80 00 20 blr
After the patch (before inlining):
00000018 <copy_inst_from_kernel_nofault>:
18: 3d 20 b0 00 lis r9,-20480
1c: 7c 04 48 40 cmplw r4,r9
20: 7c 69 1b 78 mr r9,r3
24: 41 80 00 14 blt 38 <copy_inst_from_kernel_nofault+0x20>
28: 81 44 00 00 lwz r10,0(r4)
2c: 38 60 00 00 li r3,0
30: 91 49 00 00 stw r10,0(r9)
34: 4e 80 00 20 blr
38: 38 60 ff de li r3,-34
3c: 4e 80 00 20 blr
40: 38 60 ff f2 li r3,-14
44: 4e 80 00 20 blr
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Add clang workaround, with version check as suggested by Nathan]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0d5b12183d5176dd702d29ad94c39c384e51c78f.1638208156.git.christophe.leroy@csgroup.eu
Because of circular inclusion of asm/hw_breakpoint.h, we
need to move definition of asm/reg.h outside of inst.h
so that asm/hw_breakpoint.h gets it without including
asm/inst.h
Also remove asm/inst.h from asm/uprobes.h as it's not
needed anymore.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4b79f1491118af96b1ac0735e74aeca02ea4c04e.1638208156.git.christophe.leroy@csgroup.eu
Unlike PPC64 ABI, PPC32 uses the stack to pass a parameter defined
as a struct, even when the struct has a single simple element.
To avoid that, define ppc_inst_t as u32 on PPC32.
Keep it as 'struct ppc_inst' when __CHECKER__ is defined so that
sparse can perform type checking.
Also revert commit 511eea5e2c ("powerpc/kprobes: Fix Oops by passing
ppc_inst as a pointer to emulate_step() on ppc32") as now the
instruction to be emulated is passed as a register to emulate_step().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c6d0c46f598f76ad0b0a88bc0d84773bd921b17c.1638208156.git.christophe.leroy@csgroup.eu
On booke/40x we don't have segments like book3s/32.
On booke/40x we don't have access protection groups like 8xx.
Use the PID register to provide user access protection.
Kernel address space can be accessed with any PID.
User address space has to be accessed with the PID of the user.
User PID is always not null.
Everytime the kernel is entered, set PID register to 0 and
restore PID register when returning to user.
Everytime kernel needs to access user data, PID is restored
for the access.
In TLB miss handlers, check the PID and bail out to data storage
exception when PID is 0 and accessed address is in user space.
Note that also forbids execution of user text by kernel except
when user access is unlocked. But this shouldn't be a problem
as the kernel is not supposed to ever run user text.
This patch prepares the infrastructure but the real activation of KUAP
is done by following patches for each processor type one by one.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5d65576a8e31e9480415785a180c92dd4e72306d.1634627931.git.christophe.leroy@csgroup.eu
Also call kuap_lock() and kuap_save_and_lock() from
interrupt functions with CONFIG_PPC64.
For book3s/64 we keep them empty as it is done in assembly.
Also do the locked assert when switching task unless it is
book3s/64.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1cbf94e26e6d6e2e028fd687588a7e6622d454a6.1634627931.git.christophe.leroy@csgroup.eu
We have many functionnalities common to 40x and BOOKE, it leads to
many places with #if defined(CONFIG_BOOKE) || defined(CONFIG_40x).
We are going to add a few more with KUAP for booke/40x, so create
a new symbol which is defined when either BOOKE or 40x is defined.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9a3dbd60924cb25c9f944d3d8205ac5a0d15e229.1634627931.git.christophe.leroy@csgroup.eu
Add kuap_lock() and call it when entering interrupts from user.
It is called kuap_lock() as it is similar to kuap_save_and_lock()
without the save.
However book3s/32 already have a kuap_lock(). Rename it
kuap_lock_addr().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4437e2deb9f6f549f7089d45e9c6f96a7e77905a.1634627931.git.christophe.leroy@csgroup.eu
Make the following functions generic to all platforms.
- bad_kuap_fault()
- kuap_assert_locked()
- kuap_save_and_lock() (PPC32 only)
- kuap_kernel_restore()
- kuap_get_and_assert_locked()
And for all platforms except book3s/64
- allow_user_access()
- prevent_user_access()
- prevent_user_access_return()
- restore_user_access()
Prepend __ in front of the name of platform specific ones.
For now the generic just calls the platform specific, but
next patch will move redundant parts of specific functions
into the generic one.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/eaef143a8dae7288cd34565ffa7b49c16aee1ec3.1634627931.git.christophe.leroy@csgroup.eu
Deactivating KUEP at boot time is unrelevant for PPC32 and BOOK3E/64.
Remove it.
It allows to refactor setup_kuep() via a __weak function
that only PPC64s will overide for now.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Fix CONFIG_PPC_BOOKS_64 -> CONFIG_PPC_BOOK3S_64 typo]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4c36df18b41c988c4512f45d96220486adbe4c99.1634627931.git.christophe.leroy@csgroup.eu
Calling 'mfsr' to get the content of segment registers is heavy,
in addition it requires clearing of the 'reserved' bits.
In order to avoid this operation, save it in mm context and in
thread struct.
The saved sr0 is the one used by kernel, this means that on
locking entry it can be used as is.
For unlocking, the only thing to do is to clear SR_NX.
This improves null_syscall selftest by 12 cycles, ie 4%.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b02baf2ed8f09bad910dfaeeb7353b2ae6830525.1634627931.git.christophe.leroy@csgroup.eu
When interrupt and syscall entries where converted to C, KUEP locking
and unlocking was also converted. It improved performance by unrolling
the loop, and allowed easily implementing boot time deactivation of
KUEP.
However, null_syscall selftest shows that KUEP is still heavy
(361 cycles with KUEP, 212 cycles without).
A way to improve more is to group 'mtsr's together, instead of
repeating 'addi' + 'mtsr' several times.
In order to do that, more registers need to be available. In C, GCC
will always be able to provide the requested number of registers, but
at the cost of saving some data on the stack, which is counter
performant here.
So let's do it in assembly, when we have full control of which
register can be used. It also has the advantage of locking earlier
and unlocking later and it helps GCC generating less tricky code.
The only drawback is to make boot time deactivation less straight
forward and require 'hand' instruction patching.
Group 'mtsr's by 4.
With this change, null_syscall selftest reports 336 cycles. Without
the change it was 361 cycles, that's a 7% reduction.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/115cb279e9b9948dfd93a065e047081c59e3a2a6.1634627931.git.christophe.leroy@csgroup.eu
On the 8xx, there is absolutely no runtime impact with KUEP. Protection
against execution of user code in kernel mode is set up at boot time
by configuring the groups with contain all user pages as having swapped
protection rights, in extenso EX for user and NA for supervisor.
Configure KUEP at startup and force selection of CONFIG_PPC_KUEP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2129e86944323ffe9ed07fffbeafdfd2e363690a.1634627931.git.christophe.leroy@csgroup.eu
This reverts commit 1791ebd131.
setup_kup() was inlined to manage conflict between PPC32 marking
setup_{kuap/kuep}() __init and PPC64 not marking them __init.
But in fact PPC32 has removed the __init mark for all but 8xx
in order to properly handle SMP.
In order to make setup_kup() grow a bit, revert the commit
mentioned above but remove __init for 8xx as well so that
we don't have to mark setup_kup() as __ref.
Also switch the order so that KUAP is initialised before KUEP
because on the 40x, KUEP will depend on the activation of KUAP.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/7691088fd0994ee3c8db6298dc8c00259e3f6a7f.1634627931.git.christophe.leroy@csgroup.eu
Compiling out hash support code when CONFIG_PPC_64S_HASH_MMU=n saves
128kB kernel image size (90kB text) on powernv_defconfig minus KVM,
350kB on pseries_defconfig minus KVM, 40kB on a tiny config.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup defined(ARCH_HAS_MEMREMAP_COMPAT_ALIGN), which needs CONFIG.
Fix radix_enabled() use in setup_initial_memory_limit(). Add some
stubs to reduce number of ifdefs.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-18-npiggin@gmail.com
This adds Kconfig selection which allows 64s hash MMU support to be
disabled. It can be disabled if radix support is enabled, the minimum
supported CPU type is POWER9 (or higher), and KVM is not selected.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-17-npiggin@gmail.com
To avoid any functional changes to radix paths when building with hash
MMU support disabled (and CONFIG_PPC_MM_SLICES=n), always define the
arch get_unmapped_area calls on 64s platforms.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-16-npiggin@gmail.com
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are
nops.
No functional change intended.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not define/reference kvm_vcpu.wait if __KVM_HAVE_ARCH_WQP is true, and
instead force the architecture (PPC) to define its own rcuwait object.
Allowing common KVM to directly access vcpu->wait without a guard makes
it all too easy to introduce potential bugs, e.g. kvm_vcpu_block(),
kvm_vcpu_on_spin(), and async_pf_execute() all operate on vcpu->wait, not
the result of kvm_arch_vcpu_get_wait(), and so may do the wrong thing for
PPC.
Due to PPC's shenanigans with respect to callbacks and waits (it switches
to the virtual core's wait object at KVM_RUN!?!?), it's not clear whether
or not this fixes any bugs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For PPC HV, get the number of pages directly from the new memslot instead
of computing the same from the userspace memory region, and explicitly
check for !DELETE instead of inferring the same when toggling mmio_update.
The motivation for these changes is to avoid referencing the @mem param
so that it can be dropped in a future commit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <1e97fb5198be25f98ef82e63a8d770c682264cc9.1638817639.git.maciej.szmigiero@oracle.com>
Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary. This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.
For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY. This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.
For x86, copy the arch data only for FLAGS_ONLY changes. Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.
Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
The printk header file includes ratelimit_types.h for its __ratelimit()
based usage. It is required for the static initializer used in
printk_ratelimited(). It uses a raw_spinlock_t and includes the
spinlock_types.h.
PREEMPT_RT substitutes spinlock_t with a rtmutex based implementation and so
its spinlock_t implmentation (provided by spinlock_rt.h) includes rtmutex.h and
atomic.h which leads to recursive includes where defines are missing.
By including only the raw_spinlock_t defines it avoids the atomic.h
related includes at this stage.
An example on powerpc:
| CALL scripts/atomic/check-atomics.sh
|In file included from include/linux/bug.h:5,
| from include/linux/page-flags.h:10,
| from kernel/bounds.c:10:
|arch/powerpc/include/asm/page_32.h: In function âclear_pageâ:
|arch/powerpc/include/asm/bug.h:87:4: error: implicit declaration of function â=80=98__WARNâ=80=99 [-Werror=3Dimplicit-function-declaration]
| 87 | __WARN(); \
| | ^~~~~~
|arch/powerpc/include/asm/page_32.h:48:2: note: in expansion of macro âWARN_ONâ=99
| 48 | WARN_ON((unsigned long)addr & (L1_CACHE_BYTES - 1));
| | ^~~~~~~
|arch/powerpc/include/asm/bug.h:58:17: error: invalid application of âsizeofâ=99 to incomplete type âstruct bug_entryâ=99
| 58 | "i" (sizeof(struct bug_entry)), \
| | ^~~~~~
|arch/powerpc/include/asm/bug.h:89:3: note: in expansion of macro âBUG_ENTRYâ=99
| 89 | BUG_ENTRY(PPC_TLNEI " %4, 0", \
| | ^~~~~~~~~
|arch/powerpc/include/asm/page_32.h:48:2: note: in expansion of macro âWARN_ONâ=99
| 48 | WARN_ON((unsigned long)addr & (L1_CACHE_BYTES - 1));
| | ^~~~~~~
|In file included from arch/powerpc/include/asm/ptrace.h:298,
| from arch/powerpc/include/asm/hw_irq.h:12,
| from arch/powerpc/include/asm/irqflags.h:12,
| from include/linux/irqflags.h:16,
| from include/asm-generic/cmpxchg-local.h:6,
| from arch/powerpc/include/asm/cmpxchg.h:526,
| from arch/powerpc/include/asm/atomic.h:11,
| from include/linux/atomic.h:7,
| from include/linux/rwbase_rt.h:6,
| from include/linux/rwlock_types.h:55,
| from include/linux/spinlock_types.h:74,
| from include/linux/ratelimit_types.h:7,
| from include/linux/printk.h:10,
| from include/asm-generic/bug.h:22,
| from arch/powerpc/include/asm/bug.h:109,
| from include/linux/bug.h:5,
| from include/linux/page-flags.h:10,
| from kernel/bounds.c:10:
|include/linux/thread_info.h: In function â=80=98copy_overflowâ=80=99:
|include/linux/thread_info.h:210:2: error: implicit declaration of function â=80=98WARNâ=80=99 [-Werror=3Dimplicit-function-declaration]
| 210 | WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
| | ^~~~
The WARN / BUG include pulls in printk.h and then ptrace.h expects WARN
(from bug.h) which is not yet complete. Even hw_irq.h has WARN_ON()
statements.
On POWERPC64 there are missing atomic64 defines while building 32bit
VDSO:
| VDSO32C arch/powerpc/kernel/vdso32/vgettimeofday.o
|In file included from include/linux/atomic.h:80,
| from include/linux/rwbase_rt.h:6,
| from include/linux/rwlock_types.h:55,
| from include/linux/spinlock_types.h:74,
| from include/linux/ratelimit_types.h:7,
| from include/linux/printk.h:10,
| from include/linux/kernel.h:19,
| from arch/powerpc/include/asm/page.h:11,
| from arch/powerpc/include/asm/vdso/gettimeofday.h:5,
| from include/vdso/datapage.h:137,
| from lib/vdso/gettimeofday.c:5,
| from <command-line>:
|include/linux/atomic-arch-fallback.h: In function âarch_atomic64_incâ=99:
|include/linux/atomic-arch-fallback.h:1447:2: error: implicit declaration of function âarch_atomic64_addâ; did you mean âarch_atomic_addâ? [-Werror=3Dimpl
|icit-function-declaration]
| 1447 | arch_atomic64_add(1, v);
| | ^~~~~~~~~~~~~~~~~
| | arch_atomic_add
The generic fallback is not included, atomics itself are not used. If
kernel.h does not include printk.h then it comes later from the bug.h
include.
Allow asm/spinlock_types.h to be included from
linux/spinlock_types_raw.h.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211129174654.668506-12-bigeasy@linutronix.de
slb.c is hash-specific SLB management, but do_bad_slb_fault deals with
segment interrupts that occur with radix MMU as well.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-5-npiggin@gmail.com
The pseries platform does not use the native hash code but the PAPR
virtualised hash interfaces, so remove PPC_HASH_MMU_NATIVE.
This requires moving tlbiel code from hash_native.c to hash_utils.c.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-4-npiggin@gmail.com
FW_FEATURE_NATIVE_ALWAYS and FW_FEATURE_NATIVE_POSSIBLE are always
zero and never do anything. Remove them.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-2-npiggin@gmail.com
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add a struct_group() for the spe registers so that memset() can correctly reason
about the size:
In function 'fortify_memset_chk',
inlined from 'restore_user_regs.part.0' at arch/powerpc/kernel/signal_32.c:539:3:
>> include/linux/fortify-string.h:195:4: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
195 | __write_overflow_field();
| ^~~~~~~~~~~~~~~~~~~~~~~~
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211118203604.1288379-1-keescook@chromium.org
Allow the LPID bit width and partition table size to be set at runtime
from the device tree.
Move the PID bit width detection into the same place.
KVM does not support using the extra bits yet, this is mainly required
to get the PTCR register values correct (so KVM will run but it will
not allocate > 4096 LPIDs).
OPAL firmware provides this property for POWER10 CPUs since skiboot
commit 9b85f7d961f2 ("hdata: add mmu-pid-bits and mmu-lpid-bits for
POWER10 CPUs").
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211129030915.1888332-1-npiggin@gmail.com
Running perf fuzzer showed below in dmesg logs:
"Can't find PMC that caused IRQ"
This means a PMU exception happened, but none of the PMC's (Performance
Monitor Counter) were found to be overflown. There are some corner cases
that clears the PMCs after PMI gets masked. In such cases, the perf
interrupt handler will not find the active PMC values that had caused
the overflow and thus leads to this message while replaying.
Case 1: PMU Interrupt happens during replay of other interrupts and
counter values gets cleared by PMU callbacks before replay:
During replay of interrupts like timer, __do_irq() and doorbell
exception, we conditionally enable interrupts via may_hard_irq_enable().
This could potentially create a window to generate a PMI. Since irq soft
mask is set to ALL_DISABLED, the PMI will get masked here. We could get
IPIs run before perf interrupt is replayed and the PMU events could
be deleted or stopped. This will change the PMU SPR values and resets
the counters. Snippet of ftrace log showing PMU callbacks invoked in
__do_irq():
<idle>-0 [051] dns. 132025441306354: __do_irq <-call_do_irq
<idle>-0 [051] dns. 132025441306430: irq_enter <-__do_irq
<idle>-0 [051] dns. 132025441306503: irq_enter_rcu <-__do_irq
<idle>-0 [051] dnH. 132025441306599: xive_get_irq <-__do_irq
<<>>
<idle>-0 [051] dnH. 132025441307770: generic_smp_call_function_single_interrupt <-smp_ipi_demux_relaxed
<idle>-0 [051] dnH. 132025441307839: flush_smp_call_function_queue <-smp_ipi_demux_relaxed
<idle>-0 [051] dnH. 132025441308057: _raw_spin_lock <-event_function
<idle>-0 [051] dnH. 132025441308206: power_pmu_disable <-perf_pmu_disable
<idle>-0 [051] dnH. 132025441308337: power_pmu_del <-event_sched_out
<idle>-0 [051] dnH. 132025441308407: power_pmu_read <-power_pmu_del
<idle>-0 [051] dnH. 132025441308477: read_pmc <-power_pmu_read
<idle>-0 [051] dnH. 132025441308590: isa207_disable_pmc <-power_pmu_del
<idle>-0 [051] dnH. 132025441308663: write_pmc <-power_pmu_del
<idle>-0 [051] dnH. 132025441308787: power_pmu_event_idx <-perf_event_update_userpage
<idle>-0 [051] dnH. 132025441308859: rcu_read_unlock_strict <-perf_event_update_userpage
<idle>-0 [051] dnH. 132025441308975: power_pmu_enable <-perf_pmu_enable
<<>>
<idle>-0 [051] dnH. 132025441311108: irq_exit <-__do_irq
<idle>-0 [051] dns. 132025441311319: performance_monitor_exception <-replay_soft_interrupts
Case 2: PMI's masked during local_* operations, example local_add(). If
the local_add() operation happens within a local_irq_save(), replay of
PMI will be during local_irq_restore(). Similar to case 1, this could
also create a window before replay where PMU events gets deleted or
stopped.
Fix it by updating the PMU callback function power_pmu_disable() to
check for pending perf interrupt. If there is an overflown PMC and
pending perf interrupt indicated in paca, clear the PMI bit in paca to
drop that sample. Clearing of PMI bit is done in power_pmu_disable()
since disable is invoked before any event gets deleted/stopped. With
this fix, if there are more than one event running in the PMU, there is
a chance that we clear the PMI bit for the event which is not getting
deleted/stopped. The other events may still remain active. Hence to make
sure we don't drop valid sample in such cases, another check is added in
power_pmu_enable. This checks if there is an overflown PMC found among
the active events and if so enable back the PMI bit. Two new helper
functions are introduced to clear/set the PMI, ie
clear_pmi_irq_pending() and set_pmi_irq_pending(). Helper function
pmi_irq_pending() is introduced to give a warning if there is pending
PMI bit in paca, but no PMC is overflown.
Also there are corner cases which result in performance monitor
interrupts being triggered during power_pmu_disable(). This happens
since PMXE bit is not cleared along with disabling of other MMCR0 bits
in the pmu_disable. Such PMI's could leave the PMU running and could
trigger PMI again which will set MMCR0 PMAO bit. This could lead to
spurious interrupts in some corner cases. Example, a timer after
power_pmu_del() which will re-enable interrupts and triggers a PMI again
since PMAO bit is still set. But fails to find valid overflow since PMC
was cleared in power_pmu_del(). Fix that by disabling PMXE along with
disabling of other MMCR0 bits in power_pmu_disable().
We can't just replay PMI any time. Hence this approach is preferred
rather than replaying PMI before resetting overflown PMC. Patch also
documents core-book3s on a race condition which can trigger these PMC
messages during idle path in PowerNV.
Fixes: f442d00480 ("powerpc/64s: Add support to mask perf interrupts and replay them")
Reported-by: Nageswara R Sastry <nasastry@in.ibm.com>
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Suggested-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Make pmi_irq_pending() return bool, reflow/reword some comments]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1626846509-1350-2-git-send-email-atrajeev@linux.vnet.ibm.com
Now that atomic_add() and atomic_sub() handle immediate operands,
atomic_inc() and atomic_dec() have no added value compared to the
generic fallback which calls atomic_add(1) and atomic_sub(1).
Also remove atomic_inc_not_zero() which fallsback to
atomic_add_unless() which itself fallsback to
atomic_fetch_add_unless() which now handles immediate operands.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/0bc64a2f18726055093dbb2e479cefc60a409cfd.1632236981.git.christophe.leroy@csgroup.eu
Today we get the following code generation for bitops like
set or clear bit:
c0009fe0: 39 40 08 00 li r10,2048
c0009fe4: 7c e0 40 28 lwarx r7,0,r8
c0009fe8: 7c e7 53 78 or r7,r7,r10
c0009fec: 7c e0 41 2d stwcx. r7,0,r8
c000d568: 39 00 18 00 li r8,6144
c000d56c: 7c c0 38 28 lwarx r6,0,r7
c000d570: 7c c6 40 78 andc r6,r6,r8
c000d574: 7c c0 39 2d stwcx. r6,0,r7
Most set bits are constant on lower 16 bits, so it can easily
be replaced by the "immediate" version of the operation. Allow
GCC to choose between the normal or immediate form.
For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' for
when all bits to be cleared are consecutive.
On 64 bits we don't have any equivalent single operation for clearing,
single bits or a few bits, we'd need two 'rldicl' so it is not
worth it, the li/andc sequence is doing the same.
With this patch we get:
c0009fe0: 7d 00 50 28 lwarx r8,0,r10
c0009fe4: 61 08 08 00 ori r8,r8,2048
c0009fe8: 7d 00 51 2d stwcx. r8,0,r10
c000d558: 7c e0 40 28 lwarx r7,0,r8
c000d55c: 54 e7 05 64 rlwinm r7,r7,0,21,18
c000d560: 7c e0 41 2d stwcx. r7,0,r8
On pmac32_defconfig, it reduces the text by approx 10 kbytes.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e6f815d9181bab09df3b350af51149437863e9f9.1632236981.git.christophe.leroy@csgroup.eu
Introduce macros that operate on a (start, end) range of GPRs, which
reduces lines of code and need to do mental arithmetic while reading the
code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211022061322.2671178-1-npiggin@gmail.com