With NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, we need a function to
populate pte, this patch adds a generic pcpu populate pte function,
pcpu_populate_pte(), which is marked __weak and used on most
architectures, but it is overridden on x86, which has its own
implementation.
Link: https://lkml.kernel.org/r/20211216112359.103822-5-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With the previous patch, we could add a generic pcpu first chunk
allocate and free function to cleanup the duplicated definations on each
architecture.
Link: https://lkml.kernel.org/r/20211216112359.103822-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add pcpu_fc_cpu_to_node_fn_t and pass it into pcpu_fc_alloc_fn_t, pcpu
first chunk allocation will call it to alloc memblock on the
corresponding node by it, this is prepare for the next patch.
Link: https://lkml.kernel.org/r/20211216112359.103822-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 314f6c23dd ("powerpc/64s: Mask NIP before checking against
SRR0") masked off the low 2 bits of the NIP value in the interrupt
stack frame in case they are non-zero and mis-compare against a SRR0
register value of a CPU which always reads back 0 from the 2 low bits
which are reserved.
This now causes the opposite problem that an implementation which does
implement those bits in SRR0 will mis-compare against the masked NIP
value in which they have been cleared. QEMU is one such implementation,
and this is allowed by the architecture.
This can be triggered by sigfuz by setting low bits of PT_NIP in the
signal context.
Fix this for now by masking the SRR0 bits as well. Cleaner is probably
to sanitise these values before putting them in registers or stack, but
this is the quick and backportable fix.
Fixes: 314f6c23dd ("powerpc/64s: Mask NIP before checking against SRR0")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220117134403.2995059-1-npiggin@gmail.com
Pull signal/exit/ptrace updates from Eric Biederman:
"This set of changes deletes some dead code, makes a lot of cleanups
which hopefully make the code easier to follow, and fixes bugs found
along the way.
The end-game which I have not yet reached yet is for fatal signals
that generate coredumps to be short-circuit deliverable from
complete_signal, for force_siginfo_to_task not to require changing
userspace configured signal delivery state, and for the ptrace stops
to always happen in locations where we can guarantee on all
architectures that the all of the registers are saved and available on
the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task
are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues
reported were small enough that they would not affect bisection so I
simply added the fixes and did not fold the fixes into the changes
they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I
dropped the change that caused that bug and replaced it entirely with
something much more restrained. Unfortunately that required some
rebasing.
Some successes after this set of changes: There are few enough calls
to do_exit to audit in a reasonable amount of time. The lifetime of
struct kthread now matches the lifetime of struct task, and the
pointer to struct kthread is no longer stored in set_child_tid. The
flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is
removed. Issues where task->exit_code was examined with
signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am
cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at:
https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.orghttps://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct
once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits)
ptrace/m68k: Stop open coding ptrace_report_syscall
ptrace: Remove unused regs argument from ptrace_report_syscall
ptrace: Remove second setting of PT_SEIZED in ptrace_attach
taskstats: Cleanup the use of task->exit_code
exit: Use the correct exit_code in /proc/<pid>/stat
exit: Fix the exit_code for wait_task_zombie
exit: Coredumps reach do_group_exit
exit: Remove profile_handoff_task
exit: Remove profile_task_exit & profile_munmap
signal: clean up kernel-doc comments
signal: Remove the helper signal_group_exit
signal: Rename group_exit_task group_exec_task
coredump: Stop setting signal->group_exit_task
signal: Remove SIGNAL_GROUP_COREDUMP
signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process
signal: Make coredump handling explicit in complete_signal
signal: Have prepare_signal detect coredumps using signal->core_state
signal: Have the oom killer detect coredumps using signal->core_state
exit: Move force_uaccess back into do_exit
exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit
...
Merge misc updates from Andrew Morton:
"146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
mm/damon: hide kernel pointer from tracepoint event
mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
mm/damon/dbgfs: remove an unnecessary variable
mm/damon: move the implementation of damon_insert_region to damon.h
mm/damon: add access checking for hugetlb pages
Docs/admin-guide/mm/damon/usage: update for schemes statistics
mm/damon/dbgfs: support all DAMOS stats
Docs/admin-guide/mm/damon/reclaim: document statistics parameters
mm/damon/reclaim: provide reclamation statistics
mm/damon/schemes: account how many times quota limit has exceeded
mm/damon/schemes: account scheme actions that successfully applied
mm/damon: remove a mistakenly added comment for a future feature
Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
Docs/admin-guide/mm/damon/usage: remove redundant information
Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
mm/damon: convert macro functions to static inline functions
mm/damon: modify damon_rand() macro to static inline function
mm/damon: move damon_rand() definition into damon.h
...
- Optimise radix KVM guest entry/exit by 2x on Power9/Power10.
- Allow firmware to tell us whether to disable the entry and uaccess flushes on Power10
or later CPUs.
- Add BPF_PROBE_MEM support for 32 and 64-bit BPF jits.
- Several fixes and improvements to our hard lockup watchdog.
- Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on 32-bit.
- Allow building the 64-bit Book3S kernel without hash MMU support, ie. Radix only.
- Add KUAP (SMAP) support for 40x, 44x, 8xx, Book3E (64-bit).
- Add new encodings for perf_mem_data_src.mem_hops field, and use them on Power10.
- A series of small performance improvements to 64-bit interrupt entry.
- Several commits fixing issues when building with the clang integrated assembler.
- Many other small features and fixes.
Thanks to: Alan Modra, Alexey Kardashevskiy, Ammar Faizi, Anders Roxell, Arnd Bergmann,
Athira Rajeev, Cédric Le Goater, Christophe JAILLET, Christophe Leroy, Christoph Hellwig,
Daniel Axtens, David Yang, Erhard Furtner, Fabiano Rosas, Greg Kroah-Hartman, Guo Ren,
Hari Bathini, Jason Wang, Joel Stanley, Julia Lawall, Kajol Jain, Kees Cook, Laurent
Dufour, Madhavan Srinivasan, Mark Brown, Minghao Chi, Nageswara R Sastry, Naresh Kamboju,
Nathan Chancellor, Nathan Lynch, Nicholas Piggin, Nick Child, Oliver O'Halloran, Peiwei
Hu, Randy Dunlap, Ravi Bangoria, Rob Herring, Russell Currey, Sachin Sant, Sean
Christopherson, Segher Boessenkool, Thadeu Lima de Souza Cascardo, Tyrel Datwyler, Xiang
wangx, Yang Guang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmHhVFMTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgKwzD/9UUEZzWyzMVRJvP9FPZByN2M8czxHJ
tuqEuVqnfks8ad8tfm2ebng5t8ZuVASBQU2fpPA1+lpdvprgZN5RFGMRh729vskn
2aHQPmFvFObNbXOgCoXzk+C5xYi3zoRMVM968neSPBneYo+xDicn/zN5CHAgsjhX
+baemJQ7/xzwLiZgTHe8fWw3nTk3IbPBpha59SdTvR8Moy6I4O8CDPIYEm3U3/J3
x14ZRETqjksL7YOzEBk0avm1dDZRw/johz29oRYSmCj7dyy5OqrkPwokJiRY90eA
1lVdofDc0zElaSWkVGzKdSWRUIXjKIVdtejvDeEvl6H/mI6q4TVZE8rFmn+3Rvgf
9q0iKtmw5Kn11cqgY/pgEGmxnQtIdAodNfI/t939E7+O5LbcznuYUiy0J/kTD/vl
Xduotg2dsCI+5ukf1wrk2wt9LhqZL+ziOeaBhyDM4orV8T3HBYL6zWBptun//IGO
lK6TvvCHSYnGqY4bnrAmiOnbbEtnP6nN3zbcXgSvPM0wCRHPIEqd0NRXtfISo32d
vBPq1neXWo4wrRJj9X3yOuP+5fEA4I+hB3yrCJOkcEcz+8NhlboQXU7raVsJL+bd
kze75H8hwX7kE71oJFFl13LbSNABgiLFARTBXKfvdQA2iLdR0Snvm+OouvwWRPo/
Po7Nm3zqdLc/1A==
=BxhQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Optimise radix KVM guest entry/exit by 2x on Power9/Power10.
- Allow firmware to tell us whether to disable the entry and uaccess
flushes on Power10 or later CPUs.
- Add BPF_PROBE_MEM support for 32 and 64-bit BPF jits.
- Several fixes and improvements to our hard lockup watchdog.
- Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on 32-bit.
- Allow building the 64-bit Book3S kernel without hash MMU support, ie.
Radix only.
- Add KUAP (SMAP) support for 40x, 44x, 8xx, Book3E (64-bit).
- Add new encodings for perf_mem_data_src.mem_hops field, and use them
on Power10.
- A series of small performance improvements to 64-bit interrupt entry.
- Several commits fixing issues when building with the clang integrated
assembler.
- Many other small features and fixes.
Thanks to Alan Modra, Alexey Kardashevskiy, Ammar Faizi, Anders Roxell,
Arnd Bergmann, Athira Rajeev, Cédric Le Goater, Christophe JAILLET,
Christophe Leroy, Christoph Hellwig, Daniel Axtens, David Yang, Erhard
Furtner, Fabiano Rosas, Greg Kroah-Hartman, Guo Ren, Hari Bathini, Jason
Wang, Joel Stanley, Julia Lawall, Kajol Jain, Kees Cook, Laurent Dufour,
Madhavan Srinivasan, Mark Brown, Minghao Chi, Nageswara R Sastry, Naresh
Kamboju, Nathan Chancellor, Nathan Lynch, Nicholas Piggin, Nick Child,
Oliver O'Halloran, Peiwei Hu, Randy Dunlap, Ravi Bangoria, Rob Herring,
Russell Currey, Sachin Sant, Sean Christopherson, Segher Boessenkool,
Thadeu Lima de Souza Cascardo, Tyrel Datwyler, Xiang wangx, and Yang
Guang.
* tag 'powerpc-5.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (240 commits)
powerpc/xmon: Dump XIVE information for online-only processors.
powerpc/opal: use default_groups in kobj_type
powerpc/cacheinfo: use default_groups in kobj_type
powerpc/sched: Remove unused TASK_SIZE_OF
powerpc/xive: Add missing null check after calling kmalloc
powerpc/floppy: Remove usage of the deprecated "pci-dma-compat.h" API
selftests/powerpc: Add a test of sigreturning to an unaligned address
powerpc/64s: Use EMIT_WARN_ENTRY for SRR debug warnings
powerpc/64s: Mask NIP before checking against SRR0
powerpc/perf: Fix spelling of "its"
powerpc/32: Fix boot failure with GCC latent entropy plugin
powerpc/code-patching: Replace patch_instruction() by ppc_inst_write() in selftests
powerpc/code-patching: Move code patching selftests in its own file
powerpc/code-patching: Move instr_is_branch_{i/b}form() in code-patching.h
powerpc/code-patching: Move patch_exception() outside code-patching.c
powerpc/code-patching: Use test_trampoline for prefixed patch test
powerpc/code-patching: Fix patch_branch() return on out-of-range failure
powerpc/code-patching: Reorganise do_patch_instruction() to ease error handling
powerpc/code-patching: Fix unmap_patch_area() error handling
powerpc/code-patching: Fix error handling in do_patch_instruction()
...
Bindings:
- DT schema conversions for Samsung clocks, RNG bindings, Qcom Command
DB and rmtfs, gpio-restart, i2c-mux-gpio, i2c-mux-pinctl, Tegra I2C
and BPMP, pwm-vibrator, Arm DSU, and Cadence macb
- DT schema conversions for Broadcom platforms: interrupt controllers,
STB GPIO, STB waketimer, STB reset, iProc MDIO mux, iProc PCIe,
Cygnus PCIe PHY, PWM, USB BDC, BCM6328 LEDs, TMON, SYSTEMPORT, AMAC,
Northstar 2 PCIe PHY, GENET, moca PHY, GISB arbiter, and SATA
- Add binding schemas for Tegra210 EMC table, TI DC-DC converters,
- Clean-ups of MDIO bus schemas to fix 'unevaluatedProperties' issues
- More fixes due to 'unevaluatedProperties' enabling
- Data type fixes and clean-ups of binding examples found in preparation
to move to validating DTB files directly (instead of intermediate YAML
representation.
- Vendor prefixes for T-Head Semiconductor, OnePlus, and Sunplus
- Add various new compatible strings
DT core:
- Silence a warning for overlapping reserved memory regions
- Reimplement unittest overlay tracking
- Fix stack frame size warning in unittest
- Clean-ups of early FDT scanning functions
- Fix handling of "linux,usable-memory-range" on EFI booted systems
- Add support for 'fail' status on CPU nodes
- Improve error message in of_phandle_iterator_next()
- kbuild: Disable duplicate unit-address warnings for disabled nodes
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmHfCdcQHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw+UZD/0ZMQQ6VF20MW7Gg0bOutd8Q6Q6opjrCG5c
nLW5mv8Q+um3sI1ZpwdMI4zAfCmTfeL13ZM9KtJKlJ0o41bgId+kZsezy4I2rN9+
sE1CwA4TninKTJsUkmyQX4fgJRUZ95Eubryfb07sy7nbK3LZQ+t18R5tzVBDpzy4
7hy4eM6mlMxgIJDi7EUboLZslkMM4TGGutLsk5C5T5V5lcWSt3Jj5WZtl5k4Wykq
j4i9mU+GGTZi0nGAJQ7lNoLPatZDSVQx5tzNV/Wi8hSwZbn0Kycu+IuWZyihILz/
9lzB/7tv8fl+xkTaJ5xxaY05HcDeX02yCLzh3PfAHRYdbQ2EkFoaKqJ81SLfAq5t
aH87v41wFSrjzynxpppqswXOdqI/jofrHrGlQldnw0VHGTjEfDbyZGRQFPHmuzTG
gXaSNKCxppG7ThpXarfu7D4TdYV75n+cBOsC/BBopYgIS2+xmjDA3t5Scks1/4NX
1Hfq9IMF9iYJYc/GNXBWcOrLn9d1ILYt6HrKRQar1NIEFH1Lt0c2aw5WsyvOZ4zx
aLHLSbEwnl+2wleyGB9YQkFaaF7N6qcid3u9KFRJP6nTojoaeQaIi3MR9F3LVReZ
LV5YfWEcij1zc+lzwgHc6+8bbgFxrKgOC2IL/B6u93u/BO0wmF/54kbEZKaLyX8d
a7Iii4IYFw==
=2g8v
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
"Bindings:
- DT schema conversions for Samsung clocks, RNG bindings, Qcom
Command DB and rmtfs, gpio-restart, i2c-mux-gpio, i2c-mux-pinctl,
Tegra I2C and BPMP, pwm-vibrator, Arm DSU, and Cadence macb
- DT schema conversions for Broadcom platforms: interrupt
controllers, STB GPIO, STB waketimer, STB reset, iProc MDIO mux,
iProc PCIe, Cygnus PCIe PHY, PWM, USB BDC, BCM6328 LEDs, TMON,
SYSTEMPORT, AMAC, Northstar 2 PCIe PHY, GENET, moca PHY, GISB
arbiter, and SATA
- Add binding schemas for Tegra210 EMC table, TI DC-DC converters,
- Clean-ups of MDIO bus schemas to fix 'unevaluatedProperties' issues
- More fixes due to 'unevaluatedProperties' enabling
- Data type fixes and clean-ups of binding examples found in
preparation to move to validating DTB files directly (instead of
intermediate YAML representation.
- Vendor prefixes for T-Head Semiconductor, OnePlus, and Sunplus
- Add various new compatible strings
DT core:
- Silence a warning for overlapping reserved memory regions
- Reimplement unittest overlay tracking
- Fix stack frame size warning in unittest
- Clean-ups of early FDT scanning functions
- Fix handling of "linux,usable-memory-range" on EFI booted systems
- Add support for 'fail' status on CPU nodes
- Improve error message in of_phandle_iterator_next()
- kbuild: Disable duplicate unit-address warnings for disabled nodes"
* tag 'devicetree-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (114 commits)
dt-bindings: net: mdio: Drop resets/reset-names child properties
dt-bindings: clock: samsung: convert S5Pv210 to dtschema
dt-bindings: clock: samsung: convert Exynos5410 to dtschema
dt-bindings: clock: samsung: convert Exynos5260 to dtschema
dt-bindings: clock: samsung: extend Exynos7 bindings with UFS
dt-bindings: clock: samsung: convert Exynos7 to dtschema
dt-bindings: clock: samsung: convert Exynos5433 to dtschema
dt-bindings: i2c: maxim,max96712: Add bindings for Maxim Integrated MAX96712
dt-bindings: iio: adi,ltc2983: Fix 64-bit property sizes
dt-bindings: power: maxim,max17040: Fix incorrect type for 'maxim,rcomp'
dt-bindings: interrupt-controller: arm,gic-v3: Fix 'interrupts' cell size in example
dt-bindings: iio/magnetometer: yamaha,yas530: Fix invalid 'interrupts' in example
dt-bindings: clock: imx5: Drop clock consumer node from example
dt-bindings: Drop required 'interrupt-parent'
dt-bindings: net: ti,dp83869: Drop value on boolean 'ti,max-output-impedance'
dt-bindings: net: wireless: mt76: Fix 8-bit property sizes
dt-bindings: PCI: snps,dw-pcie-ep: Drop conflicting 'max-functions' schema
dt-bindings: i2c: st,stm32-i2c: Make each example a separate entry
dt-bindings: net: stm32-dwmac: Make each example a separate entry
dt-bindings: net: Cleanup MDIO node schemas
...
accesing it in order to prevent any potential data races, and convert
all users to those new accessors
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmHcgFoACgkQEsHwGGHe
VUqXeRAAvcNEfFw6BvXeGfFTxKmOrsRtu2WCkAkjvamyhXMCrjBqqHlygLJFCH5i
2mc6HBohzo4vBFcgi3R5tVkGazqlthY1KUM9Jpk7rUuUzi0phTH7n/MafZOm9Es/
BHYcAAyT/NwZRbCN0geccIzBtbc4xr8kxtec7vkRfGDx8B9/uFN86xm7cKAaL62G
UDs0IquDPKEns3A7uKNuvKztILtuZWD1WcSkbOULJzXgLkb+cYKO1Lm9JK9rx8Ds
8tjezrJgOYGLQyyv0i3pWelm3jCZOKUChPslft0opvVUbrNd8piehvOm9CWopHcB
QsYOWchnULTE9o4ZAs/1PkxC0LlFEWZH8bOLxBMTDVEY+xvmDuj1PdBUpncgJbOh
dunHzsvaWproBSYUXA9nKhZWTVGl+CM8Ks7jXjl3IPynLd6cpYZ/5gyBVWEX7q3e
8htG95NzdPPo7doxMiNSKGSmSm0Np1TJ/i89vsYeGfefsvsq53Fyjhu7dIuTWHmU
2YUe6qHs6dF9x1bkHAAZz6T9Hs4BoGQBcXUnooT9JbzVdv2RfTPsrawdu8dOnzV1
RhwCFdFcll0AIEl0T9fCYzUI/Ga8ZS0roXs5NZ4wl0lwr0BGFwiU8WC1FUdGsZo9
0duaa0Tpv0OWt6rIMMB/E9QsqCDsQ4CMHuQpVVw+GOO5ux9kMms=
=v6Xn
-----END PGP SIGNATURE-----
Merge tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull thread_info flag accessor helper updates from Borislav Petkov:
"Add a set of thread_info.flags accessors which snapshot it before
accesing it in order to prevent any potential data races, and convert
all users to those new accessors"
* tag 'core_entry_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
powerpc: Snapshot thread flags
powerpc: Avoid discarding flags in system_call_exception()
openrisc: Snapshot thread flags
microblaze: Snapshot thread flags
arm64: Snapshot thread flags
ARM: Snapshot thread flags
alpha: Snapshot thread flags
sched: Snapshot thread flags
entry: Snapshot thread flags
x86: Snapshot thread flags
thread_info: Add helpers to snapshot thread flags
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME support
(scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and reciprocal
square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP == 1
(potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmHXNtYACgkQa9axLQDI
XvHBGw/+OVGdbORxwrU+uRb7N6qIJkrW/mmM4x1KLo1i+REZLb8/VlXm0xC60FG+
39x6FSVkRr+lLDfTqpQsOez5FpdsvOe9Fc4L3bwniDg+EPo7x65VmP2dw/Ae2q0i
87xyWCczx5hFEPF/1sb1R1pm3bTXjeklBkdv+OXhwflLOwpCp1J8z8WJK8qJVFX6
CmuE6Q4fDQr0ghl9Nf8DiAr20mHDh8wMKNUJOg4waaQOOCta6q1oJ3qfz6E9z1eW
zEE3dfZgBCx7HCRc3KGgzT7H4Ces3BYvhBYP6bJRliVI88XdPiM4MfdGL4UIb27Q
NLAdr+FVzk/YLzMHtxSfkT10nBqoOPWUTckLu9jIIl5cpBX73Wiz7jfzBvqFmC/y
opSFMZ3lwQPM5WAPtAlZptA3GPPySeInVmvUgB7IQ+1Q1T1n8ri1y5hzTYC4Sc/g
amJI1rXf1Al8+2zFBggr6Up+EOnfV9nAwrzLXkRlASsfmvY4dnVWg3NWfBqtEHAq
VuZCecSgawxuSlpmJ4VGbLrBFaz18bn9EzujR5fFvi5Qcg1CMFOROi2+6IynopNV
IS0R8j6fwgQPA5lcnNIPeJRRkQoqO4l8bPDzeXEny0BSw313EgBSo9aQtnjyIJbp
BTuDHARKs+/NvDPvd8GQkxNPgwJnVOL9pdgNAolEu1/k7JtnIS0=
=ecyi
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- KCSAN enabled for arm64.
- Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.
- Some more SVE clean-ups and refactoring in preparation for SME
support (scalable matrix extensions).
- BTI clean-ups (SYM_FUNC macros etc.)
- arm64 atomics clean-up and codegen improvements.
- HWCAPs for FEAT_AFP (alternate floating point behaviour) and
FEAT_RPRESS (increased precision of reciprocal estimate and
reciprocal square root estimate).
- Use SHA3 instructions to speed-up XOR.
- arm64 unwind code refactoring/unification.
- Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP ==
1 (potentially set by a hypervisor; user-space already does this).
- Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.
- Other fixes and clean-ups; highlights: fix the handling of erratum
1418040, correct the calculation of the nomap region boundaries,
introduce io_stop_wc() mapped to the new DGH instruction (data
gathering hint).
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
arm64: Use correct method to calculate nomap region boundaries
arm64: Drop outdated links in comments
arm64: perf: Don't register user access sysctl handler multiple times
drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check
perf/smmuv3: Fix unused variable warning when CONFIG_OF=n
arm64: errata: Fix exec handling in erratum 1418040 workaround
arm64: Unhash early pointer print plus improve comment
asm-generic: introduce io_stop_wc() and add implementation for ARM64
arm64: Ensure that the 'bti' macro is defined where linkage.h is included
arm64: remove __dma_*_area() aliases
docs/arm64: delete a space from tagged-address-abi
arm64: Enable KCSAN
kselftest/arm64: Add pidbench for floating point syscall cases
arm64/fp: Add comments documenting the usage of state restore functions
kselftest/arm64: Add a test program to exercise the syscall ABI
kselftest/arm64: Allow signal tests to trigger from a function
kselftest/arm64: Parameterise ptrace vector length information
arm64/sve: Minor clarification of ABI documentation
arm64/sve: Generalise vector length configuration prctl() for SME
arm64/sve: Make sysctl interface for SVE reusable by SME
...
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the powerpc cacheinfo sysfs code to use default_groups
field which has been the preferred way since aa30f47cf6 ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220104155450.1291277-1-gregkh@linuxfoundation.org
When CONFIG_PPC_RFI_SRR_DEBUG=y we check the SRR values before returning
from interrupts. This is done in asm using EMIT_BUG_ENTRY, and passing
BUGFLAG_WARNING.
However that fails to create an exception table entry for the warning,
and so do_program_check() fails the exception table search and proceeds
to call _exception(), resulting in an oops like:
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 2 PID: 1204 Comm: sigreturn_unali Tainted: P 5.16.0-rc2-00194-g91ca3d4f77c5 #12
NIP: c00000000000c5b0 LR: 0000000000000000 CTR: 0000000000000000
...
NIP [c00000000000c5b0] system_call_common+0x150/0x268
LR [0000000000000000] 0x0
Call Trace:
[c00000000db73e10] [c00000000000c558] system_call_common+0xf8/0x268 (unreliable)
...
Instruction dump:
7cc803a6 888d0931 2c240000 4082001c 38800000 988d0931 e8810170 e8a10178
7c9a03a6 7cbb03a6 7d7a02a6 e9810170 <7f0b6088> 7d7b02a6 e9810178 7f0b6088
We should instead use EMIT_WARN_ENTRY, which creates an exception table
entry for the warning, allowing the warning to be correctly recognised,
and the code to resume after printing the warning.
Note however that because this warning is buried deep in the interrupt
return path, we are not able to recover from it (due to MSR_RI being
clear), so we still end up in die() with an unrecoverable exception.
Fixes: 59dc5bfca0 ("powerpc/64s: avoid reloading (H)SRR registers if they are still valid")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221135101.2085547-2-mpe@ellerman.id.au
When CONFIG_PPC_RFI_SRR_DEBUG=y we check that NIP and SRR0 match when
returning from interrupts. This can trigger falsely if NIP has either of
its two low bits set via sigreturn or ptrace, while SRR0 has its low two
bits masked in hardware.
As a quick fix make sure to mask the low bits before doing the check.
Fixes: 59dc5bfca0 ("powerpc/64s: avoid reloading (H)SRR registers if they are still valid")
Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/20211221135101.2085547-1-mpe@ellerman.id.au
Boot fails with GCC latent entropy plugin enabled.
This is due to early boot functions trying to access 'latent_entropy'
global data while the kernel is not relocated at its final
destination yet.
As there is no way to tell GCC to use PTRRELOC() to access it,
disable latent entropy plugin in early_32.o and feature-fixups.o and
code-patching.o
Fixes: 38addce8b6 ("gcc-plugins: Add latent_entropy plugin")
Cc: stable@vger.kernel.org # v4.9+
Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215217
Link: https://lore.kernel.org/r/2bac55483b8daf5b1caa163a45fa5f9cdbe18be4.1640178426.git.christophe.leroy@csgroup.eu
The dssall ("Data Stream Stop All") instruction is obsolete altogether
with other Data Cache Instructions since ISA 2.03 (year 2006).
LLVM IAS does not support it but PPC970 seems to be using it.
This switches dssall to .long as there is no much point in fixing LLVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-6-aik@ozlabs.ru
The LLVM integrated assembler really does not like us reassigning things
to the same label:
<instantiation>:7:9: error: invalid reassignment of non-absolute variable 'fs_label'
This happens across a bunch of platforms:
https://github.com/ClangBuiltLinux/linux/issues/1043https://github.com/ClangBuiltLinux/linux/issues/1008https://github.com/ClangBuiltLinux/linux/issues/920https://github.com/ClangBuiltLinux/linux/issues/1050
There is no hope of getting this fixed in LLVM (see
https://github.com/ClangBuiltLinux/linux/issues/1043#issuecomment-641571200
and https://bugs.llvm.org/show_bug.cgi?id=47798#c1 )
so if we want to build with LLVM_IAS, we need to hack
around it ourselves.
For us the big problem comes from this:
\#define USE_FIXED_SECTION(sname) \
fs_label = start_##sname; \
fs_start = sname##_start; \
use_ftsec sname;
\#define USE_TEXT_SECTION()
fs_label = start_text; \
fs_start = text_start; \
.text
and in particular fs_label.
This works around it by not setting those 'variables' and requiring
that users of the variables instead track for themselves what section
they are in. This isn't amazing, by any stretch, but it gets us further
in the compilation.
Note that even though users have to keep track of the section, using
a wrong one produces an error with both binutils and llvm which prevents
from using wrong section at the compile time:
llvm error example:
AS arch/powerpc/kernel/head_64.o
<unknown>:0: error: Cannot represent a difference across sections
make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1
binutils error example:
/home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
/home/aik/p/kernels-llvm/llvm/arch/powerpc/kernel/exceptions-64s.S:1974: Error: can't resolve `system_call_common' {.text section} - `start_r
eal_vectors' {.head.text.real_vectors section}
make[3]: *** [/home/aik/p/kernels-llvm/llvm/scripts/Makefile.build:388: arch/powerpc/kernel/head_64.o] Error 1
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-5-aik@ozlabs.ru
This patch future-proofs the kernel against linker changes that might
put the toc pointer at some location other than .got+0x8000, by
replacing __toc_start+0x8000 with .TOC. throughout. If the kernel's
idea of the toc pointer doesn't agree with the linker, bad things
happen.
prom_init.c code relocating its toc is also changed so that a symbolic
__prom_init_toc_start toc-pointer relative address is calculated
rather than assuming that it is always at toc-pointer - 0x8000. The
length calculations loading values from the toc are also avoided.
It's a little incestuous to do that with unreloc_toc picking up
adjusted values (which is fine in practice, they both adjust by the
same amount if all goes well).
I've also changed the way .got is aligned in vmlinux.lds and
zImage.lds, mostly so that dumping out section info by objdump or
readelf plainly shows the alignment is 256. This linker script
feature was added 2005-09-27, available in FSF binutils releases from
2.17 onwards. Should be safe to use in the kernel, I think.
Finally, put *(.got) before the prom_init.o entry which only needs
*(.toc), so that the GOT header goes in the correct place. I don't
believe this makes any difference for the kernel as it would for
dynamic objects being loaded by ld.so. That change is just to stop
lusers who blindly copy kernel scripts being led astray. Of course,
this change needs the prom_init.c changes.
Some notes on .toc and .got.
.toc is a compiler generated section of addresses. .got is a linker
generated section of addresses, generally built when the linker sees
R_*_*GOT* relocations. In the case of powerpc64 ld.bfd, there are
multiple generated .got sections, one per input object file. So you
can somewhat reasonably write in a linker script an input section
statement like *prom_init.o(.got .toc) to mean "the .got and .toc
section for files matching *prom_init.o". On other architectures that
doesn't make sense, because the linker generally has just one .got
section. Even on powerpc64, note well that the GOT entries for
prom_init.o may be merged with GOT entries from other objects. That
means that if prom_init.o references, say, _end via some GOT
relocation, and some other object also references _end via a GOT
relocation, the GOT entry for _end may be in the range
__prom_init_toc_start to __prom_init_toc_end and if the kernel does
something special to GOT/TOC entries in that range then the value of
_end as seen by objects other than prom_init.o will be affected. On
the other hand the GOT entry for _end may not be in the range
__prom_init_toc_start to __prom_init_toc_end. Which way it turns out
is deterministic but a detail of linker operation that should not be
relied on.
A feature of ld.bfd is that input .toc (and .got) sections matching
one linker input section statement may be sorted, to put entries used
by small-model code first, near the toc base. This is why scripts for
powerpc64 normally use *(.got .toc) rather than *(.got) *(.toc), since
the first form allows more freedom to sort.
Another feature of ld.bfd is that indirect addressing sequences using
the GOT/TOC may be edited by the linker to relative addressing. In
many cases relative addressing would be emitted by gcc for
-mcmodel=medium if you appropriately decorate variable declarations
with non-default visibility.
The original patch is here:
https://lore.kernel.org/linuxppc-dev/20210310034813.GM6042@bubble.grove.modra.org/
Signed-off-by: Alan Modra <amodra@au1.ibm.com>
[aik: removed non-relocatable which is gone in 24d33ac5b8]
[aik: added <=2.24 check]
[aik: because of llvm-as, kernel_toc_addr() uses "mr" instead of global register variable]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211221055904.555763-2-aik@ozlabs.ru
Some functions defined in `arch/powerpc/kernel` (and one in `arch/powerpc/
kexec`) are deserving of an `__init` macro attribute. These functions are
only called by other initialization functions and therefore should inherit
the attribute.
Also, change function declarations in header files to include `__init`.
Signed-off-by: Nick Child <nick.child@ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211216220035.605465-2-nick.child@ibm.com
Fix a recently introduced oops at boot on 85xx in some configurations.
Fix crashes when loading some livepatch modules with STRICT_MODULE_RWX.
Thanks to: Joe Lawrence, Russell Currey, Xiaoming Ni.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmG+q+UTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgG5sEACVsb39+Q8w0GKJxUGR0Wri1InJTkLC
UWpd9PVzG6n8+Epl0f+H5cMlv5V6xaDzTsNXEf27xIACMlPSVHVLDW8u/RFPizdp
fUDUweGUvK/M0GQW89nRV29DjdNjMmUq1KQFM6eEgYN/0nS3HqF27jamiaFhiZ+e
pJl1g2dy/DTWrFz3V/8b0nv6kyJE8GlpnCmTchDsPNDbzWQVJXVfM7IB/9N+EVna
MQAZCyKSOafTgNRPOW+Sm4oF64z+Yn4cxURVKfzbM+0Cd8pNIcGZ7McXrnp0DR4t
bLGeYEzyR1bPc3VTt7b+jafDP41AakRllKAFz65+pdYOcV1hTRM4inbLVPT5Bmz6
7ANlX/w4BwYWFk3mWW14BoxU02FlRKRddXhusq7x2wIqvtn1KI8Qm1xpneiGpcW+
GEe+k9ONGowXp/bc47/Co6dPSHKRRHUC3MsgbDv2Y7M5G9vsetckdQzA2bIqYSx8
r1ZXNiMwQarPeB6FrGvNbekyCsIhJkk6chhtL0k27BVOFqYiHZtaqBs+RCQYPdC8
Aih8WOEqdldbItFI3MihyKULucgkVpdnm80Ja67qRthuJICF9OML3mdrt83bxEIp
fu4Bsw8dygVKJWScZj8AUC/ZBxOrrTGyZR1RvsHpaAiEfinMkxxTnXjqNDiSZWXt
DKKLx6pfSFrokw==
=Nmp5
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fix a recently introduced oops at boot on 85xx in some configurations.
Fix crashes when loading some livepatch modules with
STRICT_MODULE_RWX.
Thanks to Joe Lawrence, Russell Currey, and Xiaoming Ni"
* tag 'powerpc-5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/module_64: Fix livepatching for RO modules
powerpc/85xx: Fix oops when CONFIG_FSL_PMC=n
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework
early_init_dt_scan_memory() to be called directly and use libfdt.
Cc: John Crispin <john@phrozen.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: linux-mips@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Frank Rowand <frank.rowand@sony.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211215150102.1303588-1-robh@kernel.org
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework early_init_dt_scan_root()
to be called directly and use libfdt.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Frank Rowand <frank.rowand@sony.com>
Link: https://lore.kernel.org/r/20211118181213.1433346-3-robh@kernel.org
Use of the of_scan_flat_dt() function predates libfdt and is discouraged
as libfdt provides a nicer set of APIs. Rework
early_init_dt_scan_chosen() to be called directly and use libfdt.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Reviewed-by: Frank Rowand <frank.rowand@sony.com>
Link: https://lore.kernel.org/r/20211118181213.1433346-2-robh@kernel.org
Reading the CFAR register is quite costly (~20 cycles on POWER9). It is
a good idea to have for most synchronous interrupts, but for async ones
it is much less important.
Doorbell, external, and decrementer interrupts are the important
asynchronous ones. HV interrupts can't skip CFAR if KVM HV is possible,
because it might be a guest exit that requires CFAR preserved. But the
important pseries interrupts can avoid loading CFAR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-7-npiggin@gmail.com
Enabling MSR[EE] in interrupt handlers while interrupts are still soft
masked allows PMIs to profile interrupt handlers to some degree, beyond
what SIAR latching allows.
When perf is not being used, this is almost useless work. It requires an
extra mtmsrd in the irq handler, and it also opens the door to masked
interrupts hitting and requiring replay, which is more expensive than
just taking them directly. This effect can be noticable in high IRQ
workloads.
Avoid enabling MSR[EE] unless perf is currently in use. This saves about
60 cycles (or 8%) on a simple decrementer interrupt microbenchmark.
Replayed interrupts drop from 1.4% of all interrupts taken, to 0.003%.
This does prevent the soft-nmi interrupt being taken in these handlers,
but that's not too reliable anyway. The SMP watchdog will continue to be
the reliable way to catch lockups.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-5-npiggin@gmail.com
The mtmsrd to enable MSR[RI] can be combined with the mtmsrd to enable
MSR[EE] in interrupt entry code, for those interrupts which enable EE.
This helps performance of important synchronous interrupts (e.g., page
faults).
This is similar to what commit dd152f70bd ("powerpc/64s: system call
avoid setting MSR[RI] until we set MSR[EE]") does for system calls.
Do this by enabling EE and RI together at the beginning of the entry
wrapper if PACA_IRQ_HARD_DIS is clear, and only enabling RI if it is
set.
Asynchronous interrupts set PACA_IRQ_HARD_DIS, but synchronous ones
leave it unchanged, so by default they always get EE=1 unless they have
interrupted a caller that is hard disabled. When the sync interrupt
later calls interrupt_cond_local_irq_enable(), it will not require
another mtmsrd because MSR[EE] was already enabled here.
This avoids one mtmsrd L=1 for synchronous interrupts on 64s, which
saves about 20 cycles on POWER9. And for kernel-mode interrupts, both
synchronous and asynchronous, this saves an additional 40 cycles due to
the mtmsrd being moved ahead of mfspr SPRN_AMR, which prevents a SPR
scoreboard stall.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210922145452.352571-3-npiggin@gmail.com
Livepatching a loaded module involves applying relocations through
apply_relocate_add(), which attempts to write to read-only memory when
CONFIG_STRICT_MODULE_RWX=y. Work around this by performing these
writes through the text poke area by using patch_instruction().
R_PPC_REL24 is the only relocation type generated by the kpatch-build
userspace tool or klp-convert kernel tree that I observed applying a
relocation to a post-init module.
A more comprehensive solution is planned, but using patch_instruction()
for R_PPC_REL24 on should serve as a sufficient fix.
This does have a performance impact, I observed ~15% overhead in
module_load() on POWER8 bare metal with checksum verification off.
Fixes: c35717c71e ("powerpc: Set ARCH_HAS_STRICT_MODULE_RWX")
Cc: stable@vger.kernel.org # v5.14+
Reported-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Russell Currey <ruscur@russell.cc>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
[mpe: Check return codes from patch_instruction()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211214121248.777249-1-mpe@ellerman.id.au
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Make arch_stack_walk() available for ARCH_STACKWALK architectures
without it being entangled in STACKTRACE.
Link: https://lore.kernel.org/lkml/20211022152104.356586621@infradead.org/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Mark: rebase, drop unnecessary arm change]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20211129142849.3056714-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
In panic path, fadump is triggered via a panic notifier function.
Before calling panic notifier functions, smp_send_stop() gets called,
which stops all CPUs except the panic'ing CPU. Commit 8389b37dff
("powerpc: stop_this_cpu: remove the cpu from the online map.") and
again commit bab26238bb ("powerpc: Offline CPU in stop_this_cpu()")
started marking CPUs as offline while stopping them. So, if a kernel
has either of the above commits, vmcore captured with fadump via panic
path would not process register data for all CPUs except the panic'ing
CPU. Sample output of crash-utility with such vmcore:
# crash vmlinux vmcore
...
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Wed Nov 10 09:56:34 EST 2021
UPTIME: 00:00:42
LOAD AVERAGE: 2.27, 0.69, 0.24
TASKS: 183
NODENAME: XXXXXXXXX
RELEASE: 5.15.0+
VERSION: #974 SMP Wed Nov 10 04:18:19 CST 2021
MACHINE: ppc64le (2500 Mhz)
MEMORY: 8 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 3394
COMMAND: "bash"
TASK: c0000000150a5f80 [THREAD_INFO: c0000000150a5f80]
CPU: 1
STATE: TASK_RUNNING (PANIC)
crash> p -x __cpu_online_mask
__cpu_online_mask = $1 = {
bits = {0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
crash>
crash> p -x __cpu_active_mask
__cpu_active_mask = $2 = {
bits = {0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
}
crash>
While this has been the case since fadump was introduced, the issue
was not identified for two probable reasons:
- In general, the bulk of the vmcores analyzed were from crash
due to exception.
- The above did change since commit 8341f2f222 ("sysrq: Use
panic() to force a crash") started using panic() instead of
deferencing NULL pointer to force a kernel crash. But then
commit de6e5d3841 ("powerpc: smp_send_stop do not offline
stopped CPUs") stopped marking CPUs as offline till kernel
commit bab26238bb ("powerpc: Offline CPU in stop_this_cpu()")
reverted that change.
To ensure post processing register data of all other CPUs happens
as intended, let panic() function take the crash friendly path (read
crash_smp_send_stop()) with the help of crash_kexec_post_notifiers
option. Also, as register data for all CPUs is captured by f/w, skip
IPI callbacks here for fadump, to avoid any complications in finding
the right backtraces.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-2-hbathini@linux.ibm.com
Kdump can be triggered after panic_notifers since commit f06e5153f4
("kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump
after panic_notifers") introduced crash_kexec_post_notifiers option.
But using this option would mean smp_send_stop(), that marks all other
CPUs as offline, gets called before kdump is triggered. As a result,
kdump routines fail to save other CPUs' registers. To fix this, kdump
friendly crash_smp_send_stop() function was introduced with kernel
commit 0ee59413c9 ("x86/panic: replace smp_send_stop() with kdump
friendly version in panic path"). Override this kdump friendly weak
function to handle crash_kexec_post_notifiers option appropriately
on powerpc.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
[Fixed signature of crash_stop_this_cpu() - reported by lkp@intel.com]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211207103719.91117-1-hbathini@linux.ibm.com
Unlike PPC64 ABI, PPC32 uses the stack to pass a parameter defined
as a struct, even when the struct has a single simple element.
To avoid that, define ppc_inst_t as u32 on PPC32.
Keep it as 'struct ppc_inst' when __CHECKER__ is defined so that
sparse can perform type checking.
Also revert commit 511eea5e2c ("powerpc/kprobes: Fix Oops by passing
ppc_inst as a pointer to emulate_step() on ppc32") as now the
instruction to be emulated is passed as a register to emulate_step().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c6d0c46f598f76ad0b0a88bc0d84773bd921b17c.1638208156.git.christophe.leroy@csgroup.eu
This adds KUAP support to 40x. This is done by checking
the content of SPRN_PID at the time user pgtable is loaded.
40x doesn't have KUEP, but KUAP implies KUEP because when the
PID doesn't match the page's PID, the page cannot be read nor
executed.
So KUEP is now automatically selected when KUAP is selected and
disabled when KUAP is disabled.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/aaefa91897ddc42ac11019dc0e1d1a525bd08e90.1634627931.git.christophe.leroy@csgroup.eu
On booke/40x we don't have segments like book3s/32.
On booke/40x we don't have access protection groups like 8xx.
Use the PID register to provide user access protection.
Kernel address space can be accessed with any PID.
User address space has to be accessed with the PID of the user.
User PID is always not null.
Everytime the kernel is entered, set PID register to 0 and
restore PID register when returning to user.
Everytime kernel needs to access user data, PID is restored
for the access.
In TLB miss handlers, check the PID and bail out to data storage
exception when PID is 0 and accessed address is in user space.
Note that also forbids execution of user text by kernel except
when user access is unlocked. But this shouldn't be a problem
as the kernel is not supposed to ever run user text.
This patch prepares the infrastructure but the real activation of KUAP
is done by following patches for each processor type one by one.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/5d65576a8e31e9480415785a180c92dd4e72306d.1634627931.git.christophe.leroy@csgroup.eu
Also call kuap_lock() and kuap_save_and_lock() from
interrupt functions with CONFIG_PPC64.
For book3s/64 we keep them empty as it is done in assembly.
Also do the locked assert when switching task unless it is
book3s/64.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1cbf94e26e6d6e2e028fd687588a7e6622d454a6.1634627931.git.christophe.leroy@csgroup.eu
We have many functionnalities common to 40x and BOOKE, it leads to
many places with #if defined(CONFIG_BOOKE) || defined(CONFIG_40x).
We are going to add a few more with KUAP for booke/40x, so create
a new symbol which is defined when either BOOKE or 40x is defined.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/9a3dbd60924cb25c9f944d3d8205ac5a0d15e229.1634627931.git.christophe.leroy@csgroup.eu
Add kuap_lock() and call it when entering interrupts from user.
It is called kuap_lock() as it is similar to kuap_save_and_lock()
without the save.
However book3s/32 already have a kuap_lock(). Rename it
kuap_lock_addr().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4437e2deb9f6f549f7089d45e9c6f96a7e77905a.1634627931.git.christophe.leroy@csgroup.eu
Calling 'mfsr' to get the content of segment registers is heavy,
in addition it requires clearing of the 'reserved' bits.
In order to avoid this operation, save it in mm context and in
thread struct.
The saved sr0 is the one used by kernel, this means that on
locking entry it can be used as is.
For unlocking, the only thing to do is to clear SR_NX.
This improves null_syscall selftest by 12 cycles, ie 4%.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/b02baf2ed8f09bad910dfaeeb7353b2ae6830525.1634627931.git.christophe.leroy@csgroup.eu
When interrupt and syscall entries where converted to C, KUEP locking
and unlocking was also converted. It improved performance by unrolling
the loop, and allowed easily implementing boot time deactivation of
KUEP.
However, null_syscall selftest shows that KUEP is still heavy
(361 cycles with KUEP, 212 cycles without).
A way to improve more is to group 'mtsr's together, instead of
repeating 'addi' + 'mtsr' several times.
In order to do that, more registers need to be available. In C, GCC
will always be able to provide the requested number of registers, but
at the cost of saving some data on the stack, which is counter
performant here.
So let's do it in assembly, when we have full control of which
register can be used. It also has the advantage of locking earlier
and unlocking later and it helps GCC generating less tricky code.
The only drawback is to make boot time deactivation less straight
forward and require 'hand' instruction patching.
Group 'mtsr's by 4.
With this change, null_syscall selftest reports 336 cycles. Without
the change it was 361 cycles, that's a 7% reduction.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/115cb279e9b9948dfd93a065e047081c59e3a2a6.1634627931.git.christophe.leroy@csgroup.eu
On book3e,
- When using 64 bits PTE: User pages don't have the SX bit defined
so KUEP is always active.
- When using 32 bits PTE: Implement KUEP by clearing SX bit during
TLB miss for user pages. The impact is minimal and worth neither
boot time nor build time selection.
Activate it at all time.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/e376b114283fb94504e2aa2de846780063252cde.1634627931.git.christophe.leroy@csgroup.eu
Compiling out hash support code when CONFIG_PPC_64S_HASH_MMU=n saves
128kB kernel image size (90kB text) on powernv_defconfig minus KVM,
350kB on pseries_defconfig minus KVM, 40kB on a tiny config.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fixup defined(ARCH_HAS_MEMREMAP_COMPAT_ALIGN), which needs CONFIG.
Fix radix_enabled() use in setup_initial_memory_limit(). Add some
stubs to reduce number of ifdefs.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-18-npiggin@gmail.com
This adds Kconfig selection which allows 64s hash MMU support to be
disabled. It can be disabled if radix support is enabled, the minimum
supported CPU type is POWER9 (or higher), and KVM is not selected.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-17-npiggin@gmail.com
Radix never sets mmu_linear_psize so it's always 4K, which causes pcpu
atom_size to always be PAGE_SIZE. 64e sets it to 1GB always.
Make paths for these platforms to be explicit about what value they set
atom_size to.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-12-npiggin@gmail.com
The radix test can exclude slb_flush_all_realmode() from being called
because flush_and_reload_slb() is only expected to flush ERAT when
called by flush_erat(), which is only on pre-ISA v3.0 CPUs that do not
support radix.
This helps the later change to make hash support configurable to not
introduce runtime changes to radix mode behaviour.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-9-npiggin@gmail.com
slb.c is hash-specific SLB management, but do_bad_slb_fault deals with
segment interrupts that occur with radix MMU as well.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211201144153.2456614-5-npiggin@gmail.com
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Add a struct_group() for the spe registers so that memset() can correctly reason
about the size:
In function 'fortify_memset_chk',
inlined from 'restore_user_regs.part.0' at arch/powerpc/kernel/signal_32.c:539:3:
>> include/linux/fortify-string.h:195:4: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
195 | __write_overflow_field();
| ^~~~~~~~~~~~~~~~~~~~~~~~
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211118203604.1288379-1-keescook@chromium.org
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Generally this is unlikely to cause a
problem in practice, but it is somewhat unsound, and KCSAN will
legitimately warn that there is a data race.
To avoid such issues, a snapshot of the flags has to be taken prior to
using them. Some places already use READ_ONCE() for that, others do not.
Convert them all to the new flag accessor helpers.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Link: https://lore.kernel.org/r/20211129130653.2037928-11-mark.rutland@arm.com
Some thread flags can be set remotely, and so even when IRQs are disabled,
the flags can change under our feet. Thus, when setting flags we must use
an atomic operation rather than a plain read-modify-write sequence, as a
plain read-modify-write may discard flags which are concurrently set by a
remote thread, e.g.
// task A // task B
tmp = A->thread_info.flags;
set_tsk_thread_flag(A, NEWFLAG_B);
tmp |= NEWFLAG_A;
A->thread_info.flags = tmp;
arch/powerpc/kernel/interrupt.c's system_call_exception() sets
_TIF_RESTOREALL in the thread info flags with a read-modify-write, which
may result in other flags being discarded.
Elsewhere in the file it uses clear_bits() to atomically remove flag bits,
so use set_bits() here for consistency with those.
There may be reasons (e.g. instrumentation) that prevent the use of
set_thread_flag() and clear_thread_flag() here, which would otherwise be
preferable.
Fixes: ae7aaecc3f ("powerpc/64s: system call rfscv workaround for TM bugs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Eirik Fuller <efuller@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Link: https://lore.kernel.org/r/20211129130653.2037928-10-mark.rutland@arm.com
module_alloc() first tries to allocate module text within 24 bits direct
jump from kernel text, and tries a wider allocation if first one fails.
When first allocation fails the following is observed in kernel logs:
vmap allocation for size 2400256 failed: use vmalloc=<size> to increase size
systemd-udevd: vmalloc error: size 2395133, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null)
CPU: 0 PID: 127 Comm: systemd-udevd Tainted: G W 5.15.5-gentoo-PowerMacG4 #9
Call Trace:
[e2a53a50] [c0ba0048] dump_stack_lvl+0x80/0xb0 (unreliable)
[e2a53a70] [c0540128] warn_alloc+0x11c/0x2b4
[e2a53b50] [c0531be8] __vmalloc_node_range+0xd8/0x64c
[e2a53c10] [c00338c0] module_alloc+0xa0/0xac
[e2a53c40] [c027a368] load_module+0x2ae0/0x8148
[e2a53e30] [c027fc78] sys_finit_module+0xfc/0x130
[e2a53f30] [c0035098] ret_from_syscall+0x0/0x28
...
Add __GFP_NOWARN flag to first allocation so that no warning appears
when it fails.
Reported-by: Erhard Furtner <erhard_f@mailbox.org>
Fixes: 2ec13df167 ("powerpc/modules: Load modules closer to kernel text")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/93c9b84d6ec76aaf7b4f03468e22433a6d308674.1638267035.git.christophe.leroy@csgroup.eu
Introduce macros that operate on a (start, end) range of GPRs, which
reduces lines of code and need to do mental arithmetic while reading the
code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211022061322.2671178-1-npiggin@gmail.com
The printk layer at the moment does not seem to have a good way to force
flush printk messages that are created in NMI context, except in the
panic path.
NMI-context printk messages normally get to the console with irq_work,
but that won't help if the CPU is stuck with irqs disabled, as can be
the case for hard lockup watchdog messages.
The watchdog currently flushes the printk buffers after detecting a
lockup on remote CPUs, but they may not have processed their NMI IPI
yet by that stage, or they may have self-detected a lockup in which
case they won't go via this NMI IPI path.
Improve the situation by having NMI-context mark a flag if it called
printk, and have watchdog timer interrupts check if that flag was set
and try to flush if it was. Latency is not a big problem because we
were already stuck for a while, just need to try to make sure the
messages eventually make it out.
Depends-on: 5d5e4522a7 ("printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211119113146.752759-6-npiggin@gmail.com
Unlike PPC64, PPC32 doesn't require any special compiler option
to get _mcount() call not clobbering registers.
Provide ftrace_regs_caller() and ftrace_regs_call() and activate
HAVE_DYNAMIC_FTRACE_WITH_REGS.
That's heavily copied from ftrace_64_mprofile.S
For the time being leave livepatching aside, it will come with
following patch.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1862dc7719855cc2a4eec80920d94c955877557e.1635423081.git.christophe.leroy@csgroup.eu
All functions calling _mcount do it exactly the same way, with the
following sequence of instructions:
c07de788: 7c 08 02 a6 mflr r0
c07de78c: 90 01 00 04 stw r0,4(r1)
c07de790: 4b 84 13 65 bl c001faf4 <_mcount>
Allthough LR is pushed on stack, it is still in r0 while entering
_mcount().
Function arguments are in r3-r10, so r11 and r12 are still available
at that point.
Do like PPC64 and use r12 to move LR into CTR, so that r0 is preserved
and doesn't need to be restored from the stack.
While at it, bring back the EXPORT_SYMBOL at the end of _mcount.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/24a3ba7db388537c44a038026f926d885372e6d3.1635423081.git.christophe.leroy@csgroup.eu
Prior to commit b1923caa6e ("powerpc: Merge 32-bit and 64-bit
setup_arch()") probe_machine() was called from setup_32/64.c and lived
in setup-common.c. But now it's only called from setup-common.c so it
can be static and __init, and we don't need the declaration in
machdep.h either.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211124093254.1054750-6-mpe@ellerman.id.au
setup_profiling_timer() is only needed when CONFIG_PROFILING is enabled.
Fixes the following W=1 warning when CONFIG_PROFILING=n:
linux/arch/powerpc/kernel/smp.c:1638:5: error: no previous prototype for ‘setup_profiling_timer’
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211124093254.1054750-5-mpe@ellerman.id.au
Fix KVM using a Power9 instruction on earlier CPUs, which could lead to the host SLB being
incorrectly invalidated and a subsequent host crash.
Fix kernel hardlockup on vmap stack overflow on 32-bit.
Thanks to: Christophe Leroy, Nicholas Piggin, Fabiano Rosas.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmGh96MTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgGHDEACu37n/NLFp3qzHyrk4fHN6sUQ0wGsE
oroKeIrTP2+JPsoEDd6Jd9oQmqmVLmVTp4y/Lu6AAJbK4p1He1iPIxIbZn12NwqY
/siCihV76PXR8748g1j8r3a3UYDmmvmllAIf9JdAn2I4kKPE3J37/1NEGSqZyJ80
HzkheTczUweANhVuG1RDh3WvnAKMuOvtrT/YgetVILig23oiUORQkZH5wcpScaCN
qShnqKSuRt8HWaF3uRh6D0sf4o0BWfZ6Pjyi7R/T7s2dQVgz0uCtu4MuyGirx/0J
uU3yDEZD24y6doyi6HUgAFGeGYqJXky2XDOMrgV6CNGFty9HFp6GgN6gtX353pZa
9Xplw4kg9UJwGcyk1BvgdQBHdc83y5zmr8c2PfhgiP/5rfEyCTYtHgZKUu5LLW8Y
ECYkSNnlAT68SancCmiXgos1heIiwB7922cRWJANfGEjItNBFldtb2QZio2bWOUX
g4gZ8HN2P0OmOxV1nnNkmIe+ZKLeHPN2N/FR7pUdkFk2+BLWFx+SbrceCCicwMF4
CUUzll8FMFmVAGjFivG/axzqlDWMY5lrDsFZamaWN3VQEo7TwL6l8/cDcFZF4QBq
viLWbi7sfQm6M3tZXFS465sG4SpUQ4lI8PKptuz5DMdr6WKisTLR7OF0klxsEwZj
ZcT8a/wzUIZ53w==
=MBp+
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fix KVM using a Power9 instruction on earlier CPUs, which could lead
to the host SLB being incorrectly invalidated and a subsequent host
crash.
Fix kernel hardlockup on vmap stack overflow on 32-bit.
Thanks to Christophe Leroy, Nicholas Piggin, and Fabiano Rosas"
* tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32: Fix hardlockup on vmap stack overflow
KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB
André Almeida sends an update for the newly added futex_waitv
syscall that was initially only added to a few architectures.
Some additional ones have since made it through architecture
maintainer trees, this finishes the remaining ones.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmGfrmIACgkQmmx57+YA
GNl0eg/9Eu2iliGxiQPzPw93a3DCYr7eiylbhHbgzvCbUtDL4cTE/0fcFGCYUoVu
k5UT/5ja3Q4z8wE+MUbxXf6igr+ANKJNwomKksTJV1Q6wKF23k5vUEd2ZXGyvrSk
F+2X0bpB0wZzmMBh98L6dWsEaDJ//8iRZNW4ErmPR4mp25iDqhzZ1k2RoRg9mYoM
/Lw6u5zlBiqAw5nnMc6BxPo8gJGwgcKDu4VDngpGDVGlvBHgd7Kf1AKi1+aBumml
WzVXEEdUp3LTj7O/8oU3dKPuZMhl+k/mKUvHn/cieG00FxfMpOnxj0gdaUGgG0O4
imhQquXtPueq/W3Uod+MhVulKR6AziEOsPhJayMGopLReE0spKDxcplC/OSbMhGj
0iA6auLdHpHBCg3KWZDj0Sjf2NwL9UCFVDrk8XAQQo7bY2U5LQbcbr0PS7iLlpzN
gIb+r2k3Azwk/Iqnvl2/SR1cZTx5KxfxCydT03vXaYQXhc5l+u4RvjOZcSGgZXqd
gzC78vd6fJGKyVeaL2xb6/w7qmA4DFd6sQAbgMlVFRRS6AIJrIdDEZAQztSJ+uYK
p9A2qzeXz+ftzonySbPHmV4RBGjSVVecUBjRmmDfKNbqmFUjjnCwCv+LeDXh4IS6
mjmSA/9liYdkHUaujwCLhOEskuCx9ZPeAsqmzXMER0BDzReVlH4=
=Vy2A
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic syscall table update from Arnd Bergmann:
"André Almeida sends an update for the newly added futex_waitv syscall
that was initially only added to a few architectures.
Some additional ones have since made it through architecture
maintainer trees, this finishes the remaining ones"
* tag 'asm-generic-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
futex: Wireup futex_waitv syscall
wd_smp_last_reset_tb now gets reset by watchdog_smp_panic() as part of
marking CPUs stuck and removing them from the pending mask before it
begins any printing. This causes last reset times reported to be off.
Fix this by reading it into a local variable before it gets reset.
Fixes: 76521c4b02 ("powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211125103346.1188958-1-npiggin@gmail.com
When taking watchdog actions, printing messages, comparing and
re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
and under wd_smp_lock or printing lock (if applicable).
This should keep timebase mostly monotonic with kernel log messages, and
could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
something a long way in the past, and causing other CPUs to appear to be
stuck.
These additional TB reads are all slowpath (lockup has been detected),
so performance does not matter.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211110025056.2084347-5-npiggin@gmail.com
There is a deadlock with the console_owner lock and the wd_smp_lock:
CPU x takes the console_owner lock
CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
CPU y detects deadlock, tries to print something and spins on console_owner
-> deadlock
Change the watchdog locking scheme so wd_smp_lock protects the watchdog
internal data, but "reporting" (printing, issuing NMI IPIs, taking any
action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
a problem but can not take the reporting lock, it just returns because
something else is already reporting. It will try again at some point.
Typically hard lockup watchdog report usefulness is not impacted due to
failure to spewing a large enough amount of data in as short a time as
possible, but by messages getting garbled.
Laurent debugged this and found the deadlock, and this patch is based on
his general approach to avoid expensive operations while holding the lock.
With the addition of the reporting exclusion.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
[np: rework to add reporting exclusion update changelog]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211110025056.2084347-4-npiggin@gmail.com
Most updates to wd_smp_cpus_pending are under lock except the watchdog
interrupt bit clear.
This can race with non-atomic RMW updates to the mask under lock, which
can happen in two instances:
Firstly, if another CPU detects this one is stuck, removes it from the
mask, mask becomes empty and is re-filled with non-atomic stores. This
is okay because it would re-fill the mask with this CPU's bit clear
anyway (because this CPU is now stuck), so it doesn't matter that the
bit clear update got "lost". Add a comment for this.
Secondly, if another CPU detects a different CPU is stuck and removes it
from the pending mask with a non-atomic store to bytes which also
include the bit of this CPU. This case can result in the bit clear being
lost and the end result being the bit is set. This should be so rare it
hardly matters, but to make things simpler to reason about just avoid
the non-atomic access for that case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211110025056.2084347-3-npiggin@gmail.com
It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.
Order the cpumask clear vs the subsequent test to close this race.
Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211110025056.2084347-2-npiggin@gmail.com
Provide API documentation for rtas_busy_delay_time(), explaining why we
return the same value for 9900 and -2.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211117060259.957178-3-nathanl@linux.ibm.com
Generally RTAS cannot block, and in PAPR it is required to return control
to the OS within a few tens of microseconds. In order to support operations
which may take longer to complete, many RTAS primitives can return
intermediate -2 ("busy") or 990x ("extended delay") values, which indicate
that the OS should reattempt the same call with the same arguments at some
point in the future.
Current versions of PAPR are less than clear about this, but the intended
meanings of these values in more detail are:
RTAS_BUSY (-2): RTAS has suspended a potentially long-running operation in
order to meet its latency obligation and give the OS the opportunity to
perform other work. RTAS can resume making progress as soon as the OS
reattempts the call.
RTAS_EXTENDED_DELAY_{MIN...MAX} (9900-9905): RTAS must wait for an external
event to occur or for internal contention to resolve before it can complete
the requested operation. The value encodes a non-binding hint as to roughly
how long the OS should wait before calling again, but the OS is allowed to
reattempt the call sooner or even immediately.
Linux of course must take its own CPU scheduling obligations into account
when handling these statuses; e.g. a task which receives an RTAS_BUSY
status should check whether to reschedule before it attempts the RTAS call
again to avoid starving other tasks.
rtas_busy_delay() is a helper function that "consumes" a busy or extended
delay status. Common usage:
int rc;
do {
rc = rtas_call(rtas_token("some-function"), ...);
} while (rtas_busy_delay(rc));
/* convert rc to Linux error value, etc */
If rc is a busy or extended delay status, the caller can rely on
rtas_busy_delay() to perform an appropriate sleep or reschedule and return
nonzero. Other statuses are handled normally by the caller.
The current implementation of rtas_busy_delay() both oversleeps and
overuses the CPU:
* It performs msleep() for all 990x and even when no delay is
suggested (-2), but this is understood to actually sleep for two jiffies
minimum in practice (20ms with HZ=100). 9900 (1ms) and 9901 (10ms)
appear to be the most common extended delay statuses, and the
oversleeping measurably lengthens DLPAR operations, which perform
many RTAS calls.
* It does not sleep on 990x unless need_resched() is true, causing code
like the loop above to needlessly retry, wasting CPU time.
Alter the logic to align better with the intended meanings:
* When passed RTAS_BUSY, perform cond_resched() and return without
sleeping. The caller should reattempt immediately
* Always sleep when passed an extended delay status, using usleep_range()
for precise shorter sleeps. Limit the sleep time to one second even
though there are higher architected values.
Change rtas_busy_delay()'s return type to bool to better reflect its usage,
and add kernel-doc.
rtas_busy_delay_time() is unchanged, even though it "incorrectly" returns 1
for RTAS_BUSY. There are users of that API with open-coded delay loops in
sensitive contexts that will have to be taken on an individual basis.
Brief results for addition and removal of 5GB memory on a small P9 PowerVM
partition follow. Load was generated with stress-ng --cpu N. For add,
elapsed time is greatly reduced without significant change in the number of
RTAS calls or time spent on CPU. For remove, elapsed time is modestly
reduced, with significant reductions in RTAS calls and time spent on CPU.
With no competing workload (- before, + after):
Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
- 1,935 probe:rtas_call # 0.003 M/sec ( +- 0.22% )
- 609.99 msec task-clock # 0.183 CPUs utilized ( +- 0.19% )
+ 1,956 probe:rtas_call # 0.003 M/sec ( +- 0.17% )
+ 618.56 msec task-clock # 0.278 CPUs utilized ( +- 0.11% )
- 3.3322 +- 0.0670 seconds time elapsed ( +- 2.01% )
+ 2.2222 +- 0.0416 seconds time elapsed ( +- 1.87% )
Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
- 6,224 probe:rtas_call # 0.008 M/sec ( +- 2.57% )
- 750.36 msec task-clock # 0.190 CPUs utilized ( +- 2.01% )
+ 843 probe:rtas_call # 0.003 M/sec ( +- 0.12% )
+ 250.66 msec task-clock # 0.068 CPUs utilized ( +- 0.17% )
- 3.9394 +- 0.0890 seconds time elapsed ( +- 2.26% )
+ 3.678 +- 0.113 seconds time elapsed ( +- 3.07% )
With all CPUs 100% busy (- before, + after):
Performance counter stats for 'bash -c echo "memory add count 20" > /sys/kernel/dlpar' (10 runs):
- 2,979 probe:rtas_call # 0.003 M/sec ( +- 0.12% )
- 1,096.62 msec task-clock # 0.105 CPUs utilized ( +- 0.10% )
+ 2,981 probe:rtas_call # 0.003 M/sec ( +- 0.22% )
+ 1,095.26 msec task-clock # 0.154 CPUs utilized ( +- 0.21% )
- 10.476 +- 0.104 seconds time elapsed ( +- 1.00% )
+ 7.1124 +- 0.0865 seconds time elapsed ( +- 1.22% )
Performance counter stats for 'bash -c echo "memory remove count 20" > /sys/kernel/dlpar' (10 runs):
- 2,702 probe:rtas_call # 0.004 M/sec ( +- 4.00% )
- 722.71 msec task-clock # 0.067 CPUs utilized ( +- 2.41% )
+ 1,246 probe:rtas_call # 0.003 M/sec ( +- 0.25% )
+ 487.73 msec task-clock # 0.049 CPUs utilized ( +- 0.20% )
- 10.829 +- 0.163 seconds time elapsed ( +- 1.51% )
+ 9.9887 +- 0.0866 seconds time elapsed ( +- 0.87% )
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211117060259.957178-2-nathanl@linux.ibm.com
Fix the following issues reported by kernel-doc:
$ scripts/kernel-doc -v -none arch/powerpc/kernel/rtas.c
arch/powerpc/kernel/rtas.c:810: info: Scanning doc for function rtas_activate_firmware
arch/powerpc/kernel/rtas.c:818: warning: contents before sections
arch/powerpc/kernel/rtas.c:841: info: Scanning doc for function rtas_call_reentrant
arch/powerpc/kernel/rtas.c:893: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* Find a specific pseries error log in an RTAS extended event log.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211116215806.928235-1-nathanl@linux.ibm.com
The EEH recovery logic in eeh_handle_normal_event() has some pretty strange
flow control. If we remove all the actual recovery logic we're left with
the following skeleton:
if (result != PCI_ERS_RESULT_DISCONNECT) {
...
}
if (result != PCI_ERS_RESULT_DISCONNECT) {
...
}
if (result == PCI_ERS_RESULT_NONE) {
...
}
if (result == PCI_ERS_RESULT_CAN_RECOVER) {
...
}
if (result == PCI_ERS_RESULT_CAN_RECOVER) {
...
}
if (result == PCI_ERS_RESULT_NEED_RESET) {
...
}
if ((result == PCI_ERS_RESULT_RECOVERED) ||
(result == PCI_ERS_RESULT_NONE)) {
...
goto out;
}
/*
* unsuccessful recovery / PCI_ERS_RESULT_DISCONECTED
* handling is here.
*/
...
out:
...
Most of the "if () { ... }" blocks above change "result" to
PCI_ERS_RESULT_DISCONNECTED if an error occurs in that recovery step. This
makes the control flow a bit confusing since it breaks the early-exit
pattern that is generally used in Linux. In any case we end up handling the
error in the final else block so why not just jump there directly? Doing so
also allows us to de-indent a bunch of code.
No functional changes.
[dja: rebase on top of linux-next + my preceeding refactor,
move clearing the EEH_DEV_NO_HANDLER bit above the first goto so that
it is always clear in the error handler code as it was before.]
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015070628.1331635-2-dja@axtens.net
The control flow of eeh_handle_normal_event() is a bit tricky.
Break out one of the error handling paths - rather than be in an else
block, we'll make it part of the regular body of the function and put a
'goto out;' in the true limb of the if.
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015070628.1331635-1-dja@axtens.net
for_each_node_by_type performs an of_node_get on each iteration, so
a break out of the loop requires an of_node_put.
A simplified version of the semantic patch that fixes this problem is as
follows (http://coccinelle.lip6.fr):
// <smpl>
@@
local idexpression n;
expression e;
@@
for_each_node_by_type(n,...) {
...
(
of_node_put(n);
|
e = n
|
+ of_node_put(n);
? break;
)
...
}
... when != n
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1448051604-25256-6-git-send-email-Julia.Lawall@lip6.fr
Linux implements SPR save/restore including storage space for registers
in the task struct for process context switching. Make use of this
similarly to the way we make use of the context switching fp/vec save
restore.
This improves code reuse, allows some stack space to be saved, and helps
with avoiding VRSAVE updates if they are not required.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-39-npiggin@gmail.com
This reduces the number of mtmsrd required to enable facility bits when
saving/restoring registers, by having the KVM code set all bits up front
rather than using individual facility functions that set their particular
MSR bits.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-20-npiggin@gmail.com
KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Setting this up helps KVM make these optimisations
without getting confused. Longer term the better approach might be to
move guest/host PMU switching to the perf subsystem.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Athira Jajeev <atrajeev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-12-npiggin@gmail.com
This register controls supervisor SPR modifications, and as such is only
relevant for KVM. KVM always sets AMOR to ~0 on guest entry, and never
restores it coming back out to the host, so it can be kept constant and
avoid the mtSPR in KVM guest entry.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-10-npiggin@gmail.com
Rather than have KVM look up the host timer and fiddle with the
irq-work internal details, have the powerpc/time.c code provide a
function for KVM to re-arm the Linux timer code when exiting a
guest.
This is implementation has an improvement over existing code of
marking a decrementer interrupt as soft-pending if a timer has
expired, rather than setting DEC to a -ve value, which tended to
cause host timers to take two interrupts (first hdec to exit the
guest, then the immediate dec).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-8-npiggin@gmail.com
On processors that don't suppress the HDEC exceptions when LPCR[HDICE]=0,
this could help reduce needless guest exits due to leftover exceptions on
entering the guest.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-6-npiggin@gmail.com
There is no need to save away the host DEC value, as it is derived
from the host timer subsystem which maintains the next timer time,
so it can be restored from there.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211123095231.1036501-5-npiggin@gmail.com
Since the commit c118c7303a ("powerpc/32: Fix vmap stack - Do not
activate MMU before reading task struct") a vmap stack overflow
results in a hard lockup. This is because emergency_ctx is still
addressed with its virtual address allthough data MMU is not active
anymore at that time.
Fix it by using a physical address instead.
Fixes: c118c7303a ("powerpc/32: Fix vmap stack - Do not activate MMU before reading task struct")
Cc: stable@vger.kernel.org # v5.10+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/ce30364fb7ccda489272af4a1612b6aa147e1d23.1637227521.git.christophe.leroy@csgroup.eu
Fix a bug in copying of sigset_t for 32-bit systems, which caused X to not start.
Fix handling of shared LSIs (rare) with the xive interrupt controller (Power9/10).
Fix missing TOC setup in some KVM code, which could result in oopses depending on kernel
data layout.
Fix DMA mapping when we have persistent memory and only one DMA window available.
Fix further problems with STRICT_KERNEL_RWX on 8xx, exposed by a recent fix.
A couple of other minor fixes.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Cédric Le Goater, Christian Zigotzky,
Christophe Leroy, Daniel Axtens, Finn Thain, Greg Kurz, Masahiro Yamada, Nicholas Piggin,
Uwe Kleine-König.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmGZzGMTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgBrRD/4qE1A3+nXe+uZRJM3H5F8C/Ui2I/1G
JPekyfW9aZklsv8SMlz8BotDTlK8vNwiEtkAuwqLOfPXPi1p/Y1do4sPtXAjUpuX
mXZP3G9K2xXmALLedXMjJNO6YJjTT5LE7OT42QziSfY1ScS7iqfGNANg1zRjkCRW
yf2cpBbMRnWdDhCgWyE/V/V4xdPyOTTnnWn3d4F3qNshV0luKgTJl/9yo0OmQrGe
/T4Cw8jG5p+pSblNyFaACnYlKWF4bYTQIl5NWsvJY0A2cg3I5ah6+hexdGRN/JdI
K3PWpJ8rx5RjICkTFE4cADI6xIF1bHhjMh3ytcaMH5USBMmW3fTUUfcFwjRkRDHa
b8Z6V631mgK1v3L0RlrAn+PZ9R212wpupvQT6YOf4pFb5+BzOyaCQCzyQv+BnwoI
Fwran0HEO6NUODq4off9MADEpNTjwhV2mDFojxiCJ9eb1oCIilLbs8BOUWRSHYe0
1S22pdj9XSR7yxXt5DnjQBwhR47ZS7D3jXf9gjbmJ/qn6cRPAFzt/m/woSY2Vv7T
UrZVjz5lb+skjij7vxw+L9jUIwLBd99cvBiHzJpWUNc0RTQeBlAh4QBK/1MNixCP
93LTN7tsRdGknLRTJ5yfRhEhwuhTTH8SEPp3H+qOZj9sXwq3Bftl4Nm40AgoATHO
G4kPlgrCMQBcRQ==
=Ss4y
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc fixes from Michael Ellerman:
- Fix a bug in copying of sigset_t for 32-bit systems, which caused X
to not start.
- Fix handling of shared LSIs (rare) with the xive interrupt controller
(Power9/10).
- Fix missing TOC setup in some KVM code, which could result in oopses
depending on kernel data layout.
- Fix DMA mapping when we have persistent memory and only one DMA
window available.
- Fix further problems with STRICT_KERNEL_RWX on 8xx, exposed by a
recent fix.
- A couple of other minor fixes.
Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Cédric Le Goater,
Christian Zigotzky, Christophe Leroy, Daniel Axtens, Finn Thain, Greg
Kurz, Masahiro Yamada, Nicholas Piggin, and Uwe Kleine-König.
* tag 'powerpc-5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/xive: Change IRQ domain to a tree domain
powerpc/8xx: Fix pinned TLBs with CONFIG_STRICT_KERNEL_RWX
powerpc/signal32: Fix sigset_t copy
powerpc/book3e: Fix TLBCAM preset at boot
powerpc/pseries/ddw: Do not try direct mapping with persistent memory and one window
powerpc/pseries/ddw: simplify enable_ddw()
powerpc/pseries/ddw: Revert "Extend upper limit for huge DMA window for persistent memory"
powerpc/pseries: Fix numa FORM2 parsing fallback code
powerpc/pseries: rename numa_dist_table to form2_distances
powerpc: clean vdso32 and vdso64 directories
powerpc/83xx/mpc8349emitx: Drop unused variable
KVM: PPC: Book3S HV: Use GLOBAL_TOC for kvmppc_h_set_dabr/xdabr()
Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
cleanups.
This is essentially the change you suggested with all of i's dotted
and the t's crossed so that ptrace can intercept all of the cases it
has been able to intercept the past, and all of the cases that made it
to exit without giving ptrace a chance still don't give ptrace a
chance"
* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Replace force_fatal_sig with force_exit_sig when in doubt
signal: Don't always set SA_IMMUTABLE for forced signals
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmGWF6YACgkQUqAMR0iA
lPJJ+RAAm9pi/EElKKl+lOlBl+ehJlKuNLnPQWFmmaRc9xd0ruUipp1nsaktLJ8f
R/PkSwR/YWpBWlF8P4o+x9sOFyTNyLasoHtqsinEcAJI4lb7d1KOrPliTXyr15Ai
A303djwJmwCw5KxAPOjkG/nMBlpMIAQRee9GDWs1ykfSlIsI4jp7isVbCFNCQNVF
auHYq1bfJ5MJYPjxIDZUt+NF7kg7dD4k4g+VCVjaH1u8pGeaCUCtnNjMFOk1XfU8
yFQnaDcrAu4zJPq3d74z4eN9Bk+su8+DhnfrAEFjuFxGTgYc2MyRt0gGFeiUtNs4
rvST/eHBO4zeuL18S8G+fLcig/9ZYE73xzjdOCzRzLDjn0VQr9t06ez1QqJOb4D6
A4SSufwek5NIqYKMlhV/az2EceQYK8Wv3KAz8w98KDfwvVVhUSgE23MbTCO0hvQU
PWF35d3hQ+9oH0ZGYRumb8OpXtKJ+2KmzyN8Z0xhivHFBIKlcW6IBGhWRANclJO8
jNAM3jiwi8fRDVM2wI1fmgeEmMhG+WuTI3dJVu3tu4vI923FW5GdY6ev5EvH0Ts0
khTwIjtmCHUJGSeWajy3Gi9irdyhPyPNRMqgal4GS+gGpVU2mMMKTG+NzxxtCRKR
BUgfCjFDoDJWrNWIzzOwNqgF0Y+V9GgCZOkb73u/y+xVx0Rmc6U=
=wbBy
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixes from Petr Mladek:
- Try to flush backtraces from other CPUs also on the local one. This
was a regression caused by printk_safe buffers removal.
- Remove header dependency warning.
* tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: Remove printk.h inclusion in percpu.h
printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces
As spotted and explained in commit c12ab8dbc4 ("powerpc/8xx: Fix
Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST"), the selection
of STRICT_KERNEL_RWX without selecting DEBUG_RODATA_TEST has spotted
the lack of the DIRTY bit in the pinned kernel data TLBs.
This problem should have been detected a lot earlier if things had
been working as expected. But due to an incredible level of chance or
mishap, this went undetected because of a set of bugs: In fact the
DTLBs were not pinned, because instead of setting the reserve bit
in MD_CTR, it was set in MI_CTR that is the register for ITLBs.
But then, another huge bug was there: the physical address was
reset to 0 at the boundary between RO and RW areas, leading to the
same physical space being mapped at both 0xc0000000 and 0xc8000000.
This had by miracle no consequence until now because the entry was
not really pinned so it was overwritten soon enough to go undetected.
Of course, now that we really pin the DTLBs, it must be fixed as well.
Fixes: f76c8f6d25 ("powerpc/8xx: Add function to set pinned TLBs")
Cc: stable@vger.kernel.org # v5.8+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Depends-on: c12ab8dbc4 ("powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a21e9a057fe2d247a535aff0d157a54eefee017a.1636963688.git.christophe.leroy@csgroup.eu
The conversion from __copy_from_user() to __get_user() by
commit d3ccc97815 ("powerpc/signal: Use __get_user() to copy
sigset_t") introduced a regression in __get_user_sigset() for
powerpc/32. The bug was subsequently moved into
unsafe_get_user_sigset().
The bug is due to the copied 64 bit value being truncated to
32 bits while being assigned to dst->sig[0]
The regression was reported by users of the Xorg packages distributed in
Debian/powerpc --
"The symptoms are that the fb screen goes blank, with the backlight
remaining on and no errors logged in /var/log; wdm (or startx) run
with no effect (I tried logging in in the blind, with no effect).
And they are hard to kill, requiring 'kill -KILL ...'"
Fix the regression by copying each word of the sigset, not only the
first one.
__get_user_sigset() was tentatively optimised to copy 64 bits at once
in order to minimise KUAP unlock/lock impact, but the unsafe variant
doesn't suffer that, so it can just copy words.
Fixes: 887f3ceb51 ("powerpc/signal32: Convert do_setcontext[_tm]() to user access block")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: Finn Thain <fthain@linux-m68k.org>
Reported-and-tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/99ef38d61c0eb3f79c68942deb0c35995a93a777.1636966353.git.christophe.leroy@csgroup.eu
Since commit bce74491c3 ("powerpc/vdso: fix unnecessary rebuilds of
vgettimeofday.o"), "make ARCH=powerpc clean" does not clean up the
arch/powerpc/kernel/{vdso32,vdso64} directories.
Use the subdir- trick to let "make clean" descend into them.
Fixes: bce74491c3 ("powerpc/vdso: fix unnecessary rebuilds of vgettimeofday.o")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211109185015.615517-1-masahiroy@kernel.org
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
printk from NMI context relies on irq work being raised on the local CPU
to print to console. This can be a problem if the NMI was raised by a
lockup detector to print lockup stack and regs, because the CPU may not
enable irqs (because it is locked up).
Introduce printk_trigger_flush() that can be called another CPU to try
to get those messages to the console, call that where printk_safe_flush
was previously called.
Fixes: 93d102f094 ("printk: remove safe buffers")
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211107045116.1754411-1-npiggin@gmail.com
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmGFXBkUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vx6Tg/7BsGWm8f+uw/mr9lLm47q2mc4XyoO
7bR9KDp5NM84W/8ZOU7dqqqsnY0ddrSOLBRyhJJYMW3SwJd1y1ajTBsL1Ujqv+eN
z+JUFmhq4Laqm4k6Spc9CEJE+Ol5P6gGUtxLYo6PM2R0VxnSs/rDxctT5i7YOpCi
COJ+NVT/mc/by2loz1kLTSR9GgtBBgd+Y8UA33GFbHKssROw02L0OI3wffp81Oba
EhMGPoD+0FndAniDw+vaOSoO+YaBuTfbM92T/O00mND69Fj1PWgmNWZz7gAVgsXb
3RrNENUFxgw6CDt7LZWB8OyT04iXe0R2kJs+PA9gigFCGbypwbd/Nbz5M7e9HUTR
ray+1EpZib6+nIksQBL2mX8nmtyHMcLiM57TOEhq0+ECDO640MiRm8t0FIG/1E8v
3ZYd9w20o/NxlFNXHxxpZ3D/osGH5ocyF5c5m1rfB4RGRwztZGL172LWCB0Ezz9r
eHB8sWxylxuhrH+hp2BzQjyddg7rbF+RA4AVfcQSxUpyV01hoRocKqknoDATVeLH
664nJIINFxKJFwfuL3E6OhrInNe1LnAhCZsHHqbS+NNQFgvPRznbixBeLkI9dMf5
Yf6vpsWO7ur8lHHbRndZubVu8nxklXTU7B/w+C11sq6k9LLRJSHzanr3Fn9WA80x
sznCxwUvbTCu1r0=
=nsMh
-----END PGP SIGNATURE-----
Merge tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Conserve IRQs by setting up portdrv IRQs only when there are users
(Jan Kiszka)
- Rework and simplify _OSC negotiation for control of PCIe features
(Joerg Roedel)
- Remove struct pci_dev.driver pointer since it's redundant with the
struct device.driver pointer (Uwe Kleine-König)
Resource management:
- Coalesce contiguous host bridge apertures from _CRS to accommodate
BARs that cover more than one aperture (Kai-Heng Feng)
Sysfs:
- Check CAP_SYS_ADMIN before parsing user input (Krzysztof
Wilczyński)
- Return -EINVAL consistently from "store" functions (Krzysztof
Wilczyński)
- Use sysfs_emit() in endpoint "show" functions to avoid buffer
overruns (Kunihiko Hayashi)
PCIe native device hotplug:
- Ignore Link Down/Up caused by resets during error recovery so
endpoint drivers can remain bound to the device (Lukas Wunner)
Virtualization:
- Avoid bus resets on Atheros QCA6174, where they hang the device
(Ingmar Klein)
- Work around Pericom PI7C9X2G switch packet drop erratum by using
store and forward mode instead of cut-through (Nathan Rossi)
- Avoid trying to enable AtomicOps on VFs; the PF setting applies to
all VFs (Selvin Xavier)
MSI:
- Document that /sys/bus/pci/devices/.../irq contains the legacy INTx
interrupt or the IRQ of the first MSI (not MSI-X) vector (Barry
Song)
VPD:
- Add pci_read_vpd_any() and pci_write_vpd_any() to access anywhere
in the possible VPD space; use these to simplify the cxgb3 driver
(Heiner Kallweit)
Peer-to-peer DMA:
- Add (not subtract) the bus offset when calculating DMA address
(Wang Lu)
ASPM:
- Re-enable LTR at Downstream Ports so they don't report Unsupported
Requests when reset or hot-added devices send LTR messages
(Mingchuang Qiao)
Apple PCIe controller driver:
- Add driver for Apple M1 PCIe controller (Alyssa Rosenzweig, Marc
Zyngier)
Cadence PCIe controller driver:
- Return success when probe succeeds instead of falling into error
path (Li Chen)
HiSilicon Kirin PCIe controller driver:
- Reorganize PHY logic and add support for external PHY drivers
(Mauro Carvalho Chehab)
- Support PERST# GPIOs for HiKey970 external PEX 8606 bridge (Mauro
Carvalho Chehab)
- Add Kirin 970 support (Mauro Carvalho Chehab)
- Make driver removable (Mauro Carvalho Chehab)
Intel VMD host bridge driver:
- If IOMMU supports interrupt remapping, leave VMD MSI-X remapping
enabled (Adrian Huang)
- Number each controller so we can tell them apart in
/proc/interrupts (Chunguang Xu)
- Avoid building on UML because VMD depends on x86 bare metal APIs
(Johannes Berg)
Marvell Aardvark PCIe controller driver:
- Define macros for PCI_EXP_DEVCTL_PAYLOAD_* (Pali Rohár)
- Set Max Payload Size to 512 bytes per Marvell spec (Pali Rohár)
- Downgrade PIO Response Status messages to debug level (Marek Behún)
- Preserve CRS SV (Config Request Retry Software Visibility) bit in
emulated Root Control register (Pali Rohár)
- Fix issue in configuring reference clock (Pali Rohár)
- Don't clear status bits for masked interrupts (Pali Rohár)
- Don't mask unused interrupts (Pali Rohár)
- Avoid code repetition in advk_pcie_rd_conf() (Marek Behún)
- Retry config accesses on CRS response (Pali Rohár)
- Simplify emulated Root Capabilities initialization (Pali Rohár)
- Fix several link training issues (Pali Rohár)
- Fix link-up checking via LTSSM (Pali Rohár)
- Fix reporting of Data Link Layer Link Active (Pali Rohár)
- Fix emulation of W1C bits (Marek Behún)
- Fix MSI domain .alloc() method to return zero on success (Marek
Behún)
- Read entire 16-bit MSI vector in MSI handler, not just low 8 bits
(Marek Behún)
- Clear Root Port I/O Space, Memory Space, and Bus Master Enable bits
at startup; PCI core will set those as necessary (Pali Rohár)
- When operating as a Root Port, set class code to "PCI Bridge"
instead of the default "Mass Storage Controller" (Pali Rohár)
- Add emulation for PCI_BRIDGE_CTL_BUS_RESET since aardvark doesn't
implement this per spec (Pali Rohár)
- Add emulation of option ROM BAR since aardvark doesn't implement
this per spec (Pali Rohár)
MediaTek MT7621 PCIe controller driver:
- Add MediaTek MT7621 PCIe host controller driver and DT binding
(Sergio Paracuellos)
Qualcomm PCIe controller driver:
- Add SC8180x compatible string (Bjorn Andersson)
- Add endpoint controller driver and DT binding (Manivannan
Sadhasivam)
- Restructure to use of_device_get_match_data() (Prasad Malisetty)
- Add SC7280-specific pcie_1_pipe_clk_src handling (Prasad Malisetty)
Renesas R-Car PCIe controller driver:
- Remove unnecessary includes (Geert Uytterhoeven)
Rockchip DesignWare PCIe controller driver:
- Add DT binding (Simon Xue)
Socionext UniPhier Pro5 controller driver:
- Serialize INTx masking/unmasking (Kunihiko Hayashi)
Synopsys DesignWare PCIe controller driver:
- Run dwc .host_init() method before registering MSI interrupt
handler so we can deal with pending interrupts left by bootloader
(Bjorn Andersson)
- Clean up Kconfig dependencies (Andy Shevchenko)
- Export symbols to allow more modular drivers (Luca Ceresoli)
TI DRA7xx PCIe controller driver:
- Allow host and endpoint drivers to be modules (Luca Ceresoli)
- Enable external clock if present (Luca Ceresoli)
TI J721E PCIe driver:
- Disable PHY when probe fails after initializing it (Christophe
JAILLET)
MicroSemi Switchtec management driver:
- Return error to application when command execution fails because an
out-of-band reset has cleared the device BARs, Memory Space Enable,
etc (Kelvin Cao)
- Fix MRPC error status handling issue (Kelvin Cao)
- Mask out other bits when reading of management VEP instance ID
(Kelvin Cao)
- Return EOPNOTSUPP instead of ENOTSUPP from sysfs show functions
(Kelvin Cao)
- Add check of event support (Logan Gunthorpe)
Miscellaneous:
- Remove unused pci_pool wrappers, which have been replaced by
dma_pool (Cai Huoqing)
- Use 'unsigned int' instead of bare 'unsigned' (Krzysztof
Wilczyński)
- Use kstrtobool() directly, sans strtobool() wrapper (Krzysztof
Wilczyński)
- Fix some sscanf(), sprintf() format mismatches (Krzysztof
Wilczyński)
- Update PCI subsystem information in MAINTAINERS (Krzysztof
Wilczyński)
- Correct some misspellings (Krzysztof Wilczyński)"
* tag 'pci-v5.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (137 commits)
PCI: Add ACS quirk for Pericom PI7C9X2G switches
PCI: apple: Configure RID to SID mapper on device addition
iommu/dart: Exclude MSI doorbell from PCIe device IOVA range
PCI: apple: Implement MSI support
PCI: apple: Add INTx and per-port interrupt support
PCI: kirin: Allow removing the driver
PCI: kirin: De-init the dwc driver
PCI: kirin: Disable clkreq during poweroff sequence
PCI: kirin: Move the power-off code to a common routine
PCI: kirin: Add power_off support for Kirin 960 PHY
PCI: kirin: Allow building it as a module
PCI: kirin: Add MODULE_* macros
PCI: kirin: Add Kirin 970 compatible
PCI: kirin: Support PERST# GPIOs for HiKey970 external PEX 8606 bridge
PCI: apple: Set up reference clocks when probing
PCI: apple: Add initial hardware bring-up
PCI: of: Allow matching of an interrupt-map local to a PCI device
of/irq: Allow matching of an interrupt-map local to an interrupt controller
irqdomain: Make of_phandle_args_to_fwspec() generally available
PCI: Do not enable AtomicOps on VFs
...
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org> [kselftest]
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename memblock_free_ptr() to memblock_free() and use memblock_free()
when freeing a virtual pointer so that memblock_free() will be a
counterpart of memblock_alloc()
The callers are updated with the below semantic patch and manual
addition of (void *) casting to pointers that are represented by
unsigned long variables.
@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)
[sfr@canb.auug.org.au: fixup]
Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since memblock_free() operates on a physical range, make its name
reflect it and rename it to memblock_phys_free(), so it will be a
logical counterpart to memblock_phys_alloc().
The callers are updated with the below semantic patch:
@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);
Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Drop the struct pci_dev.driver pointer, which is redundant with the
struct device.driver pointer (Uwe Kleine-König)
* pci/driver:
PCI: Remove struct pci_dev->driver
PCI: Use to_pci_driver() instead of pci_dev->driver
x86/pci/probe_roms: Use to_pci_driver() instead of pci_dev->driver
perf/x86/intel/uncore: Use to_pci_driver() instead of pci_dev->driver
powerpc/eeh: Use to_pci_driver() instead of pci_dev->driver
usb: xhci: Use to_pci_driver() instead of pci_dev->driver
cxl: Use to_pci_driver() instead of pci_dev->driver
cxl: Factor out common dev->driver expressions
xen/pcifront: Use to_pci_driver() instead of pci_dev->driver
xen/pcifront: Drop pcifront_common_process() tests of pcidev, pdrv
nfp: use dev_driver_string() instead of pci_dev->driver->name
mlxsw: pci: Use dev_driver_string() instead of pci_dev->driver->name
net: marvell: prestera: use dev_driver_string() instead of pci_dev->driver->name
net: hns3: use dev_driver_string() instead of pci_dev->driver->name
crypto: hisilicon - use dev_driver_string() instead of pci_dev->driver->name
powerpc/eeh: Use dev_driver_string() instead of struct pci_dev->driver->name
ssb: Use dev_driver_string() instead of pci_dev->driver->name
bcma: simplify reference to driver name
crypto: qat - simplify adf_enable_aer()
scsi: message: fusion: Remove unused mpt_pci driver .probe() 'id' parameter
PCI/ERR: Factor out common dev->driver expressions
PCI: Drop pci_device_probe() test of !pci_dev->driver
PCI: Drop pci_device_remove() test of pci_dev->driver
PCI: Return NULL for to_pci_driver(NULL)
- Enable STRICT_KERNEL_RWX for Freescale 85xx platforms.
- Activate CONFIG_STRICT_KERNEL_RWX by default, while still allowing it to be disabled.
- Add support for out-of-line static calls on 32-bit.
- Fix oopses doing bpf-to-bpf calls when STRICT_KERNEL_RWX is enabled.
- Fix boot hangs on e5500 due to stale value in ESR passed to do_page_fault().
- Fix several bugs on pseries in handling of device tree cache information for hotplugged
CPUs, and/or during partition migration.
- Various other small features and fixes.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Anatolij Gustschin, Andrew Donnellan,
Athira Rajeev, Bixuan Cui, Bjorn Helgaas, Cédric Le Goater, Christophe Leroy, Daniel
Axtens, Daniel Henrique Barboza, Denis Kirjanov, Fabiano Rosas, Frederic Barrat, Gustavo
A. R. Silva, Hari Bathini, Jacques de Laval, Joel Stanley, Kai Song, Kajol Jain, Laurent
Vivier, Leonardo Bras, Madhavan Srinivasan, Nathan Chancellor, Nathan Lynch, Naveen N.
Rao, Nicholas Piggin, Nick Desaulniers, Niklas Schnelle, Oliver O'Halloran, Rob Herring,
Russell Currey, Srikar Dronamraju, Stan Johnson, Tyrel Datwyler, Uwe Kleine-König, Vasant
Hegde, Wan Jiabing, Xiaoming Ni,
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmGFDPoTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgNbsEACVczMVwMBEny5a7W1tqq1bnY9RFVw3
K+/rE7/FpSLX+RrwgoMkmqfPvfyc9WISVLlIQDKz4XhkBjaXv0+Y4OMsymuDCbxL
Qk7F1ff22UfLmPjjJk39gHJ8QZQqZk3wmFu2QzTO4yBZbz2SqqXFLxwyoLpZ0LJ8
pdGl51+bIsTkDJzrdkhX9X4AKS/fYyjbQxq5u7FS89ZCCs+KvzjLcDRo0GZYaOK/
hgDBa60DCCszL/9yjbh0ANZxmM2Z3+6AXkvAAXrtXzIGk4JzenZfiV+VEzmq8Tt0
UpWSsUEe7VgykMR3MTrL9G8op70PpKX6OMUPegJq4iRQ24h4mpFCK4oV9OMKJqpF
ifN9NO2ZZKOz1ke4l7Xe8SEHLX7rq5U/P7INh3AsKYNYwo6HkJhSPxiCBWUTlnZ3
OYoZ7czyO4gMPHWP92z4CoSiTYVBFuyhYexRcnQskg60TIwbr+lMXziRyPRGI8b6
taf2rD8eAiyUJnvkFUsyAHtYHpkSkuMeiVqY2CDQdh2SdtIFgwKzB2RjFL0gzaBZ
XP9RWD+HernGQAJSlIk7cVthont3JHklcKk+ohhDbsWzPeUJcz6t4ChtgRq0x4q4
Hpes1lsVfXpyxj5ouBK/E/t+diwPvBLM0dCcarQJE6ExgMzBC/C7Br7jCSgyVJA2
VhtcZaCYh+vRlQ==
=f7HE
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Enable STRICT_KERNEL_RWX for Freescale 85xx platforms.
- Activate CONFIG_STRICT_KERNEL_RWX by default, while still allowing it
to be disabled.
- Add support for out-of-line static calls on 32-bit.
- Fix oopses doing bpf-to-bpf calls when STRICT_KERNEL_RWX is enabled.
- Fix boot hangs on e5500 due to stale value in ESR passed to
do_page_fault().
- Fix several bugs on pseries in handling of device tree cache
information for hotplugged CPUs, and/or during partition migration.
- Various other small features and fixes.
Thanks to Alexey Kardashevskiy, Alistair Popple, Anatolij Gustschin,
Andrew Donnellan, Athira Rajeev, Bixuan Cui, Bjorn Helgaas, Cédric Le
Goater, Christophe Leroy, Daniel Axtens, Daniel Henrique Barboza, Denis
Kirjanov, Fabiano Rosas, Frederic Barrat, Gustavo A. R. Silva, Hari
Bathini, Jacques de Laval, Joel Stanley, Kai Song, Kajol Jain, Laurent
Vivier, Leonardo Bras, Madhavan Srinivasan, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nicholas Piggin, Nick Desaulniers, Niklas
Schnelle, Oliver O'Halloran, Rob Herring, Russell Currey, Srikar
Dronamraju, Stan Johnson, Tyrel Datwyler, Uwe Kleine-König, Vasant
Hegde, Wan Jiabing, and Xiaoming Ni,
* tag 'powerpc-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (73 commits)
powerpc/8xx: Fix Oops with STRICT_KERNEL_RWX without DEBUG_RODATA_TEST
powerpc/32e: Ignore ESR in instruction storage interrupt handler
powerpc/powernv/prd: Unregister OPAL_MSG_PRD2 notifier during module unload
powerpc: Don't provide __kernel_map_pages() without ARCH_SUPPORTS_DEBUG_PAGEALLOC
MAINTAINERS: Update powerpc KVM entry
powerpc/xmon: fix task state output
powerpc/44x/fsp2: add missing of_node_put
powerpc/dcr: Use cmplwi instead of 3-argument cmpli
KVM: PPC: Tick accounting should defer vtime accounting 'til after IRQ handling
powerpc/security: Use a mutex for interrupt exit code patching
powerpc/83xx/mpc8349emitx: Make mcu_gpiochip_remove() return void
powerpc/fsl_booke: Fix setting of exec flag when setting TLBCAMs
powerpc/book3e: Fix set_memory_x() and set_memory_nx()
powerpc/nohash: Fix __ptep_set_access_flags() and ptep_set_wrprotect()
powerpc/bpf: Fix write protecting JIT code
selftests/powerpc: Use date instead of EPOCHSECONDS in mitigation-patching.sh
powerpc/64s/interrupt: Fix check_return_regs_valid() false positive
powerpc/boot: Set LC_ALL=C in wrapper script
powerpc/64s: Default to 64K pages for 64 bit book3s
Revert "powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC"
...
- Convert /reserved-memory bindings to schemas
- Convert a bunch of NFC bindings to schemas
- Convert bindings to schema: Xilinx USB, Freescale DDR controller, Arm
CCI-400, UBlox Neo-6M, 1-Wire GPIO, MSI controller, ASpeed LPC, OMAP
and Inside-Secure HWRNG, register-bit-led, OV5640, Silead GSL1680,
Elan ekth3000, Marvell bluetooth, TI wlcore, TI bluetooth, ESP ESP8089,
tlm,trusted-foundations, Microchip cap11xx, Ralink SoCs and boards,
and TI sysc
- New binding schemas for: msi-ranges, Aspeed UART routing controller,
palmbus, Xylon LogiCVC display controller, Mediatek's MT7621 SDRAM
memory controller, and Apple M1 PCIe host
- Run schema checks for %.dtb targets
- Improve build time when using DT_SCHEMA_FILES
- Improve error message when dtschema is not found
- Various doc reference fixes in MAINTAINERS
- Convert architectures to common CPU h/w ID parsing function
of_get_cpu_hwid().
- Allow for empty NUMA node IDs which may be hotplugged
- Cleanup of __fdt_scan_reserved_mem()
- Constify device_node parameters
- Update dtc to upstream v1.6.1-19-g0a3a9d3449c8. Adds new checks
'node_name_vs_property_name' and 'interrupt_map'.
- Enable dtc 'unit_address_format' warning by default
- Fix unittest EXPECT text for gpio hog errors
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmGBrj8QHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw3M1D/9gpaVBqp+Q5hZZLWOjz/WkAsExZ71N/8Lh
rn64XWYQNJ6R1PINkBtlooJy6wTCIMfNs3IEmkAVEXVEj1Nvu7uEZwYbb96B4dJ4
EiMv/Vz0EphoqnBvICT86XfNZduP1sZ5M11pdv2dNvwJrEvvi98VLDvSucvxorn8
sm5jsqWOAwroiCR+u8BWW3qH3sugL1BOAwraMoUbosZAo0SpNH4WBdcBz4+v8lUS
5N8Y8Q6dB6fEqdbVpzMblN2B9c/TEb1VYaeGXRUyQsIUQJajX3xnR8RDnTKLBtsS
FAKGQORemLwVzBVKeZKbhlqXAJbl701LuKHRLiVerb9UGi+tk4AX9Rgg1Whrp7w4
UYi+k4Ozus1vDaKsemB1voabSgYYY+aNTRezltdtPz0a+eQJWPUt1xQB5m68cGO4
TZI+KfExxyGVa8iDgv4AWhvXqbR3+PUTUvel2xEIkRscWmMjXF/+oQXy8QYn2Aok
S9750/3EUQCbKi9ZUjPLRzd5CuPP2E97i8V2WdOgRse3+H7pPg5IcEq7oQYe9A62
SnRFjPz1X5g4Hh3bRVmcAGmDzbZJrl9dULvYVdiUWiqzfmHxN7MXO9FIxv3NKVfp
6jgr5vVVi1ShDnCh3ns4mYUwQ7j72dsONyklbVBbNtGjeeZcv5MEeg9ZAoVvO+lh
9DNNSGSd2g==
=dQa6
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
- Convert /reserved-memory bindings to schemas
- Convert a bunch of NFC bindings to schemas
- Convert bindings to schema: Xilinx USB, Freescale DDR controller, Arm
CCI-400, UBlox Neo-6M, 1-Wire GPIO, MSI controller, ASpeed LPC, OMAP
and Inside-Secure HWRNG, register-bit-led, OV5640, Silead GSL1680,
Elan ekth3000, Marvell bluetooth, TI wlcore, TI bluetooth, ESP
ESP8089, tlm,trusted-foundations, Microchip cap11xx, Ralink SoCs and
boards, and TI sysc
- New binding schemas for: msi-ranges, Aspeed UART routing controller,
palmbus, Xylon LogiCVC display controller, Mediatek's MT7621 SDRAM
memory controller, and Apple M1 PCIe host
- Run schema checks for %.dtb targets
- Improve build time when using DT_SCHEMA_FILES
- Improve error message when dtschema is not found
- Various doc reference fixes in MAINTAINERS
- Convert architectures to common CPU h/w ID parsing function
of_get_cpu_hwid().
- Allow for empty NUMA node IDs which may be hotplugged
- Cleanup of __fdt_scan_reserved_mem()
- Constify device_node parameters
- Update dtc to upstream v1.6.1-19-g0a3a9d3449c8. Adds new checks
'node_name_vs_property_name' and 'interrupt_map'.
- Enable dtc 'unit_address_format' warning by default
- Fix unittest EXPECT text for gpio hog errors
* tag 'devicetree-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (97 commits)
dt-bindings: net: ti,bluetooth: Document default max-speed
dt-bindings: pci: rcar-pci-ep: Document r8a7795
dt-bindings: net: qcom,ipa: IPA does support up to two iommus
of/fdt: Remove of_scan_flat_dt() usage for __fdt_scan_reserved_mem()
of: unittest: document intentional interrupt-map provider build warning
of: unittest: fix EXPECT text for gpio hog errors
of/unittest: Disable new dtc node_name_vs_property_name and interrupt_map warnings
scripts/dtc: Update to upstream version v1.6.1-19-g0a3a9d3449c8
dt-bindings: arm: firmware: tlm,trusted-foundations: Convert txt bindings to yaml
dt-bindings: display: tilcd: Fix endpoint addressing in example
dt-bindings: input: microchip,cap11xx: Convert txt bindings to yaml
dt-bindings: ufs: exynos-ufs: add exynosautov9 compatible
dt-bindings: ufs: exynos-ufs: add io-coherency property
dt-bindings: mips: convert Ralink SoCs and boards to schema
dt-bindings: display: xilinx: Fix example with psgtr
dt-bindings: net: nfc: nxp,pn544: Convert txt bindings to yaml
dt-bindings: Add a help message when dtschema tools are missing
dt-bindings: bus: ti-sysc: Update to use yaml binding
dt-bindings: sram: Allow numbers in sram region node name
dt-bindings: display: Document the Xylon LogiCVC display controller
...
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock. In the most basic scenario, that buffer will not be
resident and it will be mapped to the same file. Accessing the buffer
will trigger a page fault, and gfs2 will deadlock trying to take the
same inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
5CmgeXto8FRUFjI=
=WeNd
-----END PGP SIGNATURE-----
Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
"Functions gfs2_file_read_iter and gfs2_file_write_iter are both
accessing the user buffer to write to or read from while holding the
inode glock.
In the most basic deadlock scenario, that buffer will not be resident
and it will be mapped to the same file. Accessing the buffer will
trigger a page fault, and gfs2 will deadlock trying to take the same
inode glock again while trying to handle that fault.
Fix that and similar, more complex scenarios by disabling page faults
while accessing user buffers. To make this work, introduce a small
amount of new infrastructure and fix some bugs that didn't trigger so
far, with page faults enabled"
* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Fix mmap + page fault deadlocks for direct I/O
iov_iter: Introduce nofault flag to disable page faults
gup: Introduce FOLL_NOFAULT flag to disable page faults
iomap: Add done_before argument to iomap_dio_rw
iomap: Support partial direct I/O on user copy failures
iomap: Fix iomap_dio_rw return value for user copies
gfs2: Fix mmap + page fault deadlocks for buffered I/O
gfs2: Eliminate ip->i_gh
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Clean up function may_grant
gfs2: Add wrapper for iomap_file_buffered_write
iov_iter: Introduce fault_in_iov_iter_writeable
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
powerpc/kvm: Fix kvm_use_magic_page
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmGANdUUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOmihAAgKSTv4Jf0s4yopdcxfuLweiyqHX1
719QJzdLZohmllrJPq/83FZL9qodCzxy87nAm67Ht0baSKiEjtVgRaVCqJWEE+l6
oQL+wUsGLP7CmExOP503Uh6tW35AhETQA4Uwu6QtiUYLYG17kAgeR3cTFuekUsJS
iL4K65PXE2bBxMe7Ta1YIZqcxptbknMgpqYkdne7xs7RS+UiVj8TyRle6ACrfzEX
IVy4LTk+spHCy1a494g9pt/21xOnbiLHr/FpckALscnvJiUThxbfQHGSQeMpM4uM
BnwCqFrj860vMeh52M11/GAAXmdPh6AjoLhaSIW2I3M2GbV8ZP2hu1HYUz3osmrT
f+aeMPJ4feX1xVj6qAC+1G83XRO83tP/YIEuocGiwyepImB25NHPin21xepf6Ru0
wJX+aXC9O1eG6E2ghT6tBim/MpeNH5OT0hNO3uhGmEQ6xZpArRVVaBwlEdufJiCx
ZljqEFUT7wA9nGEQif6GdLnGezGr/aNL65caTkIAzHKamd79QIr7VZXYjYIfHSqE
p74Aro6E8qoQJjsTSkvZceM0u1LRzwS4wPRroE6eGz98oYDpiDm1RPb+9Gw5jyJf
JN7UjJKO9+iPGAi3KivGBqpBskw4cCp2y/nHrMYmpGUPELcr5kQtDfQ6yp59tVZ8
Dwo5GeSlG6khmiI=
=WrEw
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Add some additional audit logging to capture the openat2() syscall
open_how struct info.
Previous variations of the open()/openat() syscalls allowed audit
admins to inspect the syscall args to get the information contained in
the new open_how struct used in openat2()"
* tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: return early if the filter rule has a lower priority
audit: add OPENAT2 record to list "how" info
audit: add support for the openat2 syscall
audit: replace magic audit syscall class numbers with macros
lsm_audit: avoid overloading the "key" audit field
audit: Convert to SPDX identifier
audit: rename struct node to struct audit_node to prevent future name collisions
- kprobes: Restructured stack unwinder to show properly on x86 when a stack
dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only denying
others). There's been pressure to allow non root to tracefs in a
controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer
instead of having its own trampoline (this change will happen on an arch
by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations
against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent warnings
from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if
branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYBdxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qp1sAQD2oYFwaG3sx872gj/myBcHIBSKdiki
Hry5csd8zYDBpgD+Poylopt5JIbeDuoYw/BedgEXmscZ8Qr7VzjAXdnv/Q4=
=Loz8
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Cross-architecture update to move task_struct::cpu back into thread_info
on arm64, x86, s390, powerpc, and riscv. All Acked by arch maintainers.
Quoting Ard Biesheuvel:
"Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that define
raw_smp_processor_id() in terms of this field, as it requires
linux/sched.h to be included, which causes a lot of pain in terms of
circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from) for all
architectures that enable THREAD_INFO_IN_TASK, addressing the header
soup issue as well as some pointless differences in the implementations
of task_cpu() and set_task_cpu()."
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmGAEPYWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJq4wEACItgLuyzPgB2eSLVMc3sHPIWcn
EUWbAWsuzJH79wmJtn2AKxW/C5OLBNGeoNjkXQvFN3ULkQDPrfCpB4x/tB6CjIQI
WRDf8kO7oaAD85ZrbSwyFl/MFfrD67f6H1HZoB9FKWAzuv/Bp2xQ0Kf06Dv4HEZp
CzprzZuWtjHB+qgyy+EpGOge3zbFmCuYPE2QpMYLWgs1rcVW9OYvoCI6AYtNefrC
6Kl6CbmBb1k6lFxkhM7wvRcIJthBl6Bajpc3Z2uL1aLb27dVpQZs3YpY859Knb6U
ZpOQCRJOMui3HOxyF3bDUI37y0XVLm6xaNM6C/7i0XS1GiFlSxkGVamg+Mp7anpI
+hdK5kqtSagaBC9CaJvRHnWIex1npQAfiyDNdyiEbrsUJ1dp6/zZcQSe4/m/XRbi
vywQPGxU9f1ASshzHsGU2TJf7Ps7qHulUsS5fKwmHU2ZjQnbYCoPN10JGO9gKjOX
yioN5xsKnbPY9j0ys3l9XBqaMJ8KAr1XspplTGIMZIVbjNMlqrfgbg8Qn8T8WGM7
oUqudMIxczilj0/iEGfGRxBeFaYAfhGQCDnxNlNX9g7Xe/gHTJgNYlHVxL55jHNu
AoPE3Gd0X8K9fbov0BCB6a21XwGJ6Wj+FSrnvuyWrRuy8JWiDFJaVKUBEcalKr7a
MhoUNQPu5M83OdC42A==
=PzvV
-----END PGP SIGNATURE-----
Merge tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull thread_info update to move 'cpu' back from task_struct from Kees Cook:
"Cross-architecture update to move task_struct::cpu back into
thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch
maintainers.
Quoting Ard Biesheuvel:
'Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that
define raw_smp_processor_id() in terms of this field, as it
requires linux/sched.h to be included, which causes a lot of pain
in terms of circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from)
for all architectures that enable THREAD_INFO_IN_TASK, addressing
the header soup issue as well as some pointless differences in the
implementations of task_cpu() and set_task_cpu()'"
* tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
riscv: rely on core code to keep thread_info::cpu updated
powerpc: smp: remove hack to obtain offset of task_struct::cpu
sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y
powerpc: add CPU field to struct thread_info
s390: add CPU field to struct thread_info
x86: add CPU field to struct thread_info
arm64: add CPU field to struct thread_info
- Revert the printk format based wchan() symbol resolution as it can leak
the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset and
__sched_setscheduler() introduced a new lock dependency which is now
triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
/1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
d2qWG7UvoseT+g==
=fgtS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
Now that force_fatal_sig exists it is unnecessary and a bit confusing
to use force_sigsegv in cases where the simpler force_fatal_sig is
wanted. So change every instance we can to make the code clearer.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Link: https://lkml.kernel.org/r/877de7jrev.fsf@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
A e5500 machine running a 32-bit kernel sometimes hangs at boot,
seemingly going into an infinite loop of instruction storage interrupts.
The ESR (Exception Syndrome Register) has a value of 0x800000 (store)
when this happens, which is likely set by a previous store. An
instruction TLB miss interrupt would then leave ESR unchanged, and if no
PTE exists it calls directly to the instruction storage interrupt
handler without changing ESR.
access_error() does not cause a segfault due to a store to a read-only
vma because is_exec is true. Most subsequent fault handling does not
check for a write fault on a read-only vma, and might do strange things
like create a writeable PTE or call page_mkwrite on a read only vma or
file. It's not clear what happens here to cause the infinite faulting in
this case, a fault handler failure or low level PTE or TLB handling.
In any case this can be fixed by having the instruction storage
interrupt zero regs->dsisr rather than storing the ESR value to it.
Fixes: a01a3f2ddb ("powerpc: remove arguments from fault handler functions")
Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jacques de Laval <jacques.delaval@protonmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Jacques de Laval <jacques.delaval@protonmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211028133043.4159501-1-npiggin@gmail.com
As the documentation explained, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly, however currently this work is done
outside of the function, which could be missing by mistake.
And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption disabled, we
can just merge the logical.
This patch will make sure the preemption has been disabled when
trace_test_and_set_recursion() return bit >= 0, and
trace_clear_recursion() will enable the preemption if previously
enabled.
Link: https://lkml.kernel.org/r/13bde807-779c-aa4c-0672-20515ae365ea@linux.alibaba.com
CC: Petr Mladek <pmladek@suse.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
[ Removed extra line in comment - SDR ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The check_return_regs_valid() can cause a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit in the bottom of regs->trap causes interrupt type
matching to fail.
This can occcur for example on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.
Fix it by using the TRAP() accessor which masks out the low bit.
Fixes: 6eaaf9de35 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing")
Cc: stable@vger.kernel.org # v5.14+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com
This reverts commit 566af8cda3.
This caused some conflicts vs the audit tree, and the audit maintainers
would prefer we postpone this to the next merge window so we have more
time for testing.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the register state may be partial and corrupted instead of calling
do_exit, call force_sigsegv(SIGSEGV). Which properly kills the
process with SIGSEGV and does not let any more userspace code execute,
instead of just killing one thread of the process and potentially
confusing everything.
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.")
Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt")
Link: https://lkml.kernel.org/r/20211020174406.17889-7-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
dcbz instruction shouldn't be used on non-cached memory. Using
it on non-cached memory can result in alignment exception and
implies a heavy handling.
Instead of silentely emulating the instruction and resulting in high
performance degradation, warn whenever an alignment exception is
taken in kernel mode due to dcbz, so that the user is made aware that
dcbz instruction has been used unexpectedly by the kernel.
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/2e3acfe63d289c6fba366e16973c9ab8369e8b75.1631803922.git.christophe.leroy@csgroup.eu
Add support for out-of-line static calls on PPC32. This change
improve performance of calls to global function pointers by
using direct calls instead of indirect calls.
The trampoline is initialy populated with a 'blr' or branch to target,
followed by an unreachable long jump sequence.
In order to cater with parallele execution, the trampoline needs to
be updated in a way that ensures it remains consistent at all time.
This means we can't use the traditional lis/addi to load r12 with
the target address, otherwise there would be a window during which
the first instruction contains the upper part of the new target
address while the second instruction still contains the lower part of
the old target address. To avoid that the target address is stored
just after the 'bctr' and loaded from there with a single instruction.
Then, depending on the target distance, arch_static_call_transform()
will either replace the first instruction by a direct 'bl <target>' or
'nop' in order to have the trampoline fall through the long jump
sequence.
For the special case of __static_call_return0(), to avoid the risk of
a far branch, a version of it is inlined at the end of the trampoline.
Performancewise the long jump sequence is probably not better than
the indirect calls set by GCC when we don't use static calls, but
such calls are unlikely to be required on powerpc32: With most
configurations the kernel size is far below 32 Mbytes so only
modules may happen to be too far. And even modules are likely to
be close enough as they are allocated below the kernel core and
as close as possible of the kernel text.
static_call selftest is running successfully with this change.
With this patch, __do_irq() has the following sequence to trace
irq entries:
c0004a00 <__SCT__tp_func_irq_entry>:
c0004a00: 48 00 00 e0 b c0004ae0 <__traceiter_irq_entry>
c0004a04: 3d 80 c0 00 lis r12,-16384
c0004a08: 81 8c 4a 1c lwz r12,18972(r12)
c0004a0c: 7d 89 03 a6 mtctr r12
c0004a10: 4e 80 04 20 bctr
c0004a14: 38 60 00 00 li r3,0
c0004a18: 4e 80 00 20 blr
c0004a1c: 00 00 00 00 .long 0x0
...
c0005654 <__do_irq>:
...
c0005664: 7c 7f 1b 78 mr r31,r3
...
c00056a0: 81 22 00 00 lwz r9,0(r2)
c00056a4: 39 29 00 01 addi r9,r9,1
c00056a8: 91 22 00 00 stw r9,0(r2)
c00056ac: 3d 20 c0 af lis r9,-16209
c00056b0: 81 29 74 cc lwz r9,29900(r9)
c00056b4: 2c 09 00 00 cmpwi r9,0
c00056b8: 41 82 00 10 beq c00056c8 <__do_irq+0x74>
c00056bc: 80 69 00 04 lwz r3,4(r9)
c00056c0: 7f e4 fb 78 mr r4,r31
c00056c4: 4b ff f3 3d bl c0004a00 <__SCT__tp_func_irq_entry>
Before this patch, __do_irq() was doing the following to trace irq
entries:
c0005700 <__do_irq>:
...
c0005710: 7c 7e 1b 78 mr r30,r3
...
c000574c: 93 e1 00 0c stw r31,12(r1)
c0005750: 81 22 00 00 lwz r9,0(r2)
c0005754: 39 29 00 01 addi r9,r9,1
c0005758: 91 22 00 00 stw r9,0(r2)
c000575c: 3d 20 c0 af lis r9,-16209
c0005760: 83 e9 f4 cc lwz r31,-2868(r9)
c0005764: 2c 1f 00 00 cmpwi r31,0
c0005768: 41 82 00 24 beq c000578c <__do_irq+0x8c>
c000576c: 81 3f 00 00 lwz r9,0(r31)
c0005770: 80 7f 00 04 lwz r3,4(r31)
c0005774: 7d 29 03 a6 mtctr r9
c0005778: 7f c4 f3 78 mr r4,r30
c000577c: 4e 80 04 21 bctrl
c0005780: 85 3f 00 0c lwzu r9,12(r31)
c0005784: 2c 09 00 00 cmpwi r9,0
c0005788: 40 82 ff e4 bne c000576c <__do_irq+0x6c>
Behind the fact of now using a direct 'bl' instead of a
'load/mtctr/bctr' sequence, we can also see that we get one less
register on the stack.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
ppc_md.iommu_save() is not set anymore by any platform after
commit c40785ad30 ("powerpc/dart: Use a cachable DART").
So iommu_save() has become a nop and can be removed.
ppc_md.show_percpuinfo() is not set anymore by any platform after
commit 4350147a81 ("[PATCH] ppc64: SMU based macs cpufreq support").
Last users of ppc_md.rtc_read_val() and ppc_md.rtc_write_val() were
removed by commit 0f03a43b8f ("[POWERPC] Remove todc code from
ARCH=powerpc")
Last user of kgdb_map_scc() was removed by commit 17ce452f7e ("kgdb,
powerpc: arch specific powerpc kgdb support").
ppc.machine_kexec_prepare() has not been used since
commit 8ee3e0d696 ("powerpc: Remove the main legacy iSerie platform
code"). This allows the removal of machine_kexec_prepare() and the
rename of default_machine_kexec_prepare() into machine_kexec_prepare()
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
[mpe: Drop prototype for default_machine_kexec_prepare() as noted by dja]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/24d4ca0ada683c9436a5f812a7aeb0a1362afa2b.1630398606.git.christophe.leroy@csgroup.eu
Commit d75d68cfef ("powerpc: Clean up obsolete code relating to
decrementer and timebase") made generic_suspend_enable_irqs() and
generic_suspend_disable_irqs() static.
Fold them into their only caller.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c3f9ec9950394ef939014f7934268e6ee30ca04f.1630398566.git.christophe.leroy@csgroup.eu
Commit e65e1fc2d2 ("[PATCH] syscall class hookup for all normal
targets") added generic support for AUDIT but that didn't include
support for bi-arch like powerpc.
Commit 4b58841149 ("audit: Add generic compat syscall support")
added generic support for bi-arch.
Convert powerpc to that bi-arch generic audit support.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/a4b3951d1191d4183d92a07a6097566bde60d00a.1629812058.git.christophe.leroy@csgroup.eu
Fix a bug exposed by a previous fix, where running guests with certain SMT topologies
could crash the host on Power8.
Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is enabled.
Thanks to: Nathan Lynch, Srikar Dronamraju, Valentin Schneider.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmFxTaUTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgFj/D/9J/rz1zYKxvPd+o5WsJBWrgsVBxgGr
cIhtKARlh0Yoa9fhIYF6ir4HUT2TOhokwJJco7fcND73rfrHfbWshWEzwaR1Lz/2
WXr0Np509GQw1KCC9nv5DoE5qqLLL+rS+ul75Khkoi99wa/TDD8kQ5o3zUWtDoad
D6paUIOf+DjnzzD7hIuTljfuOL/P3S4OyarQEopD69n8IgCTCrSsKdzOzAvlKcuL
c3OoGAY5gasHowfxLfOigkCYloT4gJES7Kxl92QnPByOJUfqhjfKxiN4l0ydR6Bl
4dUFUp/PkvvmTZiIA3tFzu0fHOgjUCvICyEAewLX7Bh9fXAVYMRc1a2l4LrCWD8T
hQqQFskxa/viG7Uq2N58Rts9GtF5nG5d6+4l5Ndxus2+tUledjiuzOv3nsSq64TF
y+Gws66a+A8a7DjcHo281BWzWn3XCIuug4A0doyvJFNdiy/+zj3ojfUhppZam8ca
DYo7ot71l0XGvFM0Yhr/NQ9MEdpgT19VC0xU2pafnkp6yGUqjH1b9xelvg+uMZEc
vXb7h+XV0eY4Rdr+3wTnXIwIWnEslJ23YQxFvhZeWeF321wsR5kbBGvNf/aY6xQB
rLkNX63RdpyYjHGcn3qwcYTkk8wAjCR719tFal8hRMLw+CCjzqcinb7P5FLi4hQv
cUV/vDffiwyReg==
=572c
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a bug exposed by a previous fix, where running guests with
certain SMT topologies could crash the host on Power8.
- Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is
enabled.
Thanks to Nathan Lynch, Srikar Dronamraju, and Valentin Schneider.
* tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/smp: do not decrement idle task preempt count in CPU offline
powerpc/idle: Don't corrupt back chain when going idle
Replace open coded parsing of CPU nodes' 'reg' property with
of_get_cpu_hwid().
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211006164332.1981454-8-robh@kernel.org
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
get:
BUG: scheduling while atomic: swapper/1/0/0x00000000
no locks held by swapper/1/0.
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
Call Trace:
dump_stack_lvl+0xac/0x108
__schedule_bug+0xac/0xe0
__schedule+0xcf8/0x10d0
schedule_idle+0x3c/0x70
do_idle+0x2d8/0x4a0
cpu_startup_entry+0x38/0x40
start_secondary+0x2ec/0x3a0
start_secondary_prolog+0x10/0x14
This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
preempt count, for reasons explained in commit a7c2bb8279 ("powerpc:
Re-enable preemption before cpu_die()"), specifically "start_secondary()
expects a preempt_count() of 0."
However, since commit 2c669ef697 ("powerpc/preempt: Don't touch the idle
task's preempt_count during hotplug") and commit f1a0a376ca ("sched/core:
Initialize the idle task with preemption disabled"), that justification no
longer holds.
The idle task isn't supposed to re-enable preemption, so remove the
vestigial preempt_enable() from the CPU offline path.
Tested with pseries and powernv in qemu, and pseries on PowerVM.
Fixes: 2c669ef697 ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
In isa206_idle_insn_mayloss() we store various registers into the stack
red zone, which is allowed.
However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
to 0(r1), which corrupts the stack back chain.
We used to do the same in isa206_idle_insn_mayloss() itself, but we
fixed that in 73287caa92 ("powerpc64/idle: Fix SP offsets when saving
GPRs"), however we missed that the macro also corrupts the back chain.
Corrupting the back chain is bad for debuggability but doesn't
necessarily cause a bug.
However we recently changed the stack handling in some KVM code, and it
now relies on the stack back chain being valid when it returns. The
corruption causes that code to return with r1 pointing somewhere in
kernel data, at some point LR is restored from the stack and we branch
to NULL or somewhere else invalid.
Only affects Power8 hosts running KVM guests, with dynamic_mt_modes
enabled (which it is by default).
The fixes tag below points to the commit that changed the KVM stack
handling, exposing this bug. The actual corruption of the back chain has
always existed since 948cf67c47 ("powerpc: Add NAP mode support on
Power7 in HV mode").
Fixes: 9b4416c509 ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.
Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.
Replace pdev->driver with to_pci_driver(). This is a step toward removing
pci_dev->driver.
[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Having a stable wchan means the process must be blocked and for it to
stay that way while performing stack unwinding.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
.opd section contains function descriptors used to locate
functions in the kernel. If someone is able to modify a
function descriptor he will be able to run arbitrary
kernel function instead of another.
To avoid that, move .opd section inside read-only memory.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/3cd40b682fb6f75bb40947b55ca0bac20cb3f995.1634136222.git.christophe.leroy@csgroup.eu
We fix the following warnings when building kernel with W=1:
arch/powerpc/kernel/eeh.c:598: warning: Function parameter or member 'function' not described in 'eeh_pci_enable'
arch/powerpc/kernel/eeh.c:774: warning: Function parameter or member 'edev' not described in 'eeh_set_dev_freset'
arch/powerpc/kernel/eeh.c:774: warning: expecting prototype for eeh_set_pe_freset(). Prototype was for eeh_set_dev_freset() instead
arch/powerpc/kernel/eeh.c:814: warning: Function parameter or member 'include_passed' not described in 'eeh_pe_reset_full'
arch/powerpc/kernel/eeh.c:944: warning: Function parameter or member 'ops' not described in 'eeh_init'
arch/powerpc/kernel/eeh.c:1451: warning: Function parameter or member 'include_passed' not described in 'eeh_pe_reset'
arch/powerpc/kernel/eeh.c:1526: warning: Function parameter or member 'func' not described in 'eeh_pe_inject_err'
arch/powerpc/kernel/eeh.c:1526: warning: Excess function parameter 'function' described in 'eeh_pe_inject_err'
Signed-off-by: Kai Song <songkai01@inspur.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211009041630.4135-1-songkai01@inspur.com
Replace pdev->driver->name by dev_driver_string() for the corresponding
struct device. This is a step toward removing pci_dev->driver.
Move the function nearer its only user and instead of the ?: operator use a
normal "if" which is more readable.
Link: https://lore.kernel.org/r/20211004125935.2300113-6-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Fix a regression hit by the IPR SCSI driver, introduced by the recent addition of MSI
domains on pseries.
A big series including 8 BPF fixes, some with potential security impact and the rest
various code generation issues.
Fix our program check assembler entry path, which was accidentally jumping into a gas
macro and generating strange stack frames, which could confuse find_bug().
A couple of fixes, and related changes, to fix corner cases in our machine check handling.
Fix our DMA IOMMU ops, which were not always returning the optimal DMA mask, leading to
at least one device falling back to 32-bit DMA when it shouldn't.
A fix for KUAP handling on 32-bit Book3S.
Fix crashes seen when kdumping on some pseries systems.
Thanks to: Naveen N. Rao, Nicholas Piggin, Alexey Kardashevskiy, Cédric Le Goater,
Christophe Leroy, Mahesh Salgaonkar, Abdul Haleem, Christoph Hellwig, Johan Almbladh, Stan
Johnson.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmFi044THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgGtXD/0a8dy70W8lXq3k1Zz33vzp/NSnq5Cj
d1lTI6hgbow5KbMgoKi3f2RC1S1qd0Kenz4zDD+rljRg8xrUCswUB+al2mrkeaBm
9s82lWwemmdB8LUbWAV/SscaJ6Pc5ODmjwGpsB36gdBgaYnmedQky4knh1m0NNCS
nyY6ugQIP6GnAZ2mojKnWx84Gj8hIYs5rEqAeDhHGNvLKSt2yIQv5LzqGM423j2O
44nbMmEt8Jq6oJCSeMRf3j5rgc6RTkwnL/543faRe5BBNbuMU01oIhDrGALXY8Ip
gOjyRsQYy1s+dU1rypaP/rlE3hRAUGmShCYQXA6kcESgwZc+BMSSShM/FhvVx2C7
UnFiFHyWnrP3PrhDGW+WGhrUN2kyoZZ7V6x5bggW93pyVCmA9RW62J8HU8DwzucU
IBO7TBNoHsscKEw8dSwG6jBOfx30bEwD2tl2PjNaDFIk7WndMOT4MbjPTDlXj38u
PbLPlY1xEghUafAdLwWL46wQMc0pLHyAaiM/Wl8nUaES73tdtMTLtg7QH/+uBc+a
iS1ji1U9VOpRjtBOjQHmPtRiit9aL1UJ0PDMrJWQpKxALv139jdlhjdzexnPgNAe
8l/9bvTsCTn5j5C3m1XaUSSGM2AyawMEC0+9bJWMKuD0F8JmZfUYMSjSk0CWJFRG
e+xG8TGXClfTxg==
=q+1t
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A bit of a big batch, partly because I didn't send any last week, and
also just because the BPF fixes happened to land this week.
Summary:
- Fix a regression hit by the IPR SCSI driver, introduced by the
recent addition of MSI domains on pseries.
- A big series including 8 BPF fixes, some with potential security
impact and the rest various code generation issues.
- Fix our program check assembler entry path, which was accidentally
jumping into a gas macro and generating strange stack frames, which
could confuse find_bug().
- A couple of fixes, and related changes, to fix corner cases in our
machine check handling.
- Fix our DMA IOMMU ops, which were not always returning the optimal
DMA mask, leading to at least one device falling back to 32-bit DMA
when it shouldn't.
- A fix for KUAP handling on 32-bit Book3S.
- Fix crashes seen when kdumping on some pseries systems.
Thanks to Naveen N. Rao, Nicholas Piggin, Alexey Kardashevskiy, Cédric
Le Goater, Christophe Leroy, Mahesh Salgaonkar, Abdul Haleem,
Christoph Hellwig, Johan Almbladh, Stan Johnson"
* tag 'powerpc-5.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
pseries/eeh: Fix the kdump kernel crash during eeh_pseries_init
powerpc/32s: Fix kuap_kernel_restore()
powerpc/pseries/msi: Add an empty irq_write_msi_msg() handler
powerpc/64s: Fix unrecoverable MCE calling async handler from NMI
powerpc/64/interrupt: Reconcile soft-mask state in NMI and fix false BUG
powerpc/64: warn if local irqs are enabled in NMI or hardirq context
powerpc/traps: do not enable irqs in _exception
powerpc/64s: fix program check interrupt emergency stack path
powerpc/bpf ppc32: Fix BPF_SUB when imm == 0x80000000
powerpc/bpf ppc32: Do not emit zero extend instruction for 64-bit BPF_END
powerpc/bpf ppc32: Fix JMP32_JSET_K
powerpc/bpf ppc32: Fix ALU32 BPF_ARSH operation
powerpc/bpf: Emit stf barrier instruction sequences for BPF_NOSPEC
powerpc/security: Add a helper to query stf_barrier type
powerpc/bpf: Fix BPF_SUB when imm == 0x80000000
powerpc/bpf: Fix BPF_MOD when imm == 1
powerpc/bpf: Validate branch ranges
powerpc/lib: Add helper to check if offset is within conditional branch range
powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
If, due to bugs elsewhere, we get into unregister_cpu_online() with a CPU
that isn't marked hotpluggable, we can emit a warning and return an
appropriate error instead of crashing.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210927201933.76786-3-nathanl@linux.ibm.com
When check_kvm_guest() succeeds in looking up a /hypervisor OF node, it
returns without performing a matching put for the lookup, leaving the
node's reference count elevated.
Add the necessary call to of_node_put(), rearranging the code slightly to
avoid repetition or goto.
Fixes: 107c55005f ("powerpc/pseries: Add KVM guest doorbell restrictions")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210928124550.132020-1-nathanl@linux.ibm.com
The machine check handler is not considered NMI on 64s. The early
handler is the true NMI handler, and then it schedules the
machine_check_exception handler to run when interrupts are enabled.
This works fine except the case of an unrecoverable MCE, where the true
NMI is taken when MSR[RI] is clear, it can not recover, so it calls
machine_check_exception directly so something might be done about it.
Calling an async handler from NMI context can result in irq state and
other things getting corrupted. This can also trigger the BUG at
arch/powerpc/include/asm/interrupt.h:168
BUG_ON(!arch_irq_disabled_regs(regs) && !(regs->msr & MSR_EE));
Fix this by making an _async version of the handler which is called
in the normal case, and a NMI version that is called for unrecoverable
interrupts.
Fixes: 2b43dd7653 ("powerpc/64: enable MSR[EE] in irq replay pt_regs")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-6-npiggin@gmail.com
This can help catch bugs such as the one fixed by the previous change
to prevent _exception() from enabling irqs.
ppc32 could have a similar warning but it has no good config option to
debug this stuff (the test may be overkill to add for production
kernels).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-4-npiggin@gmail.com
_exception can be called by machine check handlers when the MCE hits
user code (e.g., pseries and powernv). This will enable local irqs
because, which is a dicey thing to do in NMI or hard irq context.
This seemed to worked out okay because a userspace MCE can basically be
treated like a synchronous interrupt (after async / imprecise MCEs are
filtered out). Since NMI and hard irq handlers have started growing
nmi_enter / irq_enter, and more irq state sanity checks, this has
started to cause problems (or at least trigger warnings).
The Fixes tag to the commit which introduced this rather than try to
work out exactly which commit was the first that could possibly cause a
problem because that may be difficult to prove.
Fixes: 9f2f79e3a3 ("powerpc: Disable interrupts in 64-bit kernel FP and vector faults")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20211004145642.1331214-3-npiggin@gmail.com
Replace audit syscall class magic numbers with macros.
This required putting the macros into new header file
include/linux/audit_arch.h since the syscall macros were
included for both 64 bit and 32 bit in any compat code, causing
redefinition warnings.
Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: renamed header to audit_arch.h after consulting with Richard]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Since now there is kretprobe_trampoline_addr() for referring the
address of kretprobe trampoline code, we don't need to access
kretprobe_trampoline directly.
Make it harder to refer by renaming it to __kretprobe_trampoline().
Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:
dereference_kernel_function_descriptor(kretprobe_trampoline)
In fact we had bugs in arch code where it wasn't replicated correctly.
So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.
Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
~15 years ago kprobes grew the 'arch_deref_entry_point()' __weak function:
3d7e33825d: ("jprobes: make jprobes a little safer for users")
But this is just open-coded dereference_symbol_descriptor() in essence, and
its obscure nature was causing bugs.
Just use the real thing and remove arch_deref_entry_point().
Link: https://lkml.kernel.org/r/163163043630.489837.7924988885652708696.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since get_optimized_kprobe() is only used inside kprobes,
it doesn't need to use 'unsigned long' type for 'addr' parameter.
Make it use 'kprobe_opcode_t *' for the 'addr' parameter and
subsequent call of arch_within_optimized_kprobe() also should use
'kprobe_opcode_t *'.
Note that MAX_OPTIMIZED_LENGTH and RELATIVEJUMP_SIZE are defined
by byte-size, but the size of 'kprobe_opcode_t' depends on the
architecture. Therefore, we must be careful when calculating
addresses using those macros.
Link: https://lkml.kernel.org/r/163163040680.489837.12133032364499833736.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of relying on awful hacks to obtain the offset of the cpu field
in struct task_struct, move it back into struct thread_info, which does
not create the same level of circular dependency hell when trying to
include the header file that defines it.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this
causes some issues on architectures that define raw_smp_processor_id()
in terms of this field, due to the fact that #include'ing linux/sched.h
to get at struct task_struct is problematic in terms of circular
dependencies.
Given that thread_info and task_struct are the same data structure
anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having
access to the type definition of struct thread_info is sufficient to
reference the CPU number of the current task.
Note that this requires THREAD_INFO_IN_TASK's definition of the
task_thread_info() helper to be updated, as task_cpu() takes a
pointer-to-const, whereas task_thread_info() (which is used to generate
lvalues as well), needs a non-const pointer. So make it a macro instead.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
According to dma-api.rst, the dma_get_required_mask() helper should return
"the mask that the platform requires to operate efficiently". Which in
the case of PPC64 means the bypass mask and not a mask from an IOMMU table
which is shorter and slower to use due to map/unmap operations (especially
expensive on "pseries").
However the existing implementation ignores the possibility of bypassing
and returns the IOMMU table mask on the pseries platform which makes some
drivers (mpt3sas is one example) choose 32bit DMA even though bypass is
supported. The powernv platform sort of handles it by having a bigger
default window with a mask >=40 but it only works as drivers choose
63/64bit if the required mask is >32 which is rather pointless.
This reintroduces the bypass capability check to let drivers make
a better choice of the DMA mask.
Fixes: f1565c24b5 ("powerpc: use the generic dma_ops_bypass mode")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210930034454.95794-1-aik@ozlabs.ru
Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
that the two function are always called back-to-back by architectures
that have rseq. The rseq helper is stubbed out for architectures that
don't support rseq, i.e. this is a nop across the board.
Note, tracehook_notify_resume() is horribly named and arguably does not
belong in tracehook.h as literally every line of code in it has nothing
to do with tracing. But, that's been true since commit a42c6ded82
("move key_repace_session_keyring() into tracehook_notify_resume()")
first usurped tracehook_notify_resume() back in 2012. Punt cleaning that
mess up to future patches.
No functional change intended.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210901203030.1292304-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The general convention for pcibios_* hooks is that they're named after the
corresponding pci_* function they provide a hook for. The exception is
pcibios_add_device() which provides a hook for pci_device_add().
Rename pcibios_add_device() to pcibios_device_add() so it matches
pci_device_add().
Also, remove the export of the microblaze version. The only caller must be
compiled as a built-in so there's no reason for the export.
Link: https://lore.kernel.org/r/20210913152709.48013-1-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Niklas Schnelle <schnelle@linux.ibm.com> # s390
We queue an irq work for deferred processing of mce event in realmode
mce handler, where translation is disabled. Queuing of the work may
result in accessing memory outside RMO region, such access needs the
translation to be enabled for an LPAR running with hash mmu else the
kernel crashes.
After enabling translation in mce_handle_error() we used to leave it
enabled to avoid crashing here, but now with the commit
74c3354bc1 ("powerpc/pseries/mce: restore msr before returning from
handler") we are restoring the MSR to disable translation.
Hence to fix this enable the translation before queuing the work.
Without this change following trace is seen on injecting SLB multihit in
an LPAR running with hash mmu.
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
CPU: 5 PID: 1883 Comm: insmod Tainted: G OE 5.14.0-mce+ #137
NIP: c000000000735d60 LR: c000000000318640 CTR: 0000000000000000
REGS: c00000001ebff9a0 TRAP: 0300 Tainted: G OE (5.14.0-mce+)
MSR: 8000000000001003 <SF,ME,RI,LE> CR: 28008228 XER: 00000001
CFAR: c00000000031863c DAR: c00000027fa8fe08 DSISR: 40000000 IRQMASK: 0
...
NIP llist_add_batch+0x0/0x40
LR __irq_work_queue_local+0x70/0xc0
Call Trace:
0xc00000001ebffc0c (unreliable)
irq_work_queue+0x40/0x70
machine_check_queue_event+0xbc/0xd0
machine_check_early_common+0x16c/0x1f4
Fixes: 74c3354bc1 ("powerpc/pseries/mce: restore msr before returning from handler")
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
[mpe: Fix comment formatting, trim oops in change log for readability]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210909064330.312432-1-ganeshgr@linux.ibm.com
The rfscv instruction does not work correctly with the fake-suspend mode
in POWER9, which can end up with the hypervisor restoring an incorrect
checkpoint.
Work around this by setting the _TIF_RESTOREALL flag if a system call
returns to a transaction active state, causing rfid to be used instead
of rfscv to return, which will do the right thing. The contents of the
registers are irrelevant because they will be overwritten in this case
anyway.
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Reported-by: Eirik Fuller <efuller@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210908101718.118522-1-npiggin@gmail.com
If a system call is made with a transaction active, the kernel
immediately aborts it and returns. scv system calls disable irqs even
earlier in their interrupt handler, and tabort_syscall does not fix this
up.
This can result in irq soft-mask state being messed up on the next
kernel entry, and crashing at BUG_ON(arch_irq_disabled_regs(regs)) in
the kernel exit handlers, or possibly worse.
This can't easily be fixed in asm because at this point an async irq may
have hit, which is soft-masked and marked pending. The pending interrupt
has to be replayed before returning to userspace. The fix is to move the
tabort_syscall code to C in the main syscall handler, and just skip the
system call but otherwise return as usual, which will take care of the
pending irqs. This also does a bunch of other things including possible
signal delivery to the process, but the doomed transaction should still
be aborted when it is eventually returned to.
The sc system call path is changed to use the new C function as well to
reduce code and path differences. This slows down how quickly system
calls are aborted when called while a transaction is active, which could
potentially impact TM performance. But making any system call is already
bad for performance, and TM is on the way out, so go with simpler over
faster.
Fixes: 7fa95f9ada ("powerpc/64s: system call support for scv/rfscv instructions")
Reported-by: Eirik Fuller <efuller@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Use #ifdef rather than IS_ENABLED() to fix build error on 32-bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210903125707.1601269-1-npiggin@gmail.com
These are all handled correctly when calling the native system call entry
point, so remove the special cases.
Link: https://lkml.kernel.org/r/20210727144859.4150043-6-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
There are some empty trap_init() definitions in different ARCHs, Introduce
a new weak trap_init() function to clean them up.
Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm32]
Acked-by: Vineet Gupta [arc]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <palmerdabbelt@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmExXHoVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGAZwP/iHdEZzuQ4cz2uXUaV0fevj9jjPU
zJ8wrrNabAiT6f5x861DsARQSR4OSt3zN0tyBNgZwUdotbe7ED5GegrgIUBMWlML
QskhTEIZj7TexAX/20vx671gtzI3JzFg4c9BuriXCFRBvychSevdJPr65gMDOesL
vOJnXe+SGXG2+fPWi/PxrcOItNRcveqo2GiWHT3g0Cv/DJUulu81gEkz3hrufnMR
cjMeSkV0nJJcvI755OQBOUnEuigW64k4m2WxHPG24tU8cQOCqV6lqwOfNQBAn4+F
OoaCMyPQT9gvGYwGExQMCXGg0wbUt1qnxzOVoA2qFCwbo+MFhqjBvPXab6VJm7CE
mY3RrTtvxSqBdHI6EGcYeLjhycK9b+LLoJ1qc3S9FK8It6NoFFp4XV0R6ItPBls7
mWi9VSpyI6k0AwLq+bGXEHvaX/bnnf/vfqn8H+w6mRZdXjFV8EB2DiOSRX/OqjVG
RnvTtXzWWThLyXvWR3Jox4+7X6728oL7akLemoeZI6oTbJDm7dQgwpz5HbSyHXLh
d+gUF3Y/6lqxT5N9GSVDxpD1bEMh2I7nGQ4M7WGbGas/3yUemF8wbBqGQo4a+YeD
d9vGAUxDp2PQTtL2sjFo5Gd4PZEM9g7vwWzRvHe0o5NxKEXcBg25b8cD1hxrN9Y4
Y1AAnc0kLO+My3PC
=lw3M
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...
- Convert pseries & powernv to use MSI IRQ domains.
- Rework the pseries CPU numbering so that CPUs that are removed, and later re-added, are
given a CPU number on the same node as previously, when possible.
- Add support for a new more flexible device-tree format for specifying NUMA distances.
- Convert powerpc to GENERIC_PTDUMP.
- Retire sbc8548 and sbc8641d board support.
- Various other small features and fixes.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard, Cédric Le Goater,
Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas, Fangrui Song, Finn Thain, Gautham R.
Shenoy, Hari Bathini, Joel Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo
Bras, Lukas Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan Chancellor,
Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R. Sampat, Randy Dunlap, Sebastian
Andrzej Siewior, Srikar Dronamraju, Wan Jiabing, Xiongwei Song, Zheng Yongjun.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmEyHTYTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDo3D/9aXMVP2wsEMNB0XhTiJ1UUdi311Uq9
PvkAaGZH14ZqZLVigeiD3gt6YzTH0cEuGj6qgwsJrPDjF8FESnMbBsprMLr5/qE1
itWRGMAMCFaeTcB9ogYVJkzwg6RN2ZgIqoq4NVswNSXoAQGWb+1bvXq3RnXXNuGR
TQmLL02poNC6nX0YbRaQoT1Xx4nfUTiKHhU+Aok9uOCMJIyYZVATR6Qafb7/j7tO
UvjwOHztbu84lcJOGmSnw4LcmwNORLuP9IwR0r+O1M3ijEZqDo9TPkvtSz8HZwjU
mxdJwhrUmN0euMcghuiFxW+1XG2eM49ugsdJugiezG2RaIijbIp0nAIvdeaKAgT1
OSSwvWCQ0fkTPyLXE+O6tVqMhlUMdqQlRcyNwmN9svIip9VnwGNq3vA4ePlJm6Fi
i0i/tLqVNlJwFokZ7blW5g8SRgGRuFfXd5XUYLFvy5Teez+/7b1mW95gPQZSJ8kV
Tbx2e0nHAPX4hCAxJ1AB3/zTlnjY+4+WJ9bD5XdgXkeVE8PPh1BEkulhMi1R1OMj
57D1W6OgsBu/Pze78wjAvwO8+NAb1T/2mv2Bd/LY6Q+7hNDqOOhuajyBTxbH41FG
sqx5bKjKOwgTybfV9A0Eo0e4FQBX07yXltBFHaPlyA4sOsIhM59+PxNrEwN1eZrQ
LVVsdBXg8pHxrw==
=EbN0
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Convert pseries & powernv to use MSI IRQ domains.
- Rework the pseries CPU numbering so that CPUs that are removed, and
later re-added, are given a CPU number on the same node as
previously, when possible.
- Add support for a new more flexible device-tree format for specifying
NUMA distances.
- Convert powerpc to GENERIC_PTDUMP.
- Retire sbc8548 and sbc8641d board support.
- Various other small features and fixes.
Thanks to Alexey Kardashevskiy, Aneesh Kumar K.V, Anton Blanchard,
Cédric Le Goater, Christophe Leroy, Emmanuel Gil Peyrot, Fabiano Rosas,
Fangrui Song, Finn Thain, Gautham R. Shenoy, Hari Bathini, Joel
Stanley, Jordan Niethe, Kajol Jain, Laurent Dufour, Leonardo Bras, Lukas
Bulwahn, Marc Zyngier, Masahiro Yamada, Michal Suchanek, Nathan
Chancellor, Nicholas Piggin, Parth Shah, Paul Gortmaker, Pratik R.
Sampat, Randy Dunlap, Sebastian Andrzej Siewior, Srikar Dronamraju, Wan
Jiabing, Xiongwei Song, and Zheng Yongjun.
* tag 'powerpc-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (154 commits)
powerpc/bug: Cast to unsigned long before passing to inline asm
powerpc/ptdump: Fix generic ptdump for 64-bit
KVM: PPC: Fix clearing never mapped TCEs in realmode
powerpc/pseries/iommu: Rename "direct window" to "dma window"
powerpc/pseries/iommu: Make use of DDW for indirect mapping
powerpc/pseries/iommu: Find existing DDW with given property name
powerpc/pseries/iommu: Update remove_dma_window() to accept property name
powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
powerpc/pseries/iommu: Allow DDW windows starting at 0x00
powerpc/pseries/iommu: Add ddw_list_new_entry() helper
powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
powerpc/kernel/iommu: Add new iommu_table_in_use() helper
powerpc/pseries/iommu: Replace hard-coded page shift
powerpc/numa: Update cpu_cpu_map on CPU online/offline
powerpc/numa: Print debug statements only when required
powerpc/numa: convert printk to pr_xxx
powerpc/numa: Drop dbg in favour of pr_debug
powerpc/smp: Enable CACHE domain for shared processor
powerpc/smp: Update cpu_core_map on all PowerPc systems
...
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge our fixes branch into next.
That lets us resolve a conflict in arch/powerpc/sysdev/xive/common.c.
Between cbc06f051c ("powerpc/xive: Do not skip CPU-less nodes when
creating the IPIs"), which moved request_irq() out of xive_init_ipis(),
and 17df41fec5 ("powerpc: use IRQF_NO_DEBUG for IPIs") which added
IRQF_NO_DEBUG to that request_irq() call, which has now moved.
- fix debugfs initialization order (Anthony Iliopoulos)
- use memory_intersects() directly (Kefeng Wang)
- allow to return specific errors from ->map_sg
(Logan Gunthorpe, Martin Oliveira)
- turn the dma_map_sg return value into an unsigned int (me)
- provide a common global coherent pool іmplementation (me)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmEvY+8LHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPaehAAsgnBzzzoLHO83pgs0KL92c+0DiMNHYmaMCJOvZXk
x2Irv+O74WikRJc4S7uQ26p2spjmUxjmiOjld+8+NN0liD4QO9BQ/SZpIp8emuKS
/yPG6Xh86xSl/OrPL1y7kGeHkRi5sm3mRhcTdILFQFPLcSReupe++GRfnvrpbOPk
tj3pBGXluD6iJH12BBt00ushUVzZ0F2xaF6xUDAs94RSZ3tlqsfx6c928Y1KxSZh
f89q/KuaokyogFG7Ujj/nYgIUETaIs2W6UmxBfRzdEMJFSffwomUMbw+M+qGJ7/d
2UjamFYRX16FReE8WNsndbX1E6k5JBW12E1qwV3dUwatlNLWEaRq3PNiWkF7zcFH
LDkpDYN6s5bIDPTfDp21XfPygoH8KQhnD9lVf0aB7n04uu8VJrGB9+10PpkCJVXD
0b2dcuSwCO7hAfTfNGVV8f3EI/1XPflr1hJvMgcVtY53CR96ldp+4QaElzWLXumN
MyptirmrVITNVyVwGzhGAblXBLWdarXD0EXudyiaF4Xbrj3AkIOSUCghEwKLpjQf
UwMFFwSE8yGxKTRK4HfU5gMzy6G751fU7TUe5lmxZLovDflQoSXMWgHE8e7r0Qel
o5v6lmUzoWz2fAISf3xjauo2ncgmfWMwYM6C7OJy5nG73QXLQId9J+ReXbmrgrrN
DgI=
=spje
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- fix debugfs initialization order (Anthony Iliopoulos)
- use memory_intersects() directly (Kefeng Wang)
- allow to return specific errors from ->map_sg (Logan Gunthorpe,
Martin Oliveira)
- turn the dma_map_sg return value into an unsigned int (me)
- provide a common global coherent pool іmplementation (me)
* tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping: (31 commits)
hexagon: use the generic global coherent pool
dma-mapping: make the global coherent pool conditional
dma-mapping: add a dma_init_global_coherent helper
dma-mapping: simplify dma_init_coherent_memory
dma-mapping: allow using the global coherent pool for !ARM
ARM/nommu: use the generic dma-direct code for non-coherent devices
dma-direct: add support for dma_coherent_default_memory
dma-mapping: return an unsigned int from dma_map_sg{,_attrs}
dma-mapping: disallow .map_sg operations from returning zero on error
dma-mapping: return error code from dma_dummy_map_sg()
x86/amd_gart: don't set failed sg dma_address to DMA_MAPPING_ERROR
x86/amd_gart: return error code from gart_map_sg()
xen: swiotlb: return error code from xen_swiotlb_map_sg()
parisc: return error code from .map_sg() ops
sparc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
sparc/iommu: return error codes from .map_sg() ops
s390/pci: don't set failed sg dma_address to DMA_MAPPING_ERROR
s390/pci: return error code from s390_dma_map_sg()
powerpc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
powerpc/iommu: return error code from .map_sg() ops
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmEt+hwACgkQUqAMR0iA
lPLppBAAiyrUNVmqqtdww+IJajEs1uD/4FqPsysHRwroHBFymJeQG1XCwUpDZ7jj
6gXT0chxyjQE18gT/W9nf+PSmA9XvIVA1WSR+WCECTNW3YoZXqtgwiHfgnitXYku
HlmoZLthYeuoXWw2wn+hVLfTRh6VcPHYEaC21jXrs6B1pOXHbvjJ5eTLHlX9oCfL
UKSK+jFTHAJcn/GskRzviBe0Hpe8fqnkRol2XX13ltxqtQ73MjaGNu7imEH6/Pa7
/MHXWtuWJtOvuYz17aztQP4Qwh1xy+kakMy3aHucdlxRBTP4PTzzTuQI3L/RYi6l
+ttD7OHdRwqFAauBLY3bq3uJjYb5v/64ofd8DNnT2CJvtznY8wrPbTdFoSdPcL2Q
69/opRWHcUwbU/Gt4WLtyQf3Mk0vepgMbbVg1B5SSy55atRZaXMrA2QJ/JeawZTB
KK6D/mE7ccze/YFzsySunCUVKCm0veoNxEAcakCCZKXSbsvd1MYcIRC0e+2cv6e5
2NEH7gL4dD+5tqu5nzvIuKDn3NrDQpbi28iUBoFbkxRgcVyvHJ9AGSa62wtb5h3D
OgkqQMdVKBbjYNeUodPlQPzmXZDasytavyd0/BC/KENOcBvU/8gW++2UZTfsh/1A
dLjgwFBdyJncQcCS9Abn20/EKntbIMEX8NLa97XWkA3fuzMKtak=
=yEVq
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Optionally, provide an index of possible printk messages via
<debugfs>/printk/index/. It can be used when monitoring important
kernel messages on a farm of various hosts. The monitor has to be
updated when some messages has changed or are not longer available by
a newly deployed kernel.
- Add printk.console_no_auto_verbose boot parameter. It allows to
generate crash dump even with slow consoles in a reasonable time
frame.
- Remove printk_safe buffers. The messages are always stored directly
to the main logbuffer, even in NMI or recursive context. Also it
allows to serialize syslog operations by a mutex instead of a spin
lock.
- Misc clean up and build fixes.
* tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/index: Fix -Wunused-function warning
lib/nmi_backtrace: Serialize even messages about idle CPUs
printk: Add printk.console_no_auto_verbose boot parameter
printk: Remove console_silent()
lib/test_scanf: Handle n_bits == 0 in random tests
printk: syslog: close window between wait and read
printk: convert @syslog_lock to mutex
printk: remove NMI tracking
printk: remove safe buffers
printk: track/limit recursion
lib/nmi_backtrace: explicitly serialize banner and regs
printk: Move the printk() kerneldoc comment to its new home
printk/index: Fix warning about missing prototypes
MIPS/asm/printk: Fix build failure caused by printk
printk: index: Add indexing support to dev_printk
printk: Userspace format indexing support
printk: Rework parse_prefix into printk_parse_prefix
printk: Straighten out log_flags into printk_info_flags
string_helpers: Escape double quotes in escape_special
printk/console: Check consistent sequence number when handling race in console_unlock()
Pull exit cleanups from Eric Biederman:
"In preparation of doing something about PTRACE_EVENT_EXIT I have
started cleaning up various pieces of code related to do_exit. Most of
that code I did not manage to get tested and reviewed before the merge
window opened but a handful of very useful cleanups are ready to be
merged.
The first change is simply the removal of the bdflush system call. The
code has now been disabled long enough that even the oldest userspace
working userspace setups anyone can find to test are fine with the
bdflush system call being removed.
Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of
calling do_exit directly is interesting only in that it is nearly the
most difficult of the incorrect uses of do_exit to remove.
The change to the seccomp code to simply send a signal instead of
calling do_coredump directly is a very nice little cleanup made
possible by realizing the existing signal sending helpers were missing
a little bit of functionality that is easy to provide"
* 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal/seccomp: Dump core when there is only one live thread
signal/seccomp: Refactor seccomp signal and coredump generation
signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die
exit/bdflush: Remove the deprecated bdflush system call
Having a function to check if the iommu table has any allocation helps
deciding if a tbl can be reset for using a new DMA window.
It should be enough to replace all instances of !bitmap_empty(tbl...).
iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.
Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210817063929.38701-3-leobras.c@gmail.com
Currently CACHE domain is not enabled on shared processor mode PowerVM
LPARS. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under cpu-device-node indicates which all CPUs share L2-cache. However
'ibm,thread-group' device-tree property 2 is a relatively new property.
In absence of 'ibm,thread-group' property 2, 'l2-cache' device property
under cpu-device-node could help system to identify CPUs sharing L2-cache.
However this property is not exposed by PhyP in shared processor mode
configurations.
In absence of properties that inform OS about which CPUs share L2-cache,
fallback on core boundary.
Here are some stats from Power9 shared LPAR with the changes.
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 3
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
NUMA node0 CPU(s): 16-23
NUMA node1 CPU(s): 0-15,24-31
Physical sockets: 2
Physical chips: 1
Physical cores/chip: 10
Before patch
$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA
After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA
$ awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/00000000,//g'
Before
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 00ff0000
domain1 ff00ffff
domain2 ffffffff
After
domain0 00000055
domain0 000000aa
domain0 00005500
domain0 0000aa00
domain0 00550000
domain0 00aa0000
domain0 55000000
domain0 aa000000
domain1 000000ff
domain1 0000ff00
domain1 00ff0000
domain1 ff000000
domain2 ff00ffff
domain2 ffffffff
domain3 ffffffff
(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe | tail -n 2
Before
153.798 +- 0.142 seconds time elapsed ( +- 0.09% )
After
111.545 +- 0.652 seconds time elapsed ( +- 0.58% )
which is an improvement of 27.47%
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-4-srikar@linux.vnet.ibm.com
lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.
While optimizing the powerpc bootup path, Commit 4ca234a9cb
("powerpc/smp: Stop updating cpu_core_mask"). it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on Powerpc, cpu_core_mask() would always be same as
cpu_cpu_mask() i.e number of sockets will always be equal to number of
nodes. As an optimization, cpu_core_mask() was made a snapshot of
cpu_cpu_mask().
However that was found to be false with PowerPc KVM guests, where each
node could have more than one socket. So with Commit c47f892d7a
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id but in an optimized way using some mask manipulations
and chip_id caching.
However on non-PowerNV and non-pseries KVM guests (i.e not
implementing cpu_to_chip_id(), continued to use a copy of
cpu_cpu_mask().
There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 3 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
2. Currently cpu_cpu_mask is updated when a core is
added/removed. However its not updated when smt mode switching or on
CPUs are explicitly offlined. However all other percpu masks are
updated to ensure only active/online CPUs are in the masks.
This results in build_sched_domain traces since there will be CPUs in
cpu_cpu_mask() but those CPUs are not present in SMT / CACHE / MC /
NUMA domains. A loop of threads running smt mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be update at smt mode switch.
This will have impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots at
different points of time. This means when lscpu will start reporting
many more sockets than the actual number of sockets/ nodes / chips.
Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
cpu_cpu_mask is updated. This would a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
pointing to cpu_cpu_mask. However percpu cpumask pointer is frowned
upon and we need a clean way to handle PowerPc KVM guest which is
not a snapshot.
C. Update cpu_core_masks all PowerPc systems like in PowerPc KVM
guests using mask manipulations. This approach is relatively simple
and unifies with the existing code.
D. On top of 3, we could also resurrect get_physical_package_id which
could return a nid for the said CPU. However this is not needed at this
time.
Option C is the preferred approach for now.
While this is somewhat a revert of Commit 4ca234a9cb ("powerpc/smp:
Stop updating cpu_core_mask").
1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.
This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.
With the fix, lscpu o/p would be
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Thread(s) per core: 8
Core(s) per socket: 6
Socket(s): 2 <--------------
NUMA node(s): 2
Model: 2.2 (pvr 004e 0202)
Model name: POWER9 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-79
NUMA node1 CPU(s): 80-159
Fixes: 4ca234a9cb ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-3-srikar@linux.vnet.ibm.com
Aneesh reported a crash with a fairly recent upstream kernel when
booting kernel whose commandline was appended with nr_cpus=2
1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c000000008a67bd0]
pc: c00000000002557c: cpu_to_chip_id+0x3c/0x100
lr: c000000000058380: start_secondary+0x460/0xb00
sp: c000000008a67e70
msr: 8000000000001033
dar: 10
dsisr: 80000
current = 0xc00000000891bb00
paca = 0xc0000018ff981f80 irqmask: 0x03 irq_happened: 0x01
pid = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register ] c000000000058380 start_secondary+0x460/0xb00
[c000000008a67e70] c000000008a67eb0 (unreliable)
[c000000008a67eb0] c0000000000589d4 start_secondary+0xab4/0xb00
[c000000008a67f90] c00000000000c654 start_secondary_prolog+0x10/0x14
Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus() is not an
exact multiple of threads_per_core.
Fixes: c1e53367da ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Debugged-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210826100401.412519-2-srikar@linux.vnet.ibm.com
In those hot functions that are called at every interrupt, any saved
cycle is worth it.
interrupt_exit_user_prepare() and interrupt_exit_kernel_prepare() are
called from three places:
- From entry_32.S
- From interrupt_64.S
- From interrupt_exit_user_restart() and interrupt_exit_kernel_restart()
In entry_32.S, there are inambiguously called based on MSR_PR:
interrupt_return:
lwz r4,_MSR(r1)
addi r3,r1,STACK_FRAME_OVERHEAD
andi. r0,r4,MSR_PR
beq .Lkernel_interrupt_return
bl interrupt_exit_user_prepare
...
.Lkernel_interrupt_return:
bl interrupt_exit_kernel_prepare
In interrupt_64.S, that's similar:
interrupt_return_\srr\():
ld r4,_MSR(r1)
andi. r0,r4,MSR_PR
beq interrupt_return_\srr\()_kernel
interrupt_return_\srr\()_user: /* make backtraces match the _kernel variant */
addi r3,r1,STACK_FRAME_OVERHEAD
bl interrupt_exit_user_prepare
...
interrupt_return_\srr\()_kernel:
addi r3,r1,STACK_FRAME_OVERHEAD
bl interrupt_exit_kernel_prepare
In interrupt_exit_user_restart() and interrupt_exit_kernel_restart(),
MSR_PR is verified respectively by BUG_ON(!user_mode(regs)) and
BUG_ON(user_mode(regs)) prior to calling interrupt_exit_user_prepare()
and interrupt_exit_kernel_prepare().
The verification in interrupt_exit_user_prepare() and
interrupt_exit_kernel_prepare() are therefore useless and can be removed.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/385ead49ccb66a259b25fee3eebf0bd4094068f3.1629707037.git.christophe.leroy@csgroup.eu
Create an anonymous union for dar and dear regsiters, we can reference
dear to get the effective address when CONFIG_4xx=y or CONFIG_BOOKE=y.
Otherwise, reference dar. This makes code more clear.
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
[mpe: Reword commit title]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210807010239.416055-4-sxwjean@me.com
Create an anonymous union for dsisr and esr regsiters, we can reference
esr to get the exception detail when CONFIG_4xx=y or CONFIG_BOOKE=y.
Otherwise, reference dsisr. This makes code more clear.
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
[mpe: Reword commit title]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210807010239.416055-2-sxwjean@me.com
Copied from commit 89bbe4c798 ("powerpc/64: indirect function call
use bctrl rather than blrl in ret_from_kernel_thread")
blrl is not recommended to use as an indirect function call, as it may
corrupt the link stack predictor.
This is not a performance critical path but this should be fixed for
consistency.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/91b1d242525307ceceec7ef6e832bfbacdd4501b.1629436472.git.christophe.leroy@csgroup.eu
This fixes a compile error with W=1.
arch/powerpc/kernel/prom.c: In function ‘early_reserve_mem’:
arch/powerpc/kernel/prom.c:625:10: error: variable ‘reserve_map’ set but not used [-Werror=unused-but-set-variable]
__be64 *reserve_map;
^~~~~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210823090039.166120-2-clg@kaod.org
The implict soft-mask table addresses get relocated if they use a
relative symbol like a label. This is right for code that runs relocated
but not for unrelocated. The scv interrupt vectors run unrelocated, so
absolute addresses are required for their soft-mask table entry.
This fixes crashing with relocated kernels, usually an asynchronous
interrupt hitting in the scv handler, then hitting the trap that checks
whether r1 is in userspace.
Fixes: 325678fd05 ("powerpc/64s: add a table of implicit soft-masked addresses")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210820103431.1701240-1-npiggin@gmail.com
This patch prevents the following sparse warning.
arch/powerpc/kernel/tau_6xx.c:199:1: sparse: sparse: symbol 'tau_work'
was not declared. Should it be static?
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/44ab381741916a51e783c4a50d0b186abdd8f280.1629334014.git.fthain@linux-m68k.org
Ship minimal stdarg.h (1 type, 4 macros) as <linux/stdarg.h>.
stdarg.h is the only userspace header commonly used in the kernel.
GPL 2 version of <stdarg.h> can be extracted from
http://archive.debian.org/debian/pool/main/g/gcc-4.2/gcc-4.2_4.2.4.orig.tar.gz
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Delete/fixup few includes in anticipation of global -isystem compile
option removal.
Note: crypto/aegis128-neon-inner.c keeps <stddef.h> due to redefinition
of uintptr_t error (one definition comes from <stddef.h>, another from
<linux/types.h>).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
interrupt.c: asm/interrupt.h has been included at line 12, so remove the
duplicate one at line 10.
time.c: linux/sched/clock.h has been included at line 33,so remove the
duplicate one at line 56 and move sched/cputime.h under sched including
segament.
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210323062916.295346-1-wanjiabing@vivo.com
Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
flexibility to GCC.
For that add an entry to the exception table so that
program_check_exception() knowns where to resume execution
after a WARNING.
Here are two exemples. The first one is done on PPC32 (which
benefits from the previous patch), the second is on PPC64.
unsigned long test(struct pt_regs *regs)
{
int ret;
WARN_ON(regs->msr & MSR_PR);
return regs->gpr[3];
}
unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}
Before the patch:
000003a8 <test>:
3a8: 81 23 00 84 lwz r9,132(r3)
3ac: 71 29 40 00 andi. r9,r9,16384
3b0: 40 82 00 0c bne 3bc <test+0x14>
3b4: 80 63 00 0c lwz r3,12(r3)
3b8: 4e 80 00 20 blr
3bc: 0f e0 00 00 twui r0,0
3c0: 80 63 00 0c lwz r3,12(r3)
3c4: 4e 80 00 20 blr
0000000000000bf0 <.test9w>:
bf0: 7c 89 00 74 cntlzd r9,r4
bf4: 79 29 d1 82 rldicl r9,r9,58,6
bf8: 0b 09 00 00 tdnei r9,0
bfc: 2c 24 00 00 cmpdi r4,0
c00: 41 82 00 0c beq c0c <.test9w+0x1c>
c04: 7c 63 23 92 divdu r3,r3,r4
c08: 4e 80 00 20 blr
c0c: 38 60 00 00 li r3,0
c10: 4e 80 00 20 blr
After the patch:
000003a8 <test>:
3a8: 81 23 00 84 lwz r9,132(r3)
3ac: 71 29 40 00 andi. r9,r9,16384
3b0: 40 82 00 0c bne 3bc <test+0x14>
3b4: 80 63 00 0c lwz r3,12(r3)
3b8: 4e 80 00 20 blr
3bc: 0f e0 00 00 twui r0,0
0000000000000c50 <.test9w>:
c50: 7c 89 00 74 cntlzd r9,r4
c54: 79 29 d1 82 rldicl r9,r9,58,6
c58: 0b 09 00 00 tdnei r9,0
c5c: 7c 63 23 92 divdu r3,r3,r4
c60: 4e 80 00 20 blr
c70: 38 60 00 00 li r3,0
c74: 4e 80 00 20 blr
In the first exemple, we see GCC doesn't need to duplicate what
happens after the trap.
In the second exemple, we see that GCC doesn't need to emit a test
and a branch in the likely path in addition to the trap.
We've got some WARN_ON() in .softirqentry.text section so it needs
to be added in the OTHER_TEXT_SECTIONS in modpost.c
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/389962b1b702e3c78d169e59bcfac56282889173.1618331882.git.christophe.leroy@csgroup.eu
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-6-aneesh.kumar@linux.ibm.com
Also make related code cleanup that will allow adding FORM2_AFFINITY in
later patches. No functional change in this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132223.225214-3-aneesh.kumar@linux.ibm.com
No functional change in this patch. arch_debugfs_dir is the generic kernel
name declared in linux/debugfs.h for arch-specific debugfs directory.
Architectures like x86/s390 already use the name. Rename powerpc
specific powerpc_debugfs_root to arch_debugfs_dir.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210812132831.233794-2-aneesh.kumar@linux.ibm.com
single_step_exception() is called by emulate_single_step() which
is called from (at least) alignment exception() handler and
program_check_exception() handler.
Redefine it as a regular __single_step_exception() which is called
by both single_step_exception() handler and emulate_single_step()
function.
Fixes: 3a96570ffc ("powerpc: convert interrupt handlers to use wrappers")
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/aed174f5cbc06f2cf95233c071d8aac948e46043.1628611921.git.christophe.leroy@csgroup.eu
Two IRQ domains are added on top of default machine IRQ domain.
First, the top level "pSeries-PCI-MSI" domain deals with the MSI
specificities. In this domain, the HW IRQ numbers are generated by the
PCI MSI layer, they compose a unique ID for an MSI source with the PCI
device identifier and the MSI vector number.
These numbers can be quite large on a pSeries machine running under
the IBM Hypervisor and /sys/kernel/irq/ and /proc/interrupts will
require small fixes to show them correctly.
Second domain is the in-the-middle "pSeries-MSI" domain which acts as
a proxy between the PCI MSI subsystem and the machine IRQ subsystem.
It usually allocate the MSI vector numbers but, on pSeries machines,
this is done by the RTAS FW and RTAS returns IRQ numbers in the IRQ
number space of the machine. This is why the in-the-middle "pSeries-MSI"
domain has the same HW IRQ numbers as its parent domain.
Only the XIVE (P9/P10) parent domain is supported for now. We still
need to add support for IRQ domain hierarchy under XICS.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210701132750.1475580-6-clg@kaod.org
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210803141621.780504-4-bigeasy@linutronix.de
Setting the ->dma_address to DMA_MAPPING_ERROR is not part of
the ->map_sg calling convention, so remove it.
Link: https://lore.kernel.org/linux-mips/20210716063241.GC13345@lst.de/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Geoff Levand <geoff@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The .map_sg() op now expects an error code instead of zero on failure.
Propagate the error up if vio_dma_iommu_map_sg() fails.
ppc_iommu_map_sg() may fail either because of iommu_range_alloc() or
because of tbl->it_ops->set(). The former only supports returning an
error with DMA_MAPPING_ERROR and an examination of the latter indicates
that it may return arch-specific errors (for example,
tce_buildmulti_pSeriesLP()). Hence, coalesce all of those errors into
-EIO, per the documentation on dma_map_sgtable().
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Geoff Levand <geoff@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>