The next patch would use kvm_emulate_instruction() with
EMULTYPE_SKIP in complete_userspace_io callback to fix a
problem in msr access emulation. However, EMULTYPE_SKIP
only updates RIP, more things like updating interruptibility
state and injecting single-step #DBs would be done in the
callback. Since the emulator also does those things after
x86_emulate_insn(), add a new emulation type to pair with
EMULTYPE_SKIP to do those things for completion of user exits
within the emulator.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This bit is very close to mean "role.quadrant is not in use", except that
it is false also when the MMU is mapping guest physical addresses
directly. In that case, role.quadrant is indeed not in use, but there
are no guest PTEs at all.
Changing the name and direction of the bit removes the special case,
since a guest with paging disabled, or not considering guest paging
structures as is the case for two-dimensional paging, does not have
to deal with 4-byte guest PTEs.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reduce an indirect function call (retpoline) and some intialization
code.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra
FNAME(gva_to_gpa_nested) is needed.
Add the parameter can simplify the code. And it makes it explicit that
the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2
gpa in translate_nested_gpa() via the new parameter.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The body of __kvm_mmu_free_some_pages() has been removed.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-13-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are
nops.
No functional change intended.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for implementing in-place hugepage promotion, various
functions will need to be called from zap_collapsible_spte_range, which
has the const qualifier on its memslot argument. Propagate the const
qualifier to the various functions which will be needed. This just serves
to simplify the following patch.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-11-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_slot_page_track_is_active only uses its vCPU argument to get a
pointer to the assoicated struct kvm, so just pass in the struct KVM to
remove the need for a vCPU pointer.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115234603.2908381-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted. Use KVM's
cached nr_memslot_pages to compute the default max number of MMU pages.
Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely
multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can
possibly overflow an unsigned long variable.
Write this "* 20 / 1000" operation as "/ 50" instead to avoid such
overflow.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[sean: use common KVM field and rework changelog accordingly]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
Fix the number of bits in the role, and simplify the explanation of
why several bits or combinations of bits are redundant.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the switching code into a function so that it can be re-used and
add a global TLB flush. This makes sure that usage of memory which is
not mapped in the trampoline page-table is reliably caught.
Also move the clearing of CR4.PCIDE before the CR3 switch because the
cr4_clear_bits() function will access data not mapped into the
trampoline page-table.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-4-joro@8bytes.org
The AP bringup code uses the trampoline_pgd page-table which
establishes global mappings in the user range of the address space.
Flush the global TLB entries after the indentity mappings are removed so
no stale entries remain in the TLB.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211202153226.22946-3-joro@8bytes.org
- Use the proper operand types in __{get,put}_user() to prevent
truncation in SEV-ES string io
- Make sure the kernel mappings are present in trampoline_pgd in order
to prevent any potential accesses to unmapped memory after switching to
it
- Fix a trivial list corruption in objtool's pv_ops validation
- Disable the clocksource watchdog for TSC on platforms which claim
that the TSC is constant, doesn't stop in sleep states, CPU has TSC
adjust and the number of sockets of the platform are max 2, to prevent
erroneous markings of the TSC as unstable.
- Make sure TSC adjust is always checked not only when going idle
- Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
the FPU code
- Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsnWcACgkQEsHwGGHe
VUoR+g/9FcOP0/XLH+LKHumYc9JHXsp5BvYGihyypMFgU0fXQBORtGqdls8jZtiJ
kEdbW6iL0MRlyN8aHJCqr7dJqs7KJlpWes6hky7BY+U+7uewtjL5y3eSyZnA34T3
M/Raecx27Hh0L0kHQlHXTUN73v1cgDvq3dCXWsP7Jqgjf5cEmCcV/tPEateqhq/f
8TkLVIm55rJlbJ0LBO/cT0V3Q8QH9JPKm7nviOZuKCh9gcttFEPaM9MkaJyKUhoy
O13jlenDoVkVWRXIQec1EZp2pTLxVAm/3Y0plge1yEVsejzh07gsQnMpoNeF+yFC
8mDgSv8ZAED/vbsnB+BcgoRVj6ajG0+ilpLzcfPwUquiqS9pZrBSTddlvYDPjRMC
MEXO548xiYgxmipu3r62H89nqmLEYQPk914rJu6bDnDeJ1gaabh8RXbNtQcRqqj3
RETgVOp78iWn+aT33RLLD1EyodZb2IkMy087a3+TZICIXG81aDj9VgHvrVRnWnfY
yKuldyrEKzi60yMQkV6h1oc8KSWQhspUSLtOVS9zrulCinYphFOfYFrzFmcKUWIq
GdVb9eaP2oNBGfPybXP+TBLGZ4Zv9iXZmaEUk7ZGCjgv3ZmGMWJ18Hs/ufs2cwWK
RNNUo3sz/y3OsreHowkWIk1eSxI16MabB7G/PDMnBSHlioVT390=
=d6nS
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix a couple of SWAPGS fencing issues in the x86 entry code
- Use the proper operand types in __{get,put}_user() to prevent
truncation in SEV-ES string io
- Make sure the kernel mappings are present in trampoline_pgd in order
to prevent any potential accesses to unmapped memory after switching
to it
- Fix a trivial list corruption in objtool's pv_ops validation
- Disable the clocksource watchdog for TSC on platforms which claim
that the TSC is constant, doesn't stop in sleep states, CPU has TSC
adjust and the number of sockets of the platform are max 2, to
prevent erroneous markings of the TSC as unstable.
- Make sure TSC adjust is always checked not only when going idle
- Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
the FPU code
- Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
* tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
x86/entry: Use the correct fence macro after swapgs in kernel CR3
x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
x86/64/mm: Map all kernel memory into trampoline_pgd
objtool: Fix pv_ops noinstr validation
x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
x86/tsc: Add a timer to make sure TSC_adjust is always checked
x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()
x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
* New SEV-ES protocol for communicating invalid VMGEXIT requests
* Ensure APICv is considered inactive if there is no APIC
* Fix reserved bits for AMD PerfEvtSeln register
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGscuAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOnzAgAg/0tifAOC8F31xu+gbbFa2hJsfH1
sr2QXqVI6XUWeAHw5GEEcSFHGcFc0BrmfPEL4T3tEhkTGcijL2uTK8dwgq3ue4yg
5LzaZzh03AWi4x84rV3XNVHHatzF69tgbUG49rSlK2T6BkGWzh4gI5LZV7XqNqLh
acW+92YcCGo/O9RAUYYakofX4bp0rsQaZornQiD/R5X6AlrtMUyhAYHH5Wnv69n+
MHf4K9MzrtixXbTvkOXflN5yz6TkIGvCpCK+gmppeuJ3JyZshjn+y95XcDDZtB6o
+4Ypap3SmIkjTg4VwS8lRwIFkkY31hxhW2ohRnorfP2Q5Ckcz2/kVcIEew==
=uJMy
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
- Static analysis fix
- New SEV-ES protocol for communicating invalid VMGEXIT requests
- Ensure APICv is considered inactive if there is no APIC
- Fix reserved bits for AMD PerfEvtSeln register
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
KVM: VMX: Set failure code in prepare_vmcs02()
KVM: ensure APICv is considered inactive if there is no APIC
KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.
The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.
The new error codes are documented in the lasest update to the GHCB
specification.
Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the following commit:
025768a966 x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant
... we added the new task_size_max() inline, which uses X86_FEATURE_LA57,
but doesn't include <asm/cpufeatures.h> which defines the constant.
Due to the way alternatives macros work currently this doesn't get reported as an
immediate build error, only as a link error, if a .c file happens to include
<asm/page.h> first:
> ld: kernel/fork.o:(.altinstructions+0x98): undefined reference to `X86_FEATURE_LA57'
In the current upstream kernel no .c file includes <asm/page.h> before including
some other header that includes <asm/cpufeatures.h>, which is why this dependency
bug went unnoticed.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed. This triggers a WARN:
WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC. This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation for sharing MMIO instruction decode between SEV-ES and
TDX, factor out the common decode into a new insn_decode_mmio() helper.
For regular virtual machine, MMIO is handled by the VMM and KVM
emulates instructions that caused MMIO. But, this model doesn't work
for a secure VMs (like SEV or TDX) as VMM doesn't have access to the
guest memory and register state. So, for TDX or SEV VMM needs
assistance in handling MMIO. It induces exception in the guest. Guest
has to decode the instruction and handle it on its own.
The code is based on the current SEV MMIO implementation.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20211130184933.31005-4-kirill.shutemov@linux.intel.com
The helper returns a pointer to the register indicated by
ModRM byte.
It's going to replace vc_insn_get_reg() in the SEV MMIO
implementation. TDX MMIO implementation will also use it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lkml.kernel.org/r/20211130184933.31005-3-kirill.shutemov@linux.intel.com
Convention for all the other "lake" CPUs is all one word.
So s/RAPTOR_LAKE/RAPTORLAKE/
Fixes: fbdb5e8f29 ("x86/cpu: Add Raptor Lake to Intel family")
Reported-by: Rui Zhang <rui.zhang@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211119170832.1034220-1-tony.luck@intel.com
Kernel fails to compile with PARAVIRT_XXL=y if XEN_PV is not enabled:
ld.lld: error: undefined symbol: xen_iret
It happens because INTERRUPT_RETURN defined to use xen_iret if
CONFIG_PARAVIRT_XXL enabled regardless of CONFIG_XEN_PV.
The issue is not visible in the current kernel because CONFIG_XEN_PV is
the only user of CONFIG_PARAVIRT_XXL and there's no way to enable them
separately.
Rework code to define INTERRUPT_RETURN based on CONFIG_XEN_PV, not
CONFIG_PARAVIRT_XXL.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/20211130185533.32658-1-kirill.shutemov@linux.intel.com
The macro is unused after commit 00ecd54013 so it can be removed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd54013 ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYaD8mwAKCRCAXGG7T9hj
vspAAPwLA5SUorji33PTetwmcpLcoRJ3Q4HAPz+bOPdm9iL/PgD/V8MtxFrFebBs
AJoa+GmBarUNn7XCqKnCcA64iXhrpQw=
=6GY3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- Kconfig fix to make it possible to control building of the privcmd
driver
- three fixes for issues identified by the kernel test robot
- a five-patch series to simplify timeout handling for Xen PV driver
initialization
- two patches to fix error paths in xenstore/xenbus driver
initialization
* tag 'for-linus-5.16c-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: make HYPERVISOR_set_debugreg() always_inline
xen: make HYPERVISOR_get_debugreg() always_inline
xen: detect uninitialized xenbus in xenbus_init
xen: flag xen_snd_front to be not essential for system boot
xen: flag pvcalls-front to be not essential for system boot
xen: flag hvc_xen to be not essential for system boot
xen: flag xen_drm_front to be not essential for system boot
xen: add "not_essential" flag to struct xenbus_driver
xen/pvh: add missing prototype to header
xen: don't continue xenstore initialization in case of errors
xen/privcmd: make option visible in Kconfig
HYPERVISOR_set_debugreg() is being called from noinstr code, so it
should be attributed "always_inline".
Fixes: 7361fac046 ("x86/xen: Make set_debugreg() noinstr")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20211125092056.24758-3-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
HYPERVISOR_get_debugreg() is being called from noinstr code, so it
should be attributed "always_inline".
Fixes: f4afb713e5 ("x86/xen: Make get_debugreg() noinstr")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20211125092056.24758-2-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Increase the number of arguments supported by mixed mode calls, so that
we will be able to call into the TCG2 protocol to measure the initrd
and extend the associated PCR. This involves the TCG2 protocol's
hash_log_extend_event() method, which takes five arguments, three of
which are u64 and need to be split, producing a total of 8 outgoing
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20211119114745.1560453-3-ilias.apalodimas@linaro.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Define the right prototype for efi_tcg2_protocol::hash_log_extend_event()
and add the required structs so we can start using it to measure the initrd
into the TPM if it was loaded by the EFI stub itself.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20211119114745.1560453-2-ilias.apalodimas@linaro.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
The prototype of mem_map_via_hcall() is missing in its header, so add
it.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: a43fb7da53 ("xen/pvh: Move Xen code for getting mem map via hcall out of common file")
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20211119153913.21678-1-jgross@suse.com
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
* Cleanups for the perf test infrastructure and mapping hugepages
* Avoid contention on mmap_sem when the guests start to run
* Add event channel upcall support to xen_shinfo_test
x86 changes:
* Fixes for Xen emulation
* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
* Fixes for migration of 32-bit nested guests on 64-bit hypervisor
* Compilation fixes
* More SEV cleanups
Generic:
* Cap the return value of KVM_CAP_NR_VCPUS to both KVM_CAP_MAX_VCPUS
and num_online_cpus(). Most architectures were only using one of the two.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGV/PAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMrogf/eAyilGRQL7lLETn3DTVlgLVv82+z
giX11HlUhUmATHIDluj/wVQUjVcY6AO4SnvFaudX7B+mibndkw4L19IubP/koQZu
xnKSJTn+mVANdzz3UdsHl0ujbPdQJaFCIPW6iewbn2GRRZMwA5F3vMK/H09XRApL
I7kq8CPA6sC0I3TPzPN3ROxigexzYunZmGQ4qQe0GUdtxHrJOYQN++ddmWbQoEIC
gdFTyF7CUQ+lmJe0b/Y88yhISFAJCEBuKFlg9tOTuxSfwvPX6lUu+pi+utEx9M+O
ckTSQli/apZ4RVcSzxMIwX/BciYqhqOz5uMG+w4DRlJixtGSHtjiEVxGxw==
=Iij4
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Selftest changes:
- Cleanups for the perf test infrastructure and mapping hugepages
- Avoid contention on mmap_sem when the guests start to run
- Add event channel upcall support to xen_shinfo_test
x86 changes:
- Fixes for Xen emulation
- Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
- Fixes for migration of 32-bit nested guests on 64-bit hypervisor
- Compilation fixes
- More SEV cleanups
Generic:
- Cap the return value of KVM_CAP_NR_VCPUS to both KVM_CAP_MAX_VCPUS
and num_online_cpus(). Most architectures were only using one of
the two"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits)
KVM: x86: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
KVM: s390: Cap KVM_CAP_NR_VCPUS by num_online_cpus()
KVM: RISC-V: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
KVM: PPC: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
KVM: MIPS: Cap KVM_CAP_NR_VCPUS by KVM_CAP_MAX_VCPUS
KVM: arm64: Cap KVM_CAP_NR_VCPUS by kvm_arm_default_max_vcpus()
KVM: x86: Assume a 64-bit hypercall for guests with protected state
selftests: KVM: Add /x86_64/sev_migrate_tests to .gitignore
riscv: kvm: fix non-kernel-doc comment block
KVM: SEV: Fix typo in and tweak name of cmd_allowed_from_miror()
KVM: SEV: Drop a redundant setting of sev->asid during initialization
KVM: SEV: WARN if SEV-ES is marked active but SEV is not
KVM: SEV: Set sev_info.active after initial checks in sev_guest_init()
KVM: SEV: Disallow COPY_ENC_CONTEXT_FROM if target has created vCPUs
KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache
KVM: nVMX: Use a gfn_to_hva_cache for vmptrld
KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check
KVM: x86/xen: Use sizeof_field() instead of open-coding it
KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12
KVM: x86/xen: Fix get_attr of KVM_XEN_ATTR_TYPE_SHARED_INFO
...
* Fixes for Xen emulation
* Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache
* Fixes for migration of 32-bit nested guests on 64-bit hypervisor
* Compilation fixes
* More SEV cleanups
Incorporate EFER.LMA into kvm_mmu_extended_role, as it used to compute the
guest root level and is not reflected in kvm_mmu_page_role.level when TDP
is in use. When simply running the guest, it is impossible for EFER.LMA
and kvm_mmu.root_level to get out of sync, as the guest cannot transition
from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
first bouncing through a different MMU context. And stuffing guest state
via KVM_SET_SREGS{,2} also ensures a full MMU context reset.
However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
set guest state when migrating the VM while L2 is active, the vCPU state
will reflect L2, not L1. If L1 is using TDP for L2, then root_mmu will
have been configured using L2's state, despite not being used for L2. If
L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
be configured for guest PAE paging, but will match the mmu_role for 64-bit
paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.
Alternatively, the root_mmu's role could be invalidated after a successful
KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
tricky, and not even needed if L1 and L2 do have the same role (e.g., they
are both 64-bit guests and run with the same CR4).
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20211111020738.2512932-14-seanjc@google.com
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.
Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.
Switching to kvm_get_running_vcpu() will allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-11-seanjc@google.com
To prepare for using static_calls to optimize perf's guest callbacks,
replace ->is_in_guest and ->is_user_mode with a new multiplexed hook
->state, tweak ->handle_intel_pt_intr to play nice with being called when
there is no active guest, and drop "guest" from ->get_guest_ip.
Return '0' from ->state and ->handle_intel_pt_intr to indicate "not in
guest" so that DEFINE_STATIC_CALL_RET0 can be used to define the static
calls, i.e. no callback == !guest.
[sean: extracted from static_call patch, fixed get_ip() bug, wrote changelog]
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20211111020738.2512932-7-seanjc@google.com
Override the Processor Trace (PT) interrupt handler for guest mode if and
only if PT is configured for host+guest mode, i.e. is being used
independently by both host and guest. If PT is configured for system
mode, the host fully controls PT and must handle all events.
Fixes: 8479e04e7d ("KVM: x86: Inject PMI for KVM guest")
Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Reported-by: Artem Kashkanov <artem.kashkanov@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211111020738.2512932-4-seanjc@google.com
Remove all the defines of masks and bit positions for the GHCB MSR
protocol and use comments instead which correspond directly to the spec
so that following those can be a lot easier and straightforward with the
spec opened in parallel to the code.
Aligh vertically while at it.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211110220731.2396491-6-brijesh.singh@amd.com
Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.
Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.
Pull the call to acquire the mutex earlier so the SGX errors are also
protected.
Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com
by placing explicit signature bytes after the call trampoline to prevent
patching random other jumps like the CFI jump table entries.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGRDKsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZeNEADEFTbUJKd8812O9vkY9we1GDAtH7bY
z6sYkh0/rPvYjdPfHuwqW8tUAl+CO2ne2X8FRPKgEdRLg44BY4HaMHmujdbGh3fh
zpqynUBPoOIgtWxAPGdF+JxjrKlzjFd+WwjG3qBXOF3pjKgCc5knyjTucsl6ced3
wF293rSYrIJ6uRv2TTNbM5hWJdC0arWbdMFnwQTxeZR54WLpu7Wfm+CCK41w0fAU
nrfSsv73WEwpmAZNh04wsZsf7h6yCO7dCrIJD/3mpJtrUVBZXuZAKDzUzJPvHJal
T8LcKwxZQAgPv0ubmOCrolj98Qp6PAPSdDJbzNsCJUYEbBqaB2inJ0PeHcZPspy9
YyW00EHXD2UKm/GNF/DIlhoiNxOSh8Wn4b6H5ZRML50bS7jsMp8YVbticWEjItL6
N4/61c45/uPILBS+Lysj0aqyj4TvagiuffJFWjw3YAQ+Gp/pzlJwRNjrw7/4DxAx
KdpM881IKCR8UowBz3gIiA9FrJv2dGMqq31Rs1fjuauxkIX0gV3c64tAIRWrVscT
k6GKGvHSis5cT97K3yhmNH0BUND+Skeku8G/SnTkefvcB85aU/7HBkLLJpw0w84F
F6PTCaCJOEHrl3ADkilsi3z0sKWrph6aAzDEgp6Q6cmo9ulFAGw0bjuJb59xsvVK
flIvTLUY3n76FA==
=dgiF
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 static call update from Thomas Gleixner:
"A single fix for static calls to make the trampoline patching more
robust by placing explicit signature bytes after the call trampoline
to prevent patching random other jumps like the CFI jump table
entries"
* tag 'locking-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
static_call,x86: Robustify trampoline patching
the preemption model
- clear cluster CPU masks too, on the CPU unplug path
- prevent use-after-free in cfs
- Prevent a race condition when updating CPU cache domains
- Factor out common shared part of smp_prepare_cpus() into a common
helper which can be called by both baremetal and Xen, in order to fix a
booting of Xen PV guests
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGQ8HcACgkQEsHwGGHe
VUouoA//WAZ/dZu7IiM06JhZWswa2yNsdU8qQHys81lEqstaBqiWuZdg1qJTVIir
2d0aN0keiPcsLyAsp1UJ2g/K/7D5vSJWDzsHKfEAToiAm8Tntai2LlSocWWfeSQm
10grDHWpEHbj0hTHTA6HYOr2WbY4/LnR4cdL0WobIzivIrRTx49d0XUOUfWLP5KX
60uM6dSjwpJrQUnvzk+bhGiHVmutFrEJy+UU/0o+nxkdhwraNiSbLi0007BGRCof
6dokRRvLLR09dl1LMG51gVjQch4j/lCx6EWWUhYOFeV3I3gibSCNkmu7dpmMCBTR
QWO01cR9gyFN4xQ2is4I36M5L0/8T+sbGvvXIXNDT/XWr0/p+g6p2mx0cd2XiYIr
ZthGRcxxV/KGmxfPaygKS9tpQseMEIrdd6VjAnGfZ3OS6CtUvYt8d0B2Soj8FALQ
N9fMXDIEP3uUZim8UvCT6HBKlj9LR5uI5n+dAQ6uzsenO9WqeGeldc/N26/+osdN
vo4lNYTqiXJPhJvunYW5t4j5JnUa3grDHioAPWaQRJlWtEZBGKs9SXTcweg/KURb
mNfe1RfSlGJt28RD3E18gXeSS7xWdKgpcVX1rmW/9tUjX04NNDWjq4sAzOj7c+Ir
4sr78XgCY0pUxFaFYxvQWFUy7wcm0zAczo1RGUhcDTf1edDEvjo=
=s2MX
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Avoid touching ~100 config files in order to be able to select the
preemption model
- clear cluster CPU masks too, on the CPU unplug path
- prevent use-after-free in cfs
- Prevent a race condition when updating CPU cache domains
- Factor out common shared part of smp_prepare_cpus() into a common
helper which can be called by both baremetal and Xen, in order to fix
a booting of Xen PV guests
* tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Restore preemption model selection configs
arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology()
sched/fair: Prevent dead task groups from regaining cfs_rq's
sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
x86/smp: Factor out parts of native_smp_prepare_cpus()
- Do not log spurious corrected MCEs on SKL too, due to an erratum
- Clarify the path of paravirt ops patches upstream
- Add an optimization to avoid writing out AMX components to sigframes
when former are in init state
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGQ3CgACgkQEsHwGGHe
VUoLAA/+NXRvcBHYkLaByT9f4OI6B79HzyguIBSfipYiw8ir0H7uEdV5FUCCUgCz
egBRVFpOsXWt1teeuu6ViO+WBHncUxG/ryZ0ka35lri/3kuVYnugZExWDs4MrGR5
vehRXehOxYNRaYc3oLYjubSbxqF1nWz3WWfGfhiBKk0jT/S1T9tX6lsRXlKsJCgj
M4x5aqBWP8HTbFQfqjdHwagNitmSKzgjZvMcC4UWcql33ZCycbjvRdrAzBtw7WRI
UBvgxWVmeMoagu5fqEOoph1oSoFxWuFrweFUjnxJmT6uZrTsfF7BVgXkxdG6eYUy
2Xogcd4bPDBiRgbs0vPEog1tyyrKHOQ6p1pvksySKMPq6ULcSZ6hBpEZRpgr6Y9u
0jB3P6weQgCckx5Hd+iwvX1a+GvEuHSEqAE+j160wFyrsBS5Cir3P1WqthWaPd5I
3nH3h955PokUHPUioUhdf+8cfuP6h6K0nz1gdYI8GR8+fJHhEceT+pLLeyIxj/VM
yr+bq+V7D6Cg62w3z3s9Dzg2XKpxStu1R9L1N/K8MtIGf6Uc7paL6xR27XxhmBp5
Y6bGZw0mxxFhp6AEsFWo3rwLL9Dl5DmFcfgUHHpPK5VP0pVWp48Uapx2Hi2/JzAo
c1o4UkPQa/EZJBPTklmGkS1JNp/2TsEL4Fw7sew+j7DWtsJpCfk=
=Ge2T
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add the model number of a new, Raptor Lake CPU, to intel-family.h
- Do not log spurious corrected MCEs on SKL too, due to an erratum
- Clarify the path of paravirt ops patches upstream
- Add an optimization to avoid writing out AMX components to sigframes
when former are in init state
* tag 'x86_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add Raptor Lake to Intel family
x86/mce: Add errata workaround for Skylake SKX37
MAINTAINERS: Add some information to PARAVIRT_OPS entry
x86/fpu: Optimize out sigframe xfeatures when in init state
Add model ID for Raptor Lake.
[ dhansen: These get added as soon as possible so that folks doing
development can leverage them. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20211112182835.924977-1-tony.luck@intel.com
* Fix misuse of gfn-to-pfn cache when recording guest steal time / preempted status
* Fix selftests on APICv machines
* Fix sparse warnings
* Fix detection of KVM features in CPUID
* Cleanups for bogus writes to MSR_KVM_PV_EOI_EN
* Fixes and cleanups for MSR bitmap handling
* Cleanups for INVPCID
* Make x86 KVM_SOFT_MAX_VCPUS consistent with other architectures
Add support for AMD SEV and SEV-ES intra-host migration support. Intra
host migration provides a low-cost mechanism for userspace VMM upgrades.
In the common case for intra host migration, we can rely on the normal
ioctls for passing data from one VMM to the next. SEV, SEV-ES, and other
confidential compute environments make most of this information opaque, and
render KVM ioctls such as "KVM_GET_REGS" irrelevant. As a result, we need
the ability to pass this opaque metadata from one VMM to the next. The
easiest way to do this is to leave this data in the kernel, and transfer
ownership of the metadata from one KVM VM (or vCPU) to the next. In-kernel
hand off makes it possible to move any data that would be
unsafe/impossible for the kernel to hand directly to userspace, and
cannot be reproduced using data that can be handed to userspace.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
either num_online_cpus() or num_present_cpus(), s390 has multiple
constants depending on hardware features. On x86, KVM reports an
arbitrary value of '710' which is supposed to be the maximum tested
value but it's possible to test all KVM_MAX_VCPUS even when there are
less physical CPUs available.
Drop the arbitrary '710' value and return num_online_cpus() on x86 as
well. The recommendation will match other architectures and will mean
'no CPU overcommit'.
For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
when the requested vCPU number exceeds it. The static limit of '710'
is quite weird as smaller systems with just a few physical CPUs should
certainly "recommend" less.
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently when kvm_update_cpuid_runtime() runs, it assumes that the
KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true,
however, if Hyper-V support is enabled. In this case the KVM leaves will
be offset.
This patch introdues as new 'kvm_cpuid_base' field into struct
kvm_vcpu_arch to track the location of the KVM leaves and function
kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the
leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a
definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now
target the correct leaf.
NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is intoduced
into processor.h to avoid having duplicate code for the iteration
over possible hypervisor base leaves.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Message-Id: <20211105095101.5384-3-pdurrant@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using
standard kvm's inject_pending_event, and not via APICv/AVIC.
Since this is a debug feature, just inhibit APICv/AVIC while
KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
Fixes: 61e5f69ef0 ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In commit b043138246 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
not missed") we switched to using a gfn_to_pfn_cache for accessing the
guest steal time structure in order to allow for an atomic xchg of the
preempted field. This has a couple of problems.
Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
guest vCPU using an IOMEM page for its steal time would never have its
preempted field set.
Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
should have been. There are two stages to the GFN->PFN conversion;
first the GFN is converted to a userspace HVA, and then that HVA is
looked up in the process page tables to find the underlying host PFN.
Correct invalidation of the latter would require being hooked up to the
MMU notifiers, but that doesn't happen---so it just keeps mapping and
unmapping the *wrong* PFN after the userspace page tables change.
In the !IOMEM case at least the stale page *is* pinned all the time it's
cached, so it won't be freed and reused by anyone else while still
receiving the steal time updates. The map/unmap dance only takes care
of the KVM administrivia such as marking the page dirty.
Until the gfn_to_pfn cache handles the remapping automatically by
integrating with the MMU notifiers, we might as well not get a
kernel mapping of it, and use the perfectly serviceable userspace HVA
that we already have. We just need to implement the atomic xchg on
the userspace address with appropriate exception handling, which is
fairly trivial.
Cc: stable@vger.kernel.org
Fixes: b043138246 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
[I didn't entirely agree with David's assessment of the
usefulness of the gfn_to_pfn cache, and integrated the outcome
of the discussion in the above commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV to work with intra host migration, contents of the SEV info struct
such as the ASID (used to index the encryption key in the AMD SP) and
the list of memory regions need to be transferred to the target VM.
This change adds a commands for a target VMM to get a source SEV VM's sev
info.
Signed-off-by: Peter Gonda <pgonda@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <20211021174303.385706-3-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add guest api and guest kernel support for SEV live migration.
Introduces a new hypercall to notify the host of changes to the page
encryption status. If the page is encrypted then it must be migrated
through the SEV firmware or a helper VM sharing the key. If page is
not encrypted then it can be migrated normally by userspace. This new
hypercall is invoked using paravirt_ops.
Conflicts: sev_active() replaced by cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT).
The guest support for detecting and enabling SEV Live migration
feature uses the following logic :
- kvm_init_plaform() checks if its booted under the EFI
- If not EFI,
i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl()
to enable the SEV live migration support
- If EFI,
i) If kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read
the UEFI variable which indicates OVMF support for live migration
ii) the variable indicates live migration is supported, issue a wrmsrl() to
enable the SEV live migration support
The EFI live migration check is done using a late_initcall() callback.
Also, ensure that _bss_decrypted section is marked as decrypted in the
hypervisor's guest page encryption status tracking.
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invoke a hypercall when a memory region is changed from encrypted ->
decrypted and vice versa. Hypervisor needs to know the page encryption
status during the guest migration.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM hypercall framework relies on alternative framework to patch the
VMCALL -> VMMCALL on AMD platform. If a hypercall is made before
apply_alternative() is called then it defaults to VMCALL. The approach
works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor
will be able to decode the instruction and do the right things. But
when SEV is active, guest memory is encrypted with guest key and
hypervisor will not be able to decode the instruction bytes.
To highlight the need to provide this interface, capturing the
flow of apply_alternatives() :
setup_arch() call init_hypervisor_platform() which detects
the hypervisor platform the kernel is running under and then the
hypervisor specific initialization code can make early hypercalls.
For example, KVM specific initialization in case of SEV will try
to mark the "__bss_decrypted" section's encryption state via early
page encryption status hypercalls.
Now, apply_alternatives() is called much later when setup_arch()
calls check_bugs(), so we do need some kind of an early,
pre-alternatives hypercall interface. Other cases of pre-alternatives
hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES
and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for
SEV generally.
Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall
will be used by the SEV guest to notify encrypted pages to the hypervisor.
This kvm_sev_hypercall3() function is abstracted and used as follows :
All these early hypercalls are made through early_set_memory_XX() interfaces,
which in turn invoke pv_ops (paravirt_ops).
This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed()
is a generic interface and can easily have SEV, TDX and any other
future platform specific abstractions added to it.
Currently, pv_ops.mmu.notify_page_enc_status_changed() callback is setup to
invoke kvm_sev_hypercall3() in case of SEV.
Similarly, in case of TDX, pv_ops.mmu.notify_page_enc_status_changed()
can be setup to a TDX specific callback.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 66558b730f ("sched: Add cluster scheduler level for x86")
introduced cpu_l2c_shared_map mask which is expected to be initialized
by smp_op.smp_prepare_cpus(). That commit only updated
native_smp_prepare_cpus() version but not xen_pv_smp_prepare_cpus().
As result Xen PV guests crash in set_cpu_sibling_map().
While the new mask can be allocated in xen_pv_smp_prepare_cpus() one can
see that both versions of smp_prepare_cpus ops share a number of common
operations that can be factored out. So do that instead.
Fixes: 66558b730f ("sched: Add cluster scheduler level for x86")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/1635896196-18961-1-git-send-email-boris.ostrovsky@oracle.com
Add a few signature bytes after the static call trampoline and verify
those bytes match before patching the trampoline. This avoids patching
random other JMPs (such as CFI jump-table entries) instead.
These bytes decode as:
d: 53 push %rbx
e: 43 54 rex.XB push %r12
And happen to spell "SCT".
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211030074758.GT174703@worktop.programming.kicks-ass.net
This is a single cleanup from Peter Collingbourne, removing
some dead code.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmGKm+kACgkQmmx57+YA
GNkksg/7BUxJrWFrQmLA3fhzh4wG3KswdrKTGQMf0jRVI1n77vmdfig3hEJmMekH
0SZoIYmPztOSj34+6p4NxuqY/Sk62oYLr8Awo8ZLDhIBWNUJE8UWC07Qb/3DpGbp
JfD7/8mXu10htgM85aGlhaVLnHvvqUBR8PlkJUGNxuY5Gy2L+eCkpwCAYlZpSOqr
vs0BJWFY1LzO1POcjIq4/IM0PvcU2ncLB5XoxDjjIIWnWKyHzWY21ZvoeaNumvou
xqQ/Hj8Sc+ufS0yNlSgIC+bJP0bp1bSw/dALKr8oYxLt7X9LELVY3WXyRH+It4nS
b6HPYmga26NVq/u7RrylBA+2fRCDB8E6z73gHt4SeHrRDEvFjhNzIyV+aXPae/dY
XI/pjiwpG6k6FpSnF69YZJ/Y+GmUA90V/Jq8aLFZhGz8SgpjRl+2foEUSDhRVXCA
jGB1Y388m0e6jPlVJROB7ORzXMd8K5iciyUGqtAI87QCOtPozn10ruh5RdziguKm
kDW5IKy9E4l1ch8WRprVbgV5Ew+QWKS1JIbyjDaX3jN0lUPCqgwkwvRxxtgdFnVA
Lq5BiUdraSWFUr84rLU3gCRU0+VoEdyZYI+bQGGNlQ4ovmLYU1nLmCU/azMvSEgz
ZsoM5YdffShxUtfwMg+W67RpYOjSnV55ZzwN0+d1C2oqz9xECXc=
=8EFP
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic cleanup from Arnd Bergmann:
"This is a single cleanup from Peter Collingbourne, removing some dead
code"
* tag 'asm-generic-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: remove unused function syscall_set_arguments()
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYYp8HgAKCRCAXGG7T9hj
vmuVAP4whjbyIi4IxYEOnE6On0aD0AgUMiFa7QXrDZi6NXUQIwEAnggLFe+rEG5C
Fwi/cEXSHrRgveqrgD4GYEr6l0GTxwM=
=/fMa
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.16b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- a series to speed up the boot of Xen PV guests
- some cleanups in Xen related code
- replacement of license texts with the appropriate SPDX headers and
fixing of wrong SPDX headers in Xen header files
- a small series making paravirtualized interrupt masking much simpler
and at the same time removing complaints of objtool
- a fix for Xen ballooning hogging workqueues for too long
- enablement of the Xen pciback driver for Arm
- some further small fixes/enhancements
* tag 'for-linus-5.16b-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (22 commits)
xen/balloon: fix unused-variable warning
xen/balloon: rename alloc/free_xenballooned_pages
xen/balloon: add late_initcall_sync() for initial ballooning done
x86/xen: remove 32-bit awareness from startup_xen
xen: remove highmem remnants
xen: allow pv-only hypercalls only with CONFIG_XEN_PV
x86/xen: remove 32-bit pv leftovers
xen-pciback: allow compiling on other archs than x86
x86/xen: switch initial pvops IRQ functions to dummy ones
x86/xen: remove xen_have_vcpu_info_placement flag
x86/pvh: add prototype for xen_pvh_init()
xen: Fix implicit type conversion
xen: fix wrong SPDX headers of Xen related headers
xen/pvcalls-back: Remove redundant 'flush_workqueue()' calls
x86/xen: Remove redundant irq_enter/exit() invocations
xen-pciback: Fix return in pm_ctrl_init()
xen/x86: restrict PV Dom0 identity mapping
xen/x86: there's no highmem anymore in PV mode
xen/x86: adjust handling of the L3 user vsyscall special page table
xen/x86: adjust xen_set_fixmap()
...
Including:
- Intel IOMMU Updates fro Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible string for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implemenation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmGD6NQACgkQK/BELZcB
GuOSfg/9FKXl5ym86BP3tAS1fREKH7p59JRGZrrIR89NyHAcEUjtNG3YLPao+YxU
3CDgLkru+vlDpYY54QoyqcY5FgIHT3Cna/Cdk4zekRmSO/14gHp47jtZRheOUzLF
rvwfaplcbbtT8akpsVFzvw8YpQLGSDiDQSl7xL2+40Z9hiYX/gS9Af+PH98tAXsa
yZKZj6gU+JXM58VihO3M7umyE06tovyBaYgcsBZtbf66bGc0ySu+fe75UVWbueRt
Z8jwqa7TUfVXiYC8h+LqtGET6gtzNSsxAU3VllRe7Brf6K8i/yaRs/TO2Hp83d7/
q/fcK3vNQ5v3aDNci/DjBB8SEySzCmRz/9ocCOCx8ByuRp+5lwVRPPq3WcUMtsZY
QpYo9Fk7luFz2Gj5LObKAVBvOoeBZ5Km3oPs4HVmQ6epxn/rVckJDnJnVSLJuATq
tSZC2heRfFlg1dT6WFaynCTP2RI1LlNEdKhHirV6L368rSjmF0ZdQxdTpHULsHr1
yMjqL21OfcSkLW91rvfb3g68EsIwDbCPGTOlQWZLmAtwOWtHSCLPgwwEG7WefZbH
yaslpmlUTOurUnFmpxlfLicy5sqsBL2ASzGJkEKrgunw82Ke96zzkRzi+9j9HeS6
g0AyIWMi1cUAjONVUZtV4yjImXh63HIPiKx730a9teodusoxm+Q=
=waUR
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- Intel IOMMU Updates fro Lu Baolu:
- Dump DMAR translation structure when DMA fault occurs
- An optimization in the page table manipulation code
- Use second level for GPA->HPA translation
- Various cleanups
- Arm SMMU Updates from Will
- Minor optimisations to SMMUv3 command creation and submission
- Numerous new compatible string for Qualcomm SMMUv2 implementations
- Fixes for the SWIOTLB based implemenation of dma-iommu code for
untrusted devices
- Add support for r8a779a0 to the Renesas IOMMU driver and DT matching
code for r8a77980
- A couple of cleanups and fixes for the Apple DART IOMMU driver
- Make use of generic report_iommu_fault() interface in the AMD IOMMU
driver
- Various smaller fixes and cleanups
* tag 'iommu-updates-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (35 commits)
iommu/dma: Fix incorrect error return on iommu deferred attach
iommu/dart: Initialize DART_STREAMS_ENABLE
iommu/dma: Use kvcalloc() instead of kvzalloc()
iommu/tegra-smmu: Use devm_bitmap_zalloc when applicable
iommu/dart: Use kmemdup instead of kzalloc and memcpy
iommu/vt-d: Avoid duplicate removing in __domain_mapping()
iommu/vt-d: Convert the return type of first_pte_in_page to bool
iommu/vt-d: Clean up unused PASID updating functions
iommu/vt-d: Delete dev_has_feat callback
iommu/vt-d: Use second level for GPA->HPA translation
iommu/vt-d: Check FL and SL capability sanity in scalable mode
iommu/vt-d: Remove duplicate identity domain flag
iommu/vt-d: Dump DMAR translation structure when DMA fault occurs
iommu/vt-d: Do not falsely log intel_iommu is unsupported kernel option
iommu/arm-smmu-qcom: Request direct mapping for modem device
iommu: arm-smmu-qcom: Add compatible for QCM2290
dt-bindings: arm-smmu: Add compatible for QCM2290 SoC
iommu/arm-smmu-qcom: Add SM6350 SMMU compatible
dt-bindings: arm-smmu: Add compatible for SM6350 SoC
iommu/arm-smmu-v3: Properly handle the return value of arm_smmu_cmdq_build_cmd()
...
Here is the big set of driver core changes for 5.16-rc1.
All of these have been in linux-next for a while now with no reported
problems.
Included in here are:
- big update and cleanup of the sysfs abi documentation files
and scripts from Mauro. We are almost at the place where we
can properly check that the running kernel's sysfs abi is
documented fully.
- firmware loader updates
- dyndbg updates
- kernfs cleanups and fixes from Christoph
- device property updates
- component fix
- other minor driver core cleanups and fixes
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYYPbjQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ync9gCfXKMUI1GAnCfJWAwTdTcd18q5akoAoMw32/AH
0yh5TjAWFyFd7xz5d7qs
=itsC
-----END PGP SIGNATURE-----
Merge tag 'driver-core-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core changes for 5.16-rc1.
All of these have been in linux-next for a while now with no reported
problems.
Included in here are:
- big update and cleanup of the sysfs abi documentation files and
scripts from Mauro. We are almost at the place where we can
properly check that the running kernel's sysfs abi is documented
fully.
- firmware loader updates
- dyndbg updates
- kernfs cleanups and fixes from Christoph
- device property updates
- component fix
- other minor driver core cleanups and fixes"
* tag 'driver-core-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (122 commits)
device property: Drop redundant NULL checks
x86/build: Tuck away built-in firmware under FW_LOADER
vmlinux.lds.h: wrap built-in firmware support under FW_LOADER
firmware_loader: move struct builtin_fw to the only place used
x86/microcode: Use the firmware_loader built-in API
firmware_loader: remove old DECLARE_BUILTIN_FIRMWARE()
firmware_loader: formalize built-in firmware API
component: do not leave master devres group open after bind
dyndbg: refine verbosity 1-4 summary-detail
gpiolib: acpi: Replace custom code with device_match_acpi_handle()
i2c: acpi: Replace custom function with device_match_acpi_handle()
driver core: Provide device_match_acpi_handle() helper
dyndbg: fix spurious vNpr_info change
dyndbg: no vpr-info on empty queries
dyndbg: vpr-info on remove-module complete, not starting
device property: Add missed header in fwnode.h
Documentation: dyndbg: Improve cli param examples
dyndbg: Remove support for ddebug_query param
dyndbg: make dyndbg a known cli param
dyndbg: show module in vpr-info in dd-exec-queries
...
tl;dr: AMX state is ~8k. Signal frames can have space for this
~8k and each signal entry writes out all 8k even if it is zeros.
Skip writing zeros for AMX to speed up signal delivery by about
4% overall when AMX is in its init state.
This is a user-visible change to the sigframe ABI.
== Hardware XSAVE Background ==
XSAVE state components may be tracked by the processor as being
in their initial configuration. Software can detect which
features are in this configuration by looking at the XSTATE_BV
field in an XSAVE buffer or with the XGETBV(1) instruction.
Both the XSAVE and XSAVEOPT instructions enumerate features s
being in the initial configuration via the XSTATE_BV field in the
XSAVE header, However, XSAVEOPT declines to actually write
features in their initial configuration to the buffer. XSAVE
writes the feature unconditionally, regardless of whether it is
in the initial configuration or not.
Basically, XSAVE users never need to inspect XSTATE_BV to
determine if the feature has been written to the buffer.
XSAVEOPT users *do* need to inspect XSTATE_BV. They might also
need to clear out the buffer if they want to make an isolated
change to the state, like modifying one register.
== Software Signal / XSAVE Background ==
Signal frames have historically been written with XSAVE itself.
Each state is written in its entirety, regardless of being in its
initial configuration.
In other words, the signal frame ABI uses the XSAVE behavior, not
the XSAVEOPT behavior.
== Problem ==
This means that any application which has acquired permission to
use AMX via ARCH_REQ_XCOMP_PERM will write 8k of state to the
signal frame. This 8k write will occur even when AMX was in its
initial configuration and software *knows* this because of
XSTATE_BV.
This problem also exists to a lesser degree with AVX-512 and its
2k of state. However, AVX-512 use does not require
ARCH_REQ_XCOMP_PERM and is more likely to have existing users
which would be impacted by any change in behavior.
== Solution ==
Stop writing out AMX xfeatures which are in their initial state
to the signal frame. This effectively makes the signal frame
XSAVE buffer look as if it were written with a combination of
XSAVEOPT and XSAVE behavior. Userspace which handles XSAVEOPT-
style buffers should be able to handle this naturally.
For now, include only the AMX xfeatures: XTILE and XTILEDATA in
this new behavior. These require new ABI to use anyway, which
makes their users very unlikely to be broken. This XSAVEOPT-like
behavior should be expected for all future dynamic xfeatures. It
may also be extended to legacy features like AVX-512 in the
future.
Only attempt this optimization on systems with dynamic features.
Disable dynamic feature support (XFD) if XGETBV1 is unavailable
by adding a CPUID dependency.
This has been measured to reduce the *overall* cycle cost of
signal delivery by about 4%.
Fixes: 2308ee57d9 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "Chang S. Bae" <chang.seok.bae@intel.com>
Link: https://lore.kernel.org/r/20211102224750.FA412E26@davehans-spike.ostc.intel.com
core:
- improve dma_fence, lease and resv documentation
- shmem-helpers: allocate WC pages on x86, use vmf_insert_pin
- sched fixes/improvements
- allow empty drm leases
- add dma resv iterator
- add more DP 2.0 headers
- DP MST helper improvements for DP2.0
dma-buf:
- avoid warnings, remove fence trace macros
bridge:
- new helper to get rid of panels
- probe improvements for it66121
- enable DSI EOTP for anx7625
fbdev:
- efifb: release runtime PM on destroy
ttm:
- kerneldoc switch
- helper to clear all DMA mappings
- pool shrinker optimizaton
- remove ttm_tt_destroy_common
- update ttm_move_memcpy for async use
panel:
- add new panel-edp driver
amdgpu:
- Initial DP 2.0 support
- Initial USB4 DP tunnelling support
- Aldebaran MCE support
- Modifier support for DCC image stores for GFX 10.3
- Display rework for better FP code handling
- Yellow Carp/Cyan Skillfish updates
- Cyan Skillfish display support
- convert vega/navi to IP discovery asic enumeration
- validate IP discovery table
- RAS improvements
- Lots of fixes
i915:
- DG1 PCI IDs + LMEM discovery/placement
- DG1 GuC submission by default
- ADL-S PCI IDs updated + enabled by default
- ADL-P (XE_LPD) fixed and updates
- DG2 display fixes
- PXP protected object support for Gen12 integrated
- expose multi-LRC submission interface for GuC
- export logical engine instance to user
- Disable engine bonding on Gen12+
- PSR cleanup
- PSR2 selective fetch by default
- DP 2.0 prep work
- VESA vendor block + MSO use of it
- FBC refactor
- try again to fix fast-narrow vs slow-wide eDP training
- use THP when IOMMU enabled
- LMEM backup/restore for suspend/resume
- locking simplification
- GuC major reworking
- async flip VT-D workaround changes
- DP link training improvements
- misc display refactorings
bochs:
- new PCI ID
rcar-du:
- Non-contiguious buffer import support for rcar-du
- r8a779a0 support prep
omapdrm:
- COMPILE_TEST fixes
sti:
- COMPILE_TEST fixes
msm:
- fence ordering improvements
- eDP support in DP sub-driver
- dpu irq handling cleanup
- CRC support for making igt happy
- NO_CONNECTOR bridge support
- dsi: 14nm phy support for msm8953
- mdp5: msm8x53, sdm450, sdm632 support
stm:
- layer alpha + zpo support
v3d:
- fix Vulkan CTS failure
- support multiple sync objects
gud:
- add R8/RGB332/RGB888 pixel formats
vc4:
- convert to new bridge helpers
vgem:
- use shmem helpers
virtio:
- support mapping exported vram
zte:
- remove obsolete driver
rockchip:
- use bridge attach no connector for LVDS/RGB
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmGByPYACgkQDHTzWXnE
hr6fxA//cXUvTHlEtF7UJDBRAYv+9lXH39NbGYU4aLJuBNlZztCuUi5JOSyDFDH1
N9VI5biVseev2PEnCzJUubWxTqbUO7FBQTw0TyvZ4Eqn+UZMuFeo0dvdKZRAkvjV
VHSUc0fm0+WSYanKUK7XK0fwG8aE6JVyYngzgKPSjifhszTdiiRsbU21iTinFhkS
rgh3HEVELp+LqfoG4qzAYqFUjYqUjvCjd/hX/UkzCII8ZXKr38/4127e95443WOk
+jes0gWGJe9TvSDrqo9TMx4qukcOniINFUvnzoD2RhOS+Jzr/i5rBh51Xy92g3NO
Q7hy6byZdk/ZO/MXCDQ2giUOkBiqn5fQjlRGQp4iAZYw9pb3HU+/xrTq0BWVWd8o
/vmzZYEKKU/sCGpxVDMZxsHV3mXIuVBvuZq6bjmSGcybgOBCiDx5F/Rum4nY2yHp
lr3cuc0HP3m3f4b/HVvACO4tGd1nDDpVcon7CuhBB7HB7t6Zl9u18qc/qFw0tCTh
3sgAhno6XFXtPFcSX2KAeeg0mhKDKKrsOnq5y3bDRr05Z0jLocJk95aXEKs6em4j
gbyHwNaX3CHtiCnFn2/5169+n1K7zqHBtVSGmQlmFDv55rcdx7L3Spk7tCahQeSQ
ur24r+sEggm8d5Wjl+MYq6wW3oP31s04JFaeV6oCkaSp1wS+alg=
=jdhH
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2021-11-03' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Summary below. i915 starts to add support for DG2 GPUs, enables DG1
and ADL-S support by default, lots of work to enable DisplayPort 2.0
across drivers. Lots of documentation updates and fixes across the
board.
core:
- improve dma_fence, lease and resv documentation
- shmem-helpers: allocate WC pages on x86, use vmf_insert_pin
- sched fixes/improvements
- allow empty drm leases
- add dma resv iterator
- add more DP 2.0 headers
- DP MST helper improvements for DP2.0
dma-buf:
- avoid warnings, remove fence trace macros
bridge:
- new helper to get rid of panels
- probe improvements for it66121
- enable DSI EOTP for anx7625
fbdev:
- efifb: release runtime PM on destroy
ttm:
- kerneldoc switch
- helper to clear all DMA mappings
- pool shrinker optimizaton
- remove ttm_tt_destroy_common
- update ttm_move_memcpy for async use
panel:
- add new panel-edp driver
amdgpu:
- Initial DP 2.0 support
- Initial USB4 DP tunnelling support
- Aldebaran MCE support
- Modifier support for DCC image stores for GFX 10.3
- Display rework for better FP code handling
- Yellow Carp/Cyan Skillfish updates
- Cyan Skillfish display support
- convert vega/navi to IP discovery asic enumeration
- validate IP discovery table
- RAS improvements
- Lots of fixes
i915:
- DG1 PCI IDs + LMEM discovery/placement
- DG1 GuC submission by default
- ADL-S PCI IDs updated + enabled by default
- ADL-P (XE_LPD) fixed and updates
- DG2 display fixes
- PXP protected object support for Gen12 integrated
- expose multi-LRC submission interface for GuC
- export logical engine instance to user
- Disable engine bonding on Gen12+
- PSR cleanup
- PSR2 selective fetch by default
- DP 2.0 prep work
- VESA vendor block + MSO use of it
- FBC refactor
- try again to fix fast-narrow vs slow-wide eDP training
- use THP when IOMMU enabled
- LMEM backup/restore for suspend/resume
- locking simplification
- GuC major reworking
- async flip VT-D workaround changes
- DP link training improvements
- misc display refactorings
bochs:
- new PCI ID
rcar-du:
- Non-contiguious buffer import support for rcar-du
- r8a779a0 support prep
omapdrm:
- COMPILE_TEST fixes
sti:
- COMPILE_TEST fixes
msm:
- fence ordering improvements
- eDP support in DP sub-driver
- dpu irq handling cleanup
- CRC support for making igt happy
- NO_CONNECTOR bridge support
- dsi: 14nm phy support for msm8953
- mdp5: msm8x53, sdm450, sdm632 support
stm:
- layer alpha + zpo support
v3d:
- fix Vulkan CTS failure
- support multiple sync objects
gud:
- add R8/RGB332/RGB888 pixel formats
vc4:
- convert to new bridge helpers
vgem:
- use shmem helpers
virtio:
- support mapping exported vram
zte:
- remove obsolete driver
rockchip:
- use bridge attach no connector for LVDS/RGB"
* tag 'drm-next-2021-11-03' of git://anongit.freedesktop.org/drm/drm: (1259 commits)
drm/amdgpu/gmc6: fix DMA mask from 44 to 40 bits
drm/amd/display: MST support for DPIA
drm/amdgpu: Fix even more out of bound writes from debugfs
drm/amdgpu/discovery: add SDMA IP instance info for soc15 parts
drm/amdgpu/discovery: add UVD/VCN IP instance info for soc15 parts
drm/amdgpu/UAPI: rearrange header to better align related items
drm/amd/display: Enable dpia in dmub only for DCN31 B0
drm/amd/display: Fix USB4 hot plug crash issue
drm/amd/display: Fix deadlock when falling back to v2 from v3
drm/amd/display: Fallback to clocks which meet requested voltage on DCN31
drm/amd/display: move FPU associated DCN301 code to DML folder
drm/amd/display: fix link training regression for 1 or 2 lane
drm/amd/display: add two lane settings training options
drm/amd/display: decouple hw_lane_settings from dpcd_lane_settings
drm/amd/display: implement decide lane settings
drm/amd/display: adopt DP2.0 LT SCR revision 8
drm/amd/display: FEC configuration for dpia links in MST mode
drm/amd/display: FEC configuration for dpia links
drm/amd/display: Add workaround flag for EDID read on certain docks
drm/amd/display: Set phy_mux_sel bit in dmub scratch register
...
* More progress on the protected VM front, now with the full
fixed feature set as well as the limitation of some hypercalls
after initialisation.
* Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
* Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
* More memcg accounting of the memory allocated on behalf of a guest
* Timer and vgic selftests
* Workarounds for the Apple M1 broken vgic implementation
* KConfig cleanups
* New kvmarm.mode=none option, for those who really dislike us
RISC-V:
* New KVM port.
x86:
* New API to control TSC offset from userspace
* TSC scaling for nested hypervisors on SVM
* Switch masterclock protection from raw_spin_lock to seqcount
* Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
* Convey the exit reason to userspace on emulation failure
* Configure time between NX page recovery iterations
* Expose Predictive Store Forwarding Disable CPUID leaf
* Allocate page tracking data structures lazily (if the i915
KVM-GT functionality is not compiled in)
* Cleanups, fixes and optimizations for the shadow MMU code
s390:
* SIGP Fixes
* initial preparations for lazy destroy of secure VMs
* storage key improvements/fixes
* Log the guest CPNC
Starting from this release, KVM-PPC patches will come from
Michael Ellerman's PPC tree.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmGBOiEUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNowwf/axlx3g9sgCwQHr12/6UF/7hL/RwP
9z+pGiUzjl2YQE+RjSvLqyd6zXh+h4dOdOKbZDLSkSTbcral/8U70ojKnQsXM0XM
1LoymxBTJqkgQBLm9LjYreEbzrPV4irk4ygEmuk3CPOHZu8xX1ei6c5LdandtM/n
XVUkXsQY+STkmnGv4P3GcPoDththCr0tBTWrFWtxa0w9hYOxx0ay1AZFlgM4FFX0
QFuRc8VBLoDJpIUjbkhsIRIbrlHc/YDGjuYnAU7lV/CIME8vf2BW6uBwIZJdYcDj
0ejozLjodEnuKXQGnc8sXFioLX2gbMyQJEvwCgRvUu/EU7ncFm1lfs7THQ==
=UxKM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmGBMQUTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXmE5B/9MK3Ju+tc6C8eyR3Ic4XBYHJ3voEKO
M+R90gggBriDOgkz4B8vF+k0aD8wevXAUtmCSXonDzCh5H7GoyfrVZmJEVkwlioH
ZMSMlFHcjGhCPIXhLbNtfo/NsAYEtT/lRM2lLGCSbdGuKabylXKujVdhuSIcRPdj
Rj5innUgcAywOoxG6WzFt3JBzM33UQErCGfUF2b7Rvp9E+Zii4vIMxkMzUpnkEHH
F8WMEdL0DqH5ThOs0MslNgy03pUC9wk1d5DNd9ytYHqiSQtcQZhFHw/P6dxzUFlW
OptWv31PXUIsiJf4Zi9hmfjgUl+KZHeacZ2hXtidAo86VPcIjVs25OQW
=40fn
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Initial patch set for Hyper-V isolation VM support (Tianyu Lan)
- Fix a warning on preemption (Vitaly Kuznetsov)
- A bunch of misc cleanup patches
* tag 'hyperv-next-signed-20211102' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Protect set_hv_tscchange_cb() against getting preempted
Drivers: hv : vmbus: Adding NULL pointer check
x86/hyperv: Remove duplicate include
x86/hyperv: Remove duplicated include in hv_init
Drivers: hv: vmbus: Remove unused code to check for subchannels
Drivers: hv: vmbus: Initialize VMbus ring buffer for Isolation VM
Drivers: hv: vmbus: Add SNP support for VMbus channel initiate message
x86/hyperv: Add ghcb hvcall support for SNP VM
x86/hyperv: Add Write/Read MSR registers via ghcb page
Drivers: hv: vmbus: Mark vmbus ring buffer visible to host in Isolation VM
x86/hyperv: Add new hvcall guest address host visibility support
x86/hyperv: Initialize shared memory boundary in the Isolation VM.
x86/hyperv: Initialize GHCB page in Isolation VM
keep old userspace from breaking. Adjust the corresponding iopl selftest
to that.
- Improve stack overflow warnings to say which stack got overflowed and
raise the exception stack sizes to 2 pages since overflowing the single
page of exception stack is very easy to do nowadays with all the tracing
machinery enabled. With that, rip out the custom mapping of AMD SEV's
too.
- A bunch of changes in preparation for FGKASLR like supporting more
than 64K section headers in the relocs tool, correct ORC lookup table
size to cover the whole kernel .text and other adjustments.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/uugACgkQEsHwGGHe
VUroKw//e8BJ3Aun8bg00FHxfiMGbPYcozjLGDkaoMtMDZ8WlfCUrvtqYICEr8eB
UU0eRyygAPI167dre1O9JvAcbilkNTKntaU6qbu/ZVyUwS3+Jkjwsotbqn3xKtkd
QDDTDNiCU+beCJ2ZbspbrPgEh13+H0MwMHUfRxZB9Scpmo6aGSEaU3g295f6GX57
VFGJ/LNov5MV1dTD7Pp/h6/Nb+R6WmflKcBzJmQxYuKyKX+g1xsSv0VSga+t+uf3
M9pUkizqTiUxzC2eLgtcEZTqqBHu810E8M76FmhKBUMilsFJT5YAJTiqyahwHXds
HYarOFRgcnFuJPd29vn8UHjqeeoi6ru8GtcZYzccEc7U3ku/gXPaDJ9ffmvhs7vU
pJX5Um3GiiFm0w/ZZOKDqh78wRAsCKLN+jIoyszuhkkNchZSj/jKfOgdd3EmcZst
6L6rxBA4oRHwNOgM7uVMp+jFeRe1/prR280OWWH0D4QmmuqybThOdO23Iuh/Deth
W3qPUH3UQtfSWxGy2yODzJ1ciuGAr/AzJZ9zjg04e3Vl0DkEpyWtLKJiG3ClXZag
Nj+3xc4xYH2Aw+M0HRaONk5XVKLpqVjuAfgU5iLQa0YSUbtrR+wCWvY8KgQNbAqK
xZmzYzQ89stwVCuGKx10gPsL3jSJ3VCylMfqdHD2Ajmld1yApr0=
=DOZU
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Do not #GP on userspace use of CLI/STI but pretend it was a NOP to
keep old userspace from breaking. Adjust the corresponding iopl
selftest to that.
- Improve stack overflow warnings to say which stack got overflowed and
raise the exception stack sizes to 2 pages since overflowing the
single page of exception stack is very easy to do nowadays with all
the tracing machinery enabled. With that, rip out the custom mapping
of AMD SEV's too.
- A bunch of changes in preparation for FGKASLR like supporting more
than 64K section headers in the relocs tool, correct ORC lookup table
size to cover the whole kernel .text and other adjustments.
* tag 'x86_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/x86/iopl: Adjust to the faked iopl CLI/STI usage
vmlinux.lds.h: Have ORC lookup cover entire _etext - _stext
x86/boot/compressed: Avoid duplicate malloc() implementations
x86/boot: Allow a "silent" kaslr random byte fetch
x86/tools/relocs: Support >64K section headers
x86/sev: Make the #VC exception stacks part of the default stacks storage
x86: Increase exception stack sizes
x86/mm/64: Improve stack overflow warnings
x86/iopl: Fake iopl(3) CLI/STI usage
Put the definitions of the hypercalls usable only by pv guests inside
CONFIG_XEN_PV sections.
On Arm two dummy functions related to pv hypercalls can be removed.
While at it remove the no longer supported tmem hypercall definition.
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20211028081221.2475-3-jgross@suse.com
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
There are some remaining 32-bit pv-guest support leftovers in the Xen
hypercall interface. Remove them.
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20211028081221.2475-2-jgross@suse.com
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Xen-pciback driver was designed to be built for x86 only. But it
can also be used by other architectures, e.g. Arm.
Currently PCI backend implements multiple functionalities at a time,
such as:
1. It is used as a database for assignable PCI devices, e.g. xl
pci-assignable-{add|remove|list} manipulates that list. So, whenever
the toolstack needs to know which PCI devices can be passed through
it reads that from the relevant sysfs entries of the pciback.
2. It is used to hold the unbound PCI devices list, e.g. when passing
through a PCI device it needs to be unbound from the relevant device
driver and bound to pciback (strictly speaking it is not required
that the device is bound to pciback, but pciback is again used as a
database of the passed through PCI devices, so we can re-bind the
devices back to their original drivers when guest domain shuts down)
3. Device reset for the devices being passed through
4. Para-virtualised use-cases support
The para-virtualised part of the driver is not always needed as some
architectures, e.g. Arm or x86 PVH Dom0, are not using backend-frontend
model for PCI device passthrough.
For such use-cases make the very first step in splitting the
xen-pciback driver into two parts: Xen PCI stub and PCI PV backend
drivers.
For that add new configuration options CONFIG_XEN_PCI_STUB and
CONFIG_XEN_PCIDEV_STUB, so the driver can be limited in its
functionality, e.g. no support for para-virtualised scenario.
x86 platform will continue using CONFIG_XEN_PCIDEV_BACKEND for the
fully featured backend driver.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Anastasiia Lukianenko <anastasiia_lukianenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20211028143620.144936-1-andr2000@gmail.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The initial pvops functions handling irq flags will only ever be called
before interrupts are being enabled.
So switch them to be dummy functions:
- xen_save_fl() can always return 0
- xen_irq_disable() is a nop
- xen_irq_enable() can BUG()
Add some generic paravirt functions for that purpose.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20211028072748.29862-3-jgross@suse.com
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
xen_pvh_init() is lacking a prototype in a header, add it.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20211006061950.9227-1-jgross@suse.com
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
- kprobes: Restructured stack unwinder to show properly on x86 when a stack
dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only denying
others). There's been pressure to allow non root to tracefs in a
controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function tracer
instead of having its own trampoline (this change will happen on an arch
by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform calculations
against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent warnings
from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over if
branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYYBdxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qp1sAQD2oYFwaG3sx872gj/myBcHIBSKdiki
Hry5csd8zYDBpgD+Poylopt5JIbeDuoYw/BedgEXmscZ8Qr7VzjAXdnv/Q4=
=Loz8
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- kprobes: Restructured stack unwinder to show properly on x86 when a
stack dump happens from a kretprobe callback.
- Fix to bootconfig parsing
- Have tracefs allow owner and group permissions by default (only
denying others). There's been pressure to allow non root to tracefs
in a controlled fashion, and using groups is probably the safest.
- Bootconfig memory managament updates.
- Bootconfig clean up to have the tools directory be less dependent on
changes in the kernel tree.
- Allow perf to be traced by function tracer.
- Rewrite of function graph tracer to be a callback from the function
tracer instead of having its own trampoline (this change will happen
on an arch by arch basis, and currently only x86_64 implements it).
- Allow multiple direct trampolines (bpf hooks to functions) be batched
together in one synchronization.
- Allow histogram triggers to add variables that can perform
calculations against the event's fields.
- Use the linker to determine architecture callbacks from the ftrace
trampoline to allow for proper parameter prototypes and prevent
warnings from the compiler.
- Extend histogram triggers to key off of variables.
- Have trace recursion use bit magic to determine preempt context over
if branches.
- Have trace recursion disable preemption as all use cases do anyway.
- Added testing for verification of tracing utilities.
- Various small clean ups and fixes.
* tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits)
tracing/histogram: Fix semicolon.cocci warnings
tracing/histogram: Fix documentation inline emphasis warning
tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together
tracing: Show size of requested perf buffer
bootconfig: Initialize ret in xbc_parse_tree()
ftrace: do CPU checking after preemption disabled
ftrace: disable preemption when recursion locked
tracing/histogram: Document expression arithmetic and constants
tracing/histogram: Optimize division by a power of 2
tracing/histogram: Covert expr to const if both operands are constants
tracing/histogram: Simplify handling of .sym-offset in expressions
tracing: Fix operator precedence for hist triggers expression
tracing: Add division and multiplication support for hist triggers
tracing: Add support for creating hist trigger variables from literal
selftests/ftrace: Stop tracing while reading the trace file by default
MAINTAINERS: Update KPROBES and TRACING entries
test_kprobes: Move it from kernel/ to lib/
docs, kprobes: Remove invalid URL and add new reference
samples/kretprobes: Fix return value if register_kretprobe() failed
lib/bootconfig: Fix the xbc_get_info kerneldoc
...
Cross-architecture update to move task_struct::cpu back into thread_info
on arm64, x86, s390, powerpc, and riscv. All Acked by arch maintainers.
Quoting Ard Biesheuvel:
"Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that define
raw_smp_processor_id() in terms of this field, as it requires
linux/sched.h to be included, which causes a lot of pain in terms of
circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from) for all
architectures that enable THREAD_INFO_IN_TASK, addressing the header
soup issue as well as some pointless differences in the implementations
of task_cpu() and set_task_cpu()."
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmGAEPYWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJq4wEACItgLuyzPgB2eSLVMc3sHPIWcn
EUWbAWsuzJH79wmJtn2AKxW/C5OLBNGeoNjkXQvFN3ULkQDPrfCpB4x/tB6CjIQI
WRDf8kO7oaAD85ZrbSwyFl/MFfrD67f6H1HZoB9FKWAzuv/Bp2xQ0Kf06Dv4HEZp
CzprzZuWtjHB+qgyy+EpGOge3zbFmCuYPE2QpMYLWgs1rcVW9OYvoCI6AYtNefrC
6Kl6CbmBb1k6lFxkhM7wvRcIJthBl6Bajpc3Z2uL1aLb27dVpQZs3YpY859Knb6U
ZpOQCRJOMui3HOxyF3bDUI37y0XVLm6xaNM6C/7i0XS1GiFlSxkGVamg+Mp7anpI
+hdK5kqtSagaBC9CaJvRHnWIex1npQAfiyDNdyiEbrsUJ1dp6/zZcQSe4/m/XRbi
vywQPGxU9f1ASshzHsGU2TJf7Ps7qHulUsS5fKwmHU2ZjQnbYCoPN10JGO9gKjOX
yioN5xsKnbPY9j0ys3l9XBqaMJ8KAr1XspplTGIMZIVbjNMlqrfgbg8Qn8T8WGM7
oUqudMIxczilj0/iEGfGRxBeFaYAfhGQCDnxNlNX9g7Xe/gHTJgNYlHVxL55jHNu
AoPE3Gd0X8K9fbov0BCB6a21XwGJ6Wj+FSrnvuyWrRuy8JWiDFJaVKUBEcalKr7a
MhoUNQPu5M83OdC42A==
=PzvV
-----END PGP SIGNATURE-----
Merge tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull thread_info update to move 'cpu' back from task_struct from Kees Cook:
"Cross-architecture update to move task_struct::cpu back into
thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch
maintainers.
Quoting Ard Biesheuvel:
'Move task_struct::cpu back into thread_info
Keeping CPU in task_struct is problematic for architectures that
define raw_smp_processor_id() in terms of this field, as it
requires linux/sched.h to be included, which causes a lot of pain
in terms of circular dependencies (aka 'header soup')
This series moves it back into thread_info (where it came from)
for all architectures that enable THREAD_INFO_IN_TASK, addressing
the header soup issue as well as some pointless differences in the
implementations of task_cpu() and set_task_cpu()'"
* tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
riscv: rely on core code to keep thread_info::cpu updated
powerpc: smp: remove hack to obtain offset of task_struct::cpu
sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y
powerpc: add CPU field to struct thread_info
s390: add CPU field to struct thread_info
x86: add CPU field to struct thread_info
arm64: add CPU field to struct thread_info
- Non-urgent fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/xXMACgkQEsHwGGHe
VUpFohAAn1FcRfgUh4a7SZQudhWaYPye0Yaf9c9acJIDYfls4Qg3ZLvSNGS0QChW
pcjNQzr42UymxZKq1t6JGaUlD0vkfW0p+w5wueeIxMltWG0oZXgUPhqWrFTLwBtR
g5Gio3Jum1CULCMokS6W4MjJSkTtX5NyYPg+m5Siowy10cbBdYA4wJaKnwGslPT7
4pCDQP5159cjmG9WthKppxUdFy/vql0NJhjxmUkha39eVJ7yLoWvJoubQqqGnqXF
XHwFolZGBxm4Ed4XoUjtz4HgI0VD1JOImUBPqnaE/uyrU7bqqywe5/PpZP051xtF
anpWBm8KbZFsh220bSRJdFQxQBiXaIA41tfBiqVQhrgPy6TKgq7glhD4/ZjvUAdu
DDg2HYEnK3dBAOCa7zIj/+uTijD1nvvuhQblGB2PnvnD2RWWgl+0vZ9Wqspo0EyW
ry5V7hGCMC3mgFexTtvwd1hvMJVYrKfyn2XcP9B+zdgpUJ9DprB+g1O1J6NkGe1r
SKS6itMokVRd+I+16iFQh0PuywqldbNv9dby6bd+dtvxAcVER2vUA0C7wmjqX4Mx
bpftPrNhdNmgQAYlN/tRIfh2t2cFTJnWegVBBErdEfafiqKL9lU8gQlMVgwY10o+
a1ALQ5cUI9Y0xS4cJtfVBVIekqIwEbmniS66iMlMiEJx+Ar6T8g=
=Gql9
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Export sev_es_ghcb_hv_call() so that HyperV Isolation VMs can use it
too
- Non-urgent fixes and cleanups
* tag 'x86_sev_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV
x86/sev: Allow #VC exceptions on the VC2 stack
x86/sev: Fix stack type check in vc_switch_off_ist()
x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.c
x86/sev: Carve out HV call's return value verification
clears the segment base when a null selector is written. Do the explicit
detection on older CPUs, zen2 and hygon specifically, which have the
functionality but do not advertize the CPUID bit. Factor in the presence
of a hypervisor underneath the kernel and avoid doing the explicit check
there which the HV might've decided to not advertize for migration
safety reasons, a.o.
- Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting
those CPUs in the hardware vulnerabilities detection
- Force the compiler to use rIP-relative addressing in the fallback path of
static_cpu_has(), in order to avoid unnecessary register pressure
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/wRgACgkQEsHwGGHe
VUoGQBAAk9V9//FMoENuGFGul/IK8+VBibTfztYgaPvm7vjMDYaYuRBCQiZg5Y8U
D14pwkg7CuRa6iwZmrk/X/y6FVjo5BJA//ROk/n/9JNvV5QUp3/o00uLiziv80K3
H6Wm3PUyGgkpBuJg+/K8SLE9UQ6uSh4nsykS+70Dcd45DtkC/vH8pkDs5Q1fVQwb
7AuOuWTCWKUYOMFYWFI3a9D8tZYhg99ABREbXBaJGiGdIlZKNVe/7W8qQw5s6cVA
cD5Q2ILY2RCGP55ZQiWoFy3XNP3/ygvZ7Zm1ARYUvUMR2Y5X2XJWN/B6oMbc0oEu
OZsDDA/ILYcah9eBV/zk4ON/1djksp1iWNXNxjct0cNBPAKxi6T/HhHuIHBtzvW+
zDyBWUMLlv1m2i1oW4J4NuNJJi9Gaz+7PesmI7C0OQPgywR8UqqfMD+TzlEHWya1
YqYqI0f3aiyC/sLjUp3GSA7a9sWSd3BZfyAlLBJZCxyXAxX92tXX5BRPh/KYbnJn
c/NaYA6X4m4Rdvr0gKKtCklaC6w4GLzVak6wIvftzHlUYsWX21BhnTkQrciKbqc+
AKWed41AO+4pDHROePxc409x3UZolti+1RandikrztIVAolVJ6W/OkHWxXfy28Fg
iSrtl4M3omv8fCHDaJ26STrXqxH8pIK8noVolwQoXKyAFVyvXTk=
=rlVy
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- Start checking a CPUID bit on AMD Zen3 which states that the CPU
clears the segment base when a null selector is written. Do the
explicit detection on older CPUs, zen2 and hygon specifically, which
have the functionality but do not advertize the CPUID bit. Factor in
the presence of a hypervisor underneath the kernel and avoid doing
the explicit check there which the HV might've decided to not
advertize for migration safety reasons, or similar.
- Add support for a new X86 CPU vendor: VORTEX. Needed for whitelisting
those CPUs in the hardware vulnerabilities detection
- Force the compiler to use rIP-relative addressing in the fallback
path of static_cpu_has(), in order to avoid unnecessary register
pressure
* tag 'x86_cpu_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix migration safety with X86_BUG_NULL_SEL
x86/CPU: Add support for Vortex CPUs
x86/umip: Downgrade warning messages to debug loglevel
x86/asm: Avoid adding register pressure for the init case in static_cpu_has()
x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/ucwACgkQEsHwGGHe
VUpLMQ//d4xim4zD4hQVleYkGWqA2nB050QtutIto1nvsiZdjrUSMjJGZnos2nLd
9tY3NZtgrFfAyjUkal098L+2zed+U6UemIV6kT1F3TnWg4dYByxYABNutOsQGUgw
o4sTwjG7ELC273yPt/WY9TMwfMCiX7t80QkjoSeWbkApdfB0aZoxB0CvdLBKwCl/
bxdfX1uvqW7sc6fatcI634hC1HDw8GJThym4/lrMHq2Pr8n/U6pEWoBFsdlprnLk
pqb3IGX3kNnpjTmCpZxvd4ZQV8xUlMcJkdEFjKDf7BLtWjwZxPIdGcfnxrpf2EJQ
yVZklcabaBNz/zNkoQeyD6Ix1ZCFSxcHRhg0BJpvvhzQ91My2pGZgLuzUYz3Fk7G
GjWZje8WZcL3ViL9oGbOYMLSw76wov95+8WMiyKqPaNuzZbS3py5C/ThgqpCdg5b
WyQe0GhUvthzLsVz9Gu7OFrbZl6VBz8q7/bxuo+vpFhgC1EiOj2yPSZNUJBRKdcd
cFSfybcjk3Qyf7YXmZ/NcD9TQARQO1ediRY6ZNeZr7JYPzyebY+wTfHqDvdX65S5
i/zgeAX4XAuX4pl28nJvDe8x1P7t5T8L6Qno9Lnd1xMG7jWift9RSEOo29rUp0sw
gA9xV/BsmApvyM8pgD/lAqxAFzGkYfSy8bB6uav8HccHprVfJE0=
=4BVs
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
"The usual round of random minor fixes and cleanups all over the place"
* tag 'x86_cleanups_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Makefile: Remove unneeded whitespaces before tabs
x86/of: Kill unused early_init_dt_scan_chosen_arch()
x86: Fix misspelled Kconfig symbols
x86/Kconfig: Remove references to obsolete Kconfig symbols
x86/smp: Remove unnecessary assignment to local var freq_scale
by confidential computing solutions to query different aspects of the
system. The intent behind it is to unify testing of such aspects instead
of having each confidential computing solution add its own set of tests
to code paths in the kernel, leading to an unwieldy mess.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/uLUACgkQEsHwGGHe
VUqGbQ/+LOmz8hmL5vtbXw/lVonCSBRKI2KVefnN2VtQ3rjtCq8HlNoq/hAdi15O
WntABFV8u4daNAcssp+H/p+c8Mt/NzQa60TRooC5ZIynSOCj4oZQxTWjcnR4Qxrf
oABy4sp09zNW31qExtTVTwPC/Ejzv4hA0Vqt9TLQOSxp7oYVYKeDJNp79VJK64Yz
Ky7epgg8Pauk0tAT76ATR4kyy9PLGe4/Ry0bOtAptO4NShL1RyRgI0ywUmptJHSw
FV/MnoexdAs4V8+4zPwyOkf8YMDnhbJcvFcr7Yd9AEz2q9Z1wKCgi1M3aZIoW8lV
YMXECMGe9DfxmEJbnP5zbnL6eF32x+tbq+fK8Ye4V2fBucpWd27zkcTXjoP+Y+zH
NLg+9QykR9QCH75YCOXcAg1Q5hSmc4DaWuJymKjT+W7MKs89ywjq+ybIBpLBHbQe
uN9FM/CEKXx8nQwpNQc7mdUE5sZeCQ875028RaLbLx3/b6uwT6rBlNJfxl/uxmcZ
iF1kG7Cx4uO+7G1a9EWgxtWiJQ8GiZO7PMCqEdwIymLIrlNksAk7nX2SXTuH5jIZ
YDuBj/Xz2UUVWYFm88fV5c4ogiFlm9Jeo140Zua/BPdDJd2VOP013rYxzFE/rVSF
SM2riJxCxkva8Fb+8TNiH42AMhPMSpUt1Nmd1H2rcEABRiT83Ow=
=Na0U
-----END PGP SIGNATURE-----
Merge tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull generic confidential computing updates from Borislav Petkov:
"Add an interface called cc_platform_has() which is supposed to be used
by confidential computing solutions to query different aspects of the
system.
The intent behind it is to unify testing of such aspects instead of
having each confidential computing solution add its own set of tests
to code paths in the kernel, leading to an unwieldy mess"
* tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
treewide: Replace the use of mem_encrypt_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
x86/sev: Replace occurrences of sev_active() with cc_platform_has()
x86/sme: Replace occurrences of sme_active() with cc_platform_has()
powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
x86/sev: Add an x86 version of cc_platform_has()
arch/cc: Introduce a function to check for confidential computing features
x86/ioremap: Selectively build arch override encryption functions
of normal functions. This is in preparation of making the MCA code
noinstr-aware
- When the kernel copies data from user addresses and it encounters a
machine check, a SIGBUS is sent to that process. Change this action to
either an -EFAULT which is returned to the user or a short write, making
the recovery action a lot more user-friendly
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmF/s8sACgkQEsHwGGHe
VUqnaQ/8DIHkIOF6vy2w56snJwCj0XQYNLO+Clf6sHJ7ukWpWDoAi6HzvjqrBmaa
bQEdOLeO92wGtVutCQ5ndzq2SJ6UFcZtOulpHyzCpwNinhY2QMsPG6pkSzeaAy/e
aR4gpTY6pyCJyWl5DXXr7FMzBZVaWYdtZ2szPKmW1d1mLeDIdv5d3hInDbZ48XJF
o+fZx0uuK0CIuDjDujRNvkPbHXLbBSqSLCTRf66o+sCY5ZXHlAipabxa3UmhHKvd
dBxMrlObAaDBmDjqpc/YpS4IfWZb7+rHQfVmiq5O85ExXx6cyF6vlM7GI/5VBxSA
2dVcZX/3TsSqGbFdVygbcF6e/Yl1xhP5AE+pBb5jpzbzEaf4oiM8MDhoMAai3lEL
7CFsXL2oyAzho7QQsUSkv/hffHHrph2/aUZbGJlz6SdeRF9aoIjZANpcwm44TZrk
c11Fh1MLTDxx8uhCGrYFXqR8QgeTi4B+8d/CEXNJnkLXZMfSUtoL1iIzhBpsGkv3
r0JOIG2o5dGX2lLhQOiHZ+us33O1e8mvOli9P1jLoDttoKvNqSqLUuwpBCz4sc0E
ugfarf7v/R07NN+7SIT+O83ZG8dXxIRPzHm/g7wjZYgyOfEBgFSMBKVWXRotPo/f
aY88sDVyvF5sbYnUcA6zZANBCKAVfilqdMgCyaoGegoNGzDOCYE=
=bIZq
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Get rid of a bunch of function pointers used in MCA land in favor of
normal functions. This is in preparation of making the MCA code
noinstr-aware
- When the kernel copies data from user addresses and it encounters a
machine check, a SIGBUS is sent to that process. Change this action
to either an -EFAULT which is returned to the user or a short write,
making the recovery action a lot more user-friendly
* tag 'ras_core_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Sort mca_config members to get rid of unnecessary padding
x86/mce: Get rid of the ->quirk_no_way_out() indirect call
x86/mce: Get rid of msr_ops
x86/mce: Get rid of machine_check_vector
x86/mce: Get rid of the mce_severity function pointer
x86/mce: Drop copyin special case for #MC
x86/mce: Change to not send SIGBUS error during copy from user
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from explicit
error codes to a boolean fail/success as that's all what the calling
code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX support:
- Distangle the public header maze and remove especially the misnomed
kitchen sink internal.h which is despite it's name included all over
the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime by
flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code into
the FPU core which removes the number of exports and avoids adding
even more export when AMX has to be supported in KVM. This also
removes duplicated code which was of course unnecessary different and
incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new fpstate
container and just switching the buffer pointer from the user space
buffer to the KVM guest buffer when entering vcpu_run() and flipping
it back when leaving the function. This cuts the memory requirements
of a vCPU for FPU buffers in half and avoids pointless memory copy
operations.
This also solves the so far unresolved problem of adding AMX support
because the current FPU buffer handling of KVM inflicted a circular
dependency between adding AMX support to the core and to KVM. With
the new scheme of switching fpstate AMX support can be added to the
core code without affecting KVM.
- Replace various variables with proper data structures so the extra
information required for adding dynamically enabled FPU features (AMX)
can be added in one place
- Add AMX (Advanved Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD)
which allows to trap the (first) use of an AMX related instruction,
which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra 8K
or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and cleared
on exec(). The permission policy of the kernel is restricted to
sigaltstack size validation, but the syscall obviously allows
further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
takes granted permissions and the potentially resulting larger
signal frame into account. This mechanism can also be used to
enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support was
added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the use
of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have been
disabled in XCR0. If permission has been granted, then a new fpstate
which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler sends
SIGSEGV to the task. That's not elegant, but unavoidable as the
other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused by
unexpected memory allocation failures is not a fundamentally new
concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is disarmed
for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with the
same life time rules as the FPU register state itself. The mechanism
is keyed off with a static key which is default disabled so !AMX
equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
is limited by comparing the tasks XFD value with a per CPU shadow
variable to avoid redundant MSR writes. In case of switching from a
AMX using task to a non AMX using task or vice versa, the extra MSR
write is obviously inevitable.
All other places which need to be aware of the variable feature sets
and resulting variable sizes are not affected at all because they
retrieve the information (feature set, sizes) unconditonally from
the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support is in
the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which has
not been caught in review and testing right away was restricted to AMX
enabled systems, which is completely irrelevant for anyone outside Intel
and their early access program. There might be dragons lurking as usual,
but so far the fine grained refactoring has held up and eventual yet
undetected fallout is bisectable and should be easily addressable before
the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity to
follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for inclusion
into 5.16-rc1.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/NkITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodDkEADH4+/nN/QoSUHIuuha5Zptj3g2b16a
/3TxT9fhwPen/kzMGsUk70s3iWJMA+I5dCfkSZexJ2hfhcRe9cBzZIa1HCawKwf3
YCISTsO/M+LpeORuZ+TpfFLJKnxNr1SEOl+EYffGhq0AkCjifb9Cnr0JZuoMUzGU
jpfJZ2bj28ri5lG812DtzSMBM9E3SAwgJv+GNjmZbxZKb9mAfhbAMdBUXHirX7Ej
jmx6koQjYOKwYIW8w1BrdC270lUKQUyJTbQgdRkN9Mh/HnKyFixQ18JqGlgaV2cT
EtYePUfTEdaHdAhUINLIlEug1MfOslHU+HyGsdywnoChNB4GHPQuePC5Tz60VeFN
RbQ9aKcBUu8r95rjlnKtAtBijNMA4bjGwllVxNwJ/ZoA9RPv1SbDZ07RX3qTaLVY
YhVQl8+shD33/W24jUTJv1kMMexpHXIlv0gyfMryzpwI7uzzmGHRPAokJdbYKctC
dyMPfdE90rxTiMUdL/1IQGhnh3awjbyfArzUhHyQ++HyUyzCFh0slsO0CD18vUy8
FofhCugGBhjuKw3XwLNQ+KsWURz5qHctSzBc3qMOSyqFHbAJCVRANkhsFvWJo2qL
75+Z7OTRebtsyOUZIdq26r4roSxHrps3dupWTtN70HWx2NhQG1nLEw986QYiQu1T
hcKvDmehQLrUvg==
=x3WL
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from
explicit error codes to a boolean fail/success as that's all what the
calling code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX
support:
- Distangle the public header maze and remove especially the
misnomed kitchen sink internal.h which is despite it's name
included all over the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime
by flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code
into the FPU core which removes the number of exports and avoids
adding even more export when AMX has to be supported in KVM.
This also removes duplicated code which was of course
unnecessary different and incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new
fpstate container and just switching the buffer pointer from the
user space buffer to the KVM guest buffer when entering
vcpu_run() and flipping it back when leaving the function. This
cuts the memory requirements of a vCPU for FPU buffers in half
and avoids pointless memory copy operations.
This also solves the so far unresolved problem of adding AMX
support because the current FPU buffer handling of KVM inflicted
a circular dependency between adding AMX support to the core and
to KVM. With the new scheme of switching fpstate AMX support can
be added to the core code without affecting KVM.
- Replace various variables with proper data structures so the
extra information required for adding dynamically enabled FPU
features (AMX) can be added in one place
- Add AMX (Advanced Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR
(MSR_XFD) which allows to trap the (first) use of an AMX related
instruction, which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra
8K or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and
cleared on exec(). The permission policy of the kernel is
restricted to sigaltstack size validation, but the syscall
obviously allows further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2)
which takes granted permissions and the potentially resulting
larger signal frame into account. This mechanism can also be used
to enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support
was added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the
use of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have
been disabled in XCR0. If permission has been granted, then a new
fpstate which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler
sends SIGSEGV to the task. That's not elegant, but unavoidable as
the other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused
by unexpected memory allocation failures is not a fundamentally
new concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is
disarmed for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with
the same life time rules as the FPU register state itself. The
mechanism is keyed off with a static key which is default
disabled so !AMX equipped CPUs have zero overhead. On AMX enabled
CPUs the overhead is limited by comparing the tasks XFD value
with a per CPU shadow variable to avoid redundant MSR writes. In
case of switching from a AMX using task to a non AMX using task
or vice versa, the extra MSR write is obviously inevitable.
All other places which need to be aware of the variable feature
sets and resulting variable sizes are not affected at all because
they retrieve the information (feature set, sizes) unconditonally
from the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support
is in the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which
has not been caught in review and testing right away was restricted
to AMX enabled systems, which is completely irrelevant for anyone
outside Intel and their early access program. There might be dragons
lurking as usual, but so far the fine grained refactoring has held up
and eventual yet undetected fallout is bisectable and should be
easily addressable before the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity
to follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for
inclusion into 5.16-rc1
* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
Documentation/x86: Add documentation for using dynamic XSTATE features
x86/fpu: Include vmalloc.h for vzalloc()
selftests/x86/amx: Add context switch test
selftests/x86/amx: Add test cases for AMX state management
x86/fpu/amx: Enable the AMX feature in 64-bit mode
x86/fpu: Add XFD handling for dynamic states
x86/fpu: Calculate the default sizes independently
x86/fpu/amx: Define AMX state components and have it used for boot-time checks
x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
x86/fpu/xstate: Add fpstate_realloc()/free()
x86/fpu/xstate: Add XFD #NM handler
x86/fpu: Update XFD state where required
x86/fpu: Add sanity checks for XFD
x86/fpu: Add XFD state to fpstate
x86/msr-index: Add MSRs for XFD
x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
x86/fpu: Reset permission and fpstate on exec()
x86/fpu: Prepare fpu_clone() for dynamically enabled features
x86/fpu/signal: Prepare for variable sigframe length
x86/signal: Use fpu::__state_user_size for sigalt stack validation
...
- Revert the printk format based wchan() symbol resolution as it can leak
the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset and
__sched_setscheduler() introduced a new lock dependency which is now
triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/OUkTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoR/5D/9ikdGNpKg9osNqJ3GjAmxsK6kVkB29
iFe2k8pIpWDToWQf/wQRGih4Yj3Cl49QSnZcPIibh2/12EB1qrrW6iSPJkInz8Ec
/1LS5/Vewn2OyoxyXZjdvGC5gTXEodSbIazASvX7nvdMeI4gsAsL5etzrMJirT/t
aymqvr7zovvywrwMTQJrGjUMo9l4ewE8tafMNNhRu1BHU1U4ojM9yvThyRAAcmp7
3Xy49A+Yq3IgrvYI4u8FMK5Zh08KaxSFjiLhePGm/bF+wSfYmWop2TP1jY05W2Uo
ti8hfbJMUoFRYuMxAiEldkItnc0wV4M9PtWZZ/x+B71bs65Y4Zjt9cW+rxJv2+m1
vzV31EsQwGnOti072dzWN4c/cZqngVXAjaNtErvDwJUr+Tw1ayv9KUvuodMQqZY6
mu68bFUO2kV9EMe1CBOv51Uy1RGHyLj3rlNqrkw+Xp5ISE9Ad2vhUEiRp5bQx5Ci
V/XFhGZkGUluh0vccrdFlNYZwhj8cZEzkOPCnPSeZ+bq8SyZE6xuHH/lTP1CJCOy
s800rW1huM+kgV+zRN8adDkGXibAk9N3RtVGnQXmuEy8gB9LZmQg+JeM2wsc9B+6
i0gdqZnsjNAfoK+BBAG4holxptSL8/eOJsFH8ZNIoxQ+iqooyPx9tFX7yXnRTBQj
d2qWG7UvoseT+g==
=fgtS
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- Revert the printk format based wchan() symbol resolution as it can
leak the raw value in case that the symbol is not resolvable.
- Make wchan() more robust and work with all kind of unwinders by
enforcing that the task stays blocked while unwinding is in progress.
- Prevent sched_fork() from accessing an invalid sched_task_group
- Improve asymmetric packing logic
- Extend scheduler statistics to RT and DL scheduling classes and add
statistics for bandwith burst to the SCHED_FAIR class.
- Properly account SCHED_IDLE entities
- Prevent a potential deadlock when initial priority is assigned to a
newly created kthread. A recent change to plug a race between cpuset
and __sched_setscheduler() introduced a new lock dependency which is
now triggered. Break the lock dependency chain by moving the priority
assignment to the thread function.
- Fix the idle time reporting in /proc/uptime for NOHZ enabled systems.
- Improve idle balancing in general and especially for NOHZ enabled
systems.
- Provide proper interfaces for live patching so it does not have to
fiddle with scheduler internals.
- Add cluster aware scheduling support.
- A small set of tweaks for RT (irqwork, wait_task_inactive(), various
scheduler options and delaying mmdrop)
- The usual small tweaks and improvements all over the place
* tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
sched/fair: Cleanup newidle_balance
sched/fair: Remove sysctl_sched_migration_cost condition
sched/fair: Wait before decaying max_newidle_lb_cost
sched/fair: Skip update_blocked_averages if we are defering load balance
sched/fair: Account update_blocked_averages in newidle_balance cost
x86: Fix __get_wchan() for !STACKTRACE
sched,x86: Fix L2 cache mask
sched/core: Remove rq_relock()
sched: Improve wake_up_all_idle_cpus() take #2
irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT
irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT
irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.
sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ
sched: Add cluster scheduler level for x86
sched: Add cluster scheduler level in core and related Kconfig for ARM64
topology: Represent clusters of CPUs within a die
sched: Disable -Wunused-but-set-variable
sched: Add wrapper for get_wchan() to keep task blocked
x86: Fix get_wchan() to support the ORC unwinder
proc: Use task_is_running() for wchan in /proc/$pid/stat
...
- Improve retpoline code patching by separating it from alternatives which
reduces memory footprint and allows to do better optimizations in the
actual runtime patching.
- Add proper retpoline support for x86/BPF
- Address noinstr warnings in x86/kvm, lockdep and paravirtualization code
- Add support to handle pv_opsindirect calls in the noinstr analysis
- Classify symbols upfront and cache the result to avoid redundant
str*cmp() invocations.
- Add a CFI hash to reduce memory consumption which also reduces runtime
on a allyesconfig by ~50%
- Adjust XEN code to make objtool handling more robust and as a side
effect to prevent text fragmentation due to placement of the hypercall
page.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/GFgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoc1JD/0Sz6seP2OUMxbMT3gCcFo9sMvYTdsM
7WuGFbBbnCIo7g8JH7k0zRRBigptMp2eUtQXKkgaaIbWN4JbuVKf8KxN5/qXxLi4
fJ12QnNTGH9N2jtzl5wKmpjaKJnnJMD9D10XwoR+T6gn6NHd+AgLEs7GxxuQUlgo
eC9oEXhNHC8uNhiZc38EwfwmItI1bRgaLrnZWIL4rYGSMxfCK1/cEOpWrFfX9wmj
/diB6oqMyPXZXMCtgpX7TniUr5XOTCcUkeO9mQv5bmyq/YM/8hrTbcVSJlsVYLvP
EsBnUSHAcfLFiHXwa1RNiIGdbiPjbN+UYeXGAvqF58f3e5dTIHtN/UmWo7OH93If
9rLMVNcMpsfPx7QRk2IxEPumLCkyfwjzfKrVDM6P6TKEIUzD1og4IK9gTlfykVsh
56G5XiCOC/X2x8IMxKTLGuBiAVLFHXK/rSwoqhvNEWBFKDbP13QWs0LurBcW09Sa
/kQI9pIBT1xFA/R+OY5Xy1cqNVVK1Gxmk8/bllCijA9pCFSCFM4hLZE5CevdrBCV
h5SdqEK5hIlzFyypXfsCik/4p/+rfvlGfUKtFsPctxx29SPe+T0orx+l61jiWQok
rZOflwMawK5lDuASHrvNHGJcWaTwoo3VcXMQDnQY0Wulc43J5IFBaPxkZzgyd+S1
4lktHxatrCMUgw==
=pfZi
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Thomas Gleixner:
- Improve retpoline code patching by separating it from alternatives
which reduces memory footprint and allows to do better optimizations
in the actual runtime patching.
- Add proper retpoline support for x86/BPF
- Address noinstr warnings in x86/kvm, lockdep and paravirtualization
code
- Add support to handle pv_opsindirect calls in the noinstr analysis
- Classify symbols upfront and cache the result to avoid redundant
str*cmp() invocations.
- Add a CFI hash to reduce memory consumption which also reduces
runtime on a allyesconfig by ~50%
- Adjust XEN code to make objtool handling more robust and as a side
effect to prevent text fragmentation due to placement of the
hypercall page.
* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
bpf,x86: Respect X86_FEATURE_RETPOLINE*
bpf,x86: Simplify computing label offsets
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
x86/alternative: Add debug prints to apply_retpolines()
x86/alternative: Try inline spectre_v2=retpoline,amd
x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
x86/alternative: Implement .retpoline_sites support
x86/retpoline: Create a retpoline thunk array
x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
x86/asm: Fixup odd GEN-for-each-reg.h usage
x86/asm: Fix register order
x86/retpoline: Remove unused replacement symbols
objtool,x86: Replace alternatives with .retpoline_sites
objtool: Shrink struct instruction
objtool: Explicitly avoid self modifying code in .altinstr_replacement
objtool: Classify symbols
objtool: Support pv_opsindirect calls for noinstr
x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
x86/xen: Mark xen_force_evtchn_callback() noinstr
x86/xen: Make irq_disable() noinstr
...
Core changes:
- Prevent a potential deadlock when initial priority is assigned to a
newly created interrupt thread. A recent change to plug a race between
cpuset and __sched_setscheduler() introduced a new lock dependency
which is now triggered. Break the lock dependency chain by moving the
priority assignment to the thread function.
- A couple of small updates to make the irq core RT safe.
- Confine the irq_cpu_online/offline() API to the only left unfixable
user Cavium Octeon so that it does not grow new usage.
- A small documentation update
Driver changes:
- A large cross architecture rework to move irq_enter/exit() into the
architecture code to make addressing the NOHZ_FULL/RCU issues simpler.
- The obligatory new irq chip driver for Microchip EIC
- Modularize a few irq chip drivers
- Expand usage of devm_*() helpers throughout the driver code
- The usual small fixes and improvements all over the place
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF+8BUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWs2EACeNbL93aIFokd2/RllRSr4VvMjKNyW
PpA0RYDOz1Jh4ldK+7b/EYapKgAkR3yyOtz+jyjRE7jsQK0pQeLtYNLd3cTzsD7K
LCvl8rq6cbRqyFoSC15UKKNbQ/f+o/3LeGPoipr5NQZRMepxk2J/yBCNRXHvIbe6
oLMQJUgw7KKtvCrCUX9OSei4F09T1qsNrIYb7QafP5+v0zndAT7uKNivWrKGFrsh
Uk9epoH3hIkvQERkpmzwJEJaq6oyqhoYQy7ZRGayEPwIdCyivJGZrVX0mZk1LX58
uc8u5grIslX9MqZEQWBweR5y7nISB494NGKmoCInu66U/+3DSOg3AGH2Rfw8PNFZ
lMKdXzYoDgv2y6LeiLtTUKV4K1NBRXo0BhwSGbPw0o6C03/x003kG824Y+/naU75
6q05BZSia1PagPV3e0UAm0A2Rnjj/5uso2fEk0eGBSGM27jf9SQcSE8DVrEiLRd1
2N5uAXbMdfu4xACsEI1Uxu1KNOSQnUhBCy0X6Ppj1a083kLG7jg/126ebb05R8G4
MF79PFt+xUPSzmuKc/xwCdANtW+zzoyjYl5w6mwELBJ9veNbPShokGBTN/qzjXKZ
vdr3/pXx95lRAzFnGOnETesm3IyObruU4K8NbMKd2b+eYa0w1WuZCKnutGLfsqxg
byhCEw459e3P2g==
=r6ln
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for the interrupt subsystem:
Core changes:
- Prevent a potential deadlock when initial priority is assigned to a
newly created interrupt thread. A recent change to plug a race
between cpuset and __sched_setscheduler() introduced a new lock
dependency which is now triggered. Break the lock dependency chain
by moving the priority assignment to the thread function.
- A couple of small updates to make the irq core RT safe.
- Confine the irq_cpu_online/offline() API to the only left unfixable
user Cavium Octeon so that it does not grow new usage.
- A small documentation update
Driver changes:
- A large cross architecture rework to move irq_enter/exit() into the
architecture code to make addressing the NOHZ_FULL/RCU issues
simpler.
- The obligatory new irq chip driver for Microchip EIC
- Modularize a few irq chip drivers
- Expand usage of devm_*() helpers throughout the driver code
- The usual small fixes and improvements all over the place"
* tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...
* Fixes for Xen emulator bugs showing up as debug kernel WARNs
* Fix another issue with SEV/ES string I/O VMGEXITs
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmF6uGIUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNRagf/Srvk9lNcRh4cEzsczErKMyr3xOqA
jgsTSqgl1ExJI9sBLMpVYBOFGILMaMSrhLPIltKPy0Bj/E+hw8WOQwPa44QjWlSD
MAUxO1Nryt9Luc2L8uSd1c//g4fr4V1BhOaumk1lM14Q8EDfQBcDIMI2ZKueMU1+
2Q+n8/AsG63jQIINwKNidof0dzRtbfcE30Wq/8QHttIPo5wt6l0YClOlOikqNY8N
5+WSQFmuutHIXftq5Jb/Ldn/+HVukWZyZOEVwLnBpM9uBvIubNgcEakqvxsaVtAn
FHdvnA+Bk99/Xuhl+wRLQo8ofzQIQ13RQv3HPArJAJv34oAJZx2rNObVlA==
=6ofB
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Fixes for s390 interrupt delivery
- Fixes for Xen emulator bugs showing up as debug kernel WARNs
- Fix another issue with SEV/ES string I/O VMGEXITs
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Take srcu lock in post_kvm_run_save()
KVM: SEV-ES: fix another issue with string I/O VMGEXITs
KVM: x86/xen: Fix kvm_xen_has_interrupt() sleeping in kvm_vcpu_block()
KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock
KVM: s390: preserve deliverable_mask in __airqs_kick_single_vcpu
KVM: s390: clear kicked_mask before sleeping again
- More progress on the protected VM front, now with the full
fixed feature set as well as the limitation of some hypercalls
after initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmF7u5YPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD6w8QAIKDLJCTqkxv5Vh4ZSmtXxg4gTZMBlg8oSQ8
sVL639aqBvFe3A6Vmz6IwBm+NT7Sm1zxkuH9qHzVR1gmXq0oLYNrIuyrzRW8PvqO
hIkSRRoVsf03755TmkxwR7/2jAFxb6FhEVAy6VWdQyI44orihIPvMp8aTIq+jvU+
XoNGb/rPf9HpSUtvuaHYvZhSZBhoi5dRnkr33R1+VR69n7Axs8lm905xcl6Pt0a0
QqYZWQvFu/BXPyNflG7LUsegRF/iiV2vNTbNNowkzlV5suqxBpJAp6ApDL/gWrHv
ya/6cMqicSjBIkWnawhXY98w6/5xfzK4IV/zc00FNWOlUdVP89Thqrgc8EkigS9R
BGcxFFqj41snr+ensSBBIkNtV+dBX52H3rUE0F9seiTXm8QWI86JobdeNadT8tUP
TXdOeCUcA+cp4Ngln18lsbOEaBkPA5H1po1nUFPHbKnVOxnqXScB7E/xF6rAbryV
m+Z+oidU7MyS/Ev/Da0ww/XFx7cs2ez9EgeQvjcdFAvUMqS6kcXEExvgGYlm+KRQ
GBMKPLCNHKdflMANoSpol7MZUmPJ45XoWKW1rntj2r9X+oJW2Z2hEx32xrWDJdqK
ixnbjog5kNZb0CjLGsUC90lo2hpRJecaLhAjgTLYaNC1QxGPrt92eat6gnwuMTBc
mpADqi7w
=qBAO
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.16
- More progress on the protected VM front, now with the full
fixed feature set as well as the limitation of some hypercalls
after initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
- A large cross-arch rework to move irq_enter()/irq_exit() into
the arch code, and removing it from the generic irq code.
Thanks to Mark Rutland for the huge effort!
- A few irqchip drivers are made modular (broadcom, meson), because
that's apparently a thing...
- A new driver for the Microchip External Interrupt Controller
- The irq_cpu_offline()/irq_cpu_online() API is now deprecated and
can only be selected on the Cavium Octeon platform. Once this
platform is removed, the API will be removed at the same time.
- A sprinkle of devm_* helper, as people seem to love that.
- The usual spattering of small fixes and minor improvements.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmF7rnYPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDudEP/i3WmAcXQYKJpRz075M8S6PZ8BXeTKUe7WMK
rrslOkxDqyQ2SVqMLII1xkyOWafC7BnRjexm/ASwrBsc6GyQha7B2YsKy1m/NEwy
ZcnXCCIg71LpDrUyxbscFxB6s5OvUN0yv+a+WnEAmOXpD1x3S8x5tHmRUfsRGksR
zOhKaYPLqgCiw3VHRuhEKFUA+CMjXxHhw3lJv6gPh6TRjdXQuJouau2dBzr7tQEd
h9Jq2OatWXiwPr00hQDDILbdH4+fQYKJqsaaLNX0Pxexg2slRWHwrgA2o/w0tTVW
99HOc9hN04QoLkDfyQis40L1YC7VOIr5OAqzUehdYELT8UsrZS288Rr6099n4M/Y
x8Nzcg4eA+jVUz1VMEBA9qR45fKjEMcTAXyNAAYLsov/obSgGH/PSOYaunG2xvYq
iiJBM/g506PTw2MRROqrH5oKiER3tTD65f5NM0mJONr3xEm9XT74m0JIodgVZ4QX
0LMJytgetg0b+yZcFY25GhJ+2mGoYwB2eiZBVjE3FyLSs0epcuzogaKRi5axK4sN
rvlAtgNZiOg7tzRqiPIQKSzO3dCyJjR86t5fd1cRBl/WPmywvA2Lkcgd09V2oyJe
FEp1QllpgYw0a5+aIS+bdOUK63FLnLdEMas7WgSAAxA4/jjgP1p+SbytOD81psL0
4r02YN2A
=/NLR
-----END PGP SIGNATURE-----
Merge tag 'irqchip-5.16' into irq/core
Merge irqchip updates for Linux 5.16 from Marc Zyngier:
- A large cross-arch rework to move irq_enter()/irq_exit() into
the arch code, and removing it from the generic irq code.
Thanks to Mark Rutland for the huge effort!
- A few irqchip drivers are made modular (broadcom, meson), because
that's apparently a thing...
- A new driver for the Microchip External Interrupt Controller
- The irq_cpu_offline()/irq_cpu_online() API is now deprecated and
can only be selected on the Cavium Octeon platform. Once this
platform is removed, the API will be removed at the same time.
- A sprinkle of devm_* helper, as people seem to love that.
- The usual spattering of small fixes and minor improvements.
* tag 'irqchip-5.16': (912 commits)
h8300: Fix linux/irqchip.h include mess
dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
MIPS: irq: Avoid an unused-variable error
genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
MIPS: loongson64: Drop call to irq_cpu_offline()
irq: remove handle_domain_{irq,nmi}()
irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: riscv: perform irqentry in entry code
irq: openrisc: perform irqentry in entry code
irq: csky: perform irqentry in entry code
irq: arm64: perform irqentry in entry code
irq: arm: perform irqentry in entry code
irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
irq: add generic_handle_arch_irq()
irq: unexport handle_irq_desc()
irq: simplify handle_domain_{irq,nmi}()
irq: mips: simplify do_domain_IRQ()
...
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211029083332.3680101-1-maz@kernel.org
Current BPF codegen doesn't respect X86_FEATURE_RETPOLINE* flags and
unconditionally emits a thunk call, this is sub-optimal and doesn't
match the regular, compiler generated, code.
Update the i386 JIT to emit code equal to what the compiler emits for
the regular kernel text (IOW. a plain THUNK call).
Update the x86_64 JIT to emit code similar to the result of compiler
and kernel rewrites as according to X86_FEATURE_RETPOLINE* flags.
Inlining RETPOLINE_AMD (lfence; jmp *%reg) and !RETPOLINE (jmp *%reg),
while doing a THUNK call for RETPOLINE.
This removes the hard-coded retpoline thunks and shrinks the generated
code. Leaving a single retpoline thunk definition in the kernel.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.614772675@infradead.org
Rewrite retpoline thunk call sites to be indirect calls for
spectre_v2=off. This ensures spectre_v2=off is as near to a
RETPOLINE=n build as possible.
This is the replacement for objtool writing alternative entries to
ensure the same and achieves feature-parity with the previous
approach.
One noteworthy feature is that it relies on the thunks to be in
machine order to compute the register index.
Specifically, this does not yet address the Jcc __x86_indirect_thunk_*
calls generated by clang, a future patch will add this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.232495794@infradead.org
Stick all the retpolines in a single symbol and have the individual
thunks as inner labels, this should guarantee thunk order and layout.
Previously there were 16 (or rather 15 without rsp) separate symbols and
a toolchain might reasonably expect it could displace them however it
liked, with disregard for their relative position.
However, now they're part of a larger symbol. Any change to their
relative position would disrupt this larger _array symbol and thus not
be sound.
This is the same reasoning used for data symbols. On their own there
is no guarantee about their relative position wrt to one aonther, but
we're still able to do arrays because an array as a whole is a single
larger symbol.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.169659320@infradead.org
Because it makes no sense to split the retpoline gunk over multiple
headers.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.106290934@infradead.org
Currently GEN-for-each-reg.h usage leaves GEN defined, relying on any
subsequent usage to start with #undef, which is rude.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120310.041792350@infradead.org
Ensure the register order is correct; this allows for easy translation
between register number and trampoline and vice-versa.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120309.978573921@infradead.org
Now that objtool no longer creates alternatives, these replacement
symbols are no longer needed, remove them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/r/20211026120309.915051744@infradead.org
Hyperv provides GHCB protocol to write Synthetic Interrupt
Controller MSR registers in Isolation VM with AMD SEV SNP
and these registers are emulated by hypervisor directly.
Hyperv requires to write SINTx MSR registers twice. First
writes MSR via GHCB page to communicate with hypervisor
and then writes wrmsr instruction to talk with paravisor
which runs in VMPL0. Guest OS ID MSR also needs to be set
via GHCB page.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-7-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Add new hvcall guest address host visibility support to mark
memory visible to host. Call it inside set_memory_decrypted
/encrypted(). Add HYPERVISOR feature check in the
hv_is_isolation_supported() to optimize in non-virtualization
environment.
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-4-ltykernel@gmail.com
[ wei: fix conflicts with tip ]
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Hyperv exposes GHCB page via SEV ES GHCB MSR for SNP guest
to communicate with hypervisor. Map GHCB page for all
cpus to read/write MSR register and submit hvcall request
via ghcb page.
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-2-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmF298ceHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGIJYH/1rsEFQQ6caeQdy1
z9eFIe48DNM4l7bFk+qEj2UAbzPdahVJ299Mg5fW0n2CDemOc9/n0b9TxQ37YObi
mOzu0xwJVupIxkyFMPQSSc2q8aLm67NSpJy08DsmaNses5hSvu8x15RPHLQTybjt
SwtKns+jpCq79P1GWbrB5e5UkLb0VNoxNp4L1U4pMrYGcEkJUXbaxNY2V/JcXdM7
Vtn+qN0T/J6V6QVftv0t8Ecj3bjEnmL3kZHaTaNg3dGeKRpCGyHc5lcBQ0cNFG6t
vjZ9VbuhBzGI3TN2tHH5hpA1UXo7HPBBCwQqxF1jeGLGHULikYwZ3TAPWqL3QZqC
9cxr9SY=
=p75d
-----END PGP SIGNATURE-----
BackMerge tag 'v5.15-rc7' into drm-next
The msm next tree is based on rc3, so let's just backmerge rc7 before pulling it in.
Signed-off-by: Dave Airlie <airlied@redhat.com>
Add the AMX state components in XFEATURE_MASK_USER_SUPPORTED and the
TILE_DATA component to the dynamic states and update the permission check
table accordingly.
This is only effective on 64 bit kernels as for 32bit kernels
XFEATURE_MASK_TILE is defined as 0.
TILE_DATA is caller-saved state and the only dynamic state. Add build time
sanity check to ensure the assumption that every dynamic feature is caller-
saved.
Make AMX state depend on XFD as it is dynamic feature.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211021225527.10184-24-chang.seok.bae@intel.com
The XSTATE initialization uses check_xstate_against_struct() to sanity
check the size of XSTATE-enabled features. AMX is a XSAVE-enabled feature,
and its size is not hard-coded but discoverable at run-time via CPUID.
The AMX state is composed of state components 17 and 18, which are all user
state components. The first component is the XTILECFG state of a 64-byte
tile-related control register. The state component 18, called XTILEDATA,
contains the actual tile data, and the state size varies on
implementations. The architectural maximum, as defined in the CPUID(0x1d,
1): EAX[15:0], is a byte less than 64KB. The first implementation supports
8KB.
Check the XTILEDATA state size dynamically. The feature introduces the new
tile register, TMM. Define one register struct only and read the number of
registers from CPUID. Cross-check the overall size with CPUID again.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211021225527.10184-21-chang.seok.bae@intel.com
The fpstate embedded in struct fpu is the default state for storing the FPU
registers. It's sized so that the default supported features can be stored.
For dynamically enabled features the register buffer is too small.
The #NM handler detects first use of a feature which is disabled in the
XFD MSR. After handling permission checks it recalculates the size for
kernel space and user space state and invokes fpstate_realloc() which
tries to reallocate fpstate and install it.
Provide the allocator function which checks whether the current buffer size
is sufficient and if not allocates one. If allocation is successful the new
fpstate is initialized with the new features and sizes and the now enabled
features is removed from the task's XFD mask.
realloc_fpstate() uses vzalloc(). If use of this mechanism grows to
re-allocate buffers larger than 64KB, a more sophisticated allocation
scheme that includes purpose-built reclaim capability might be justified.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211021225527.10184-19-chang.seok.bae@intel.com
If the XFD MSR has feature bits set then #NM will be raised when user space
attempts to use an instruction related to one of these features.
When the task has no permissions to use that feature, raise SIGILL, which
is the same behavior as #UD.
If the task has permissions, calculate the new buffer size for the extended
feature set and allocate a larger fpstate. In the unlikely case that
vzalloc() fails, SIGSEGV is raised.
The allocation function will be added in the next step. Provide a stub
which fails for now.
[ tglx: Updated serialization ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211021225527.10184-18-chang.seok.bae@intel.com
Add storage for XFD register state to struct fpstate. This will be used to
store the XFD MSR state. This will be used for switching the XFD MSR when
FPU content is restored.
Add a per-CPU variable to cache the current MSR value so the MSR has only
to be written when the values are different.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-15-chang.seok.bae@intel.com
XFD introduces two MSRs:
- IA32_XFD to enable/disable a feature controlled by XFD
- IA32_XFD_ERR to expose to the #NM trap handler which feature
was tried to be used for the first time.
Both use the same xstate-component bitmap format, used by XCR0.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-14-chang.seok.bae@intel.com
Intel's eXtended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
This is going to be used to postpone the allocation of a larger XSTATE
buffer for a task to the point where it is actually using a related
instruction after the permission to use that facility has been granted.
XFD is not used by the kernel, but only applied to userspace. This is a
matter of policy as the kernel knows how a fpstate is reallocated and the
XFD state.
The compacted XSAVE format is adjustable for dynamic features. Make XFD
depend on XSAVES.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-13-chang.seok.bae@intel.com
The default portion of the parent's FPU state is saved in a child task.
With dynamic features enabled, the non-default portion is not saved in a
child's fpstate because these register states are defined to be
caller-saved. The new task's fpstate is therefore the default buffer.
Fork inherits the permission of the parent.
Also, do not use memcpy() when TIF_NEED_FPU_LOAD is set because it is
invalid when the parent has dynamic features.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-11-chang.seok.bae@intel.com
To allow building up the infrastructure required to support dynamically
enabled FPU features, add:
- XFEATURES_MASK_DYNAMIC
This constant will hold xfeatures which can be dynamically enabled.
- fpu_state_size_dynamic()
A static branch for 64-bit and a simple 'return false' for 32-bit.
This helper allows to add dynamic-feature-specific changes to common
code which is shared between 32-bit and 64-bit without #ifdeffery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-8-chang.seok.bae@intel.com
Dynamically enabled XSTATE features are by default disabled for all
processes. A process has to request permission to use such a feature.
To support this implement a architecture specific prctl() with the options:
- ARCH_GET_XCOMP_SUPP
Copies the supported feature bitmap into the user space provided
u64 storage. The pointer is handed in via arg2
- ARCH_GET_XCOMP_PERM
Copies the process wide permitted feature bitmap into the user space
provided u64 storage. The pointer is handed in via arg2
- ARCH_REQ_XCOMP_PERM
Request permission for a feature set. A feature set can be mapped to a
facility, e.g. AMX, and can require one or more XSTATE components to
be enabled.
The feature argument is the number of the highest XSTATE component
which is required for a facility to work.
The request argument is not a user supplied bitmap because that makes
filtering harder (think seccomp) and even impossible because to
support 32bit tasks the argument would have to be a pointer.
The permission mechanism works this way:
Task asks for permission for a facility and kernel checks whether that's
supported. If supported it does:
1) Check whether permission has already been granted
2) Compute the size of the required kernel and user space buffer
(sigframe) size.
3) Validate that no task has a sigaltstack installed
which is smaller than the resulting sigframe size
4) Add the requested feature bit(s) to the permission bitmap of
current->group_leader->fpu and store the sizes in the group
leaders fpu struct as well.
If that is successful then the feature is still not enabled for any of the
tasks. The first usage of a related instruction will result in a #NM
trap. The trap handler validates the permission bit of the tasks group
leader and if permitted it installs a larger kernel buffer and transfers
the permission and size info to the new fpstate container which makes all
the FPU functions which require per task information aware of the extended
feature set.
[ tglx: Adopted to new base code, added missing serialization,
massaged namings, comments and changelog ]
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-7-chang.seok.bae@intel.com
The upcoming prctl() which is required to request the permission for a
dynamically enabled feature will also provide an option to retrieve the
supported features. If the CPU does not support XSAVE, the supported
features would be 0 even when the CPU supports FP and SSE.
Provide separate storage for the legacy feature set to avoid that and fill
in the bits in the legacy init function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-6-chang.seok.bae@intel.com
Dynamically enabled features can be requested by any thread of a running
process at any time. The request does neither enable the feature nor
allocate larger buffers. It just stores the permission to use the feature
by adding the features to the permission bitmap and by calculating the
required sizes for kernel and user space.
The reallocation of the kernel buffer happens when the feature is used
for the first time which is caught by an exception. The permission
bitmap is then checked and if the feature is permitted, then it becomes
fully enabled. If not, the task dies similarly to a task which uses an
undefined instruction.
The size information is precomputed to allow proper sigaltstack size checks
once the feature is permitted, but not yet in use because otherwise this
would open race windows where too small stacks could be installed causing
a later fail on signal delivery.
Initialize them to the default feature set and sizes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211021225527.10184-5-chang.seok.bae@intel.com
Hyper-V needs to issue the GHCB HV call in order to read/write MSRs in
Isolation VMs. For that, expose sev_es_ghcb_hv_call().
The Hyper-V Isolation VMs are unenlightened guests and run a paravisor
at VMPL0 for communicating. GHCB pages are being allocated and set up
by that paravisor. Linux gets the GHCB page's physical address via
MSR_AMD64_SEV_ES_GHCB from the paravisor and should not change it.
Add a @set_ghcb_msr parameter to sev_es_ghcb_hv_call() to control
whether the function should set the GHCB's address prior to the call or
not and export that function for use by HyperV.
[ bp: - Massage commit message
- add a struct ghcb forward declaration to fix randconfig builds. ]
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20211025122116.264793-6-ltykernel@gmail.com
Should instruction emulation fail, include the VM exit reason, etc. in
the emulation_failure data passed to userspace, in order that the VMM
can report it as a debugging aid when describing the failure.
Suggested-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend the get_exit_info static call to provide the reason for the VM
exit. Modify relevant trace points to use this rather than extracting
the reason in the caller.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the upcoming AMX support it's necessary to do a proper integration with
KVM. Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current:🧵:fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Convert the KVM FPU code over to this new scheme.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185313.019454292@linutronix.de
For the upcoming AMX support it's necessary to do a proper integration with
KVM. Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current:🧵:fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Provide:
- An allocator which initializes the state properly
- A replacement for the existing FPU swap mechanim
Aside of the reduced memory footprint, this also makes state switching
more efficient when TIF_FPU_NEED_LOAD is set. It does not require a
memcpy as the state is already correct in the to be swapped out fpstate.
The existing interfaces will be removed once KVM is converted over.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185312.954684740@linutronix.de
For the upcoming AMX support it's necessary to do a proper integration with
KVM. To avoid more nasty hackery in KVM which violate encapsulation extend
struct fpu and fpstate so the fpstate switching can be consolidated and
simplified.
Currently KVM allocates two FPU structs which are used for saving the user
state of the vCPU thread and restoring the guest state when entering
vcpu_run() and doing the reverse operation before leaving vcpu_run().
With the new fpstate mechanism this can be reduced to one extra buffer by
swapping the fpstate pointer in current:🧵:fpu. This makes the
upcoming support for AMX and XFD simpler because then fpstate information
(features, sizes, xfd) are always consistent and it does not require any
nasty workarounds.
Add fpu::__task_fpstate to save the regular fpstate pointer while the task
is inside vcpu_run(). Add some state fields to fpstate to indicate the
nature of the state.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211022185312.896403942@linutronix.de
* Fix for instruction emulation with PKU
* fixes for rare delaying of interrupt delivery
* fix for SEV-ES buffer overflow
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFy2tsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMrKggAq6JWuFGwJY8hq9hd/8SMvJUsmtmh
ua7zKj8xi8w52yZNigCSllj3cOtpQ4pTpy9nhUBcXbGEWDNbZ9Tm6flYmvc6Hrt3
iffXBtqri3ioSvQr908f+ceOAsX8ishA1ewbMKLmathGN6+GXa3KtqVAZ2t7z3Yp
VX/I/xpViYGwhMPi5T1Yoj0SfVAEhO0ROodcGJXo2ddX/FVZTibqE/nONkXbgMP0
gibf39N7JIti3oz+puLkFUnBKcdi/jy9yUjz01Rn315QrrFEsOsPhQGLR6Q24lgg
7aarqbsoJQK6eJwNU/SxwpiZuj5lRsQVD0evkNd/JxDkGCa1T5cXUVILdg==
=+1Ow
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more x86 kvm fixes from Paolo Bonzini:
- Cache coherency fix for SEV live migration
- Fix for instruction emulation with PKU
- fixes for rare delaying of interrupt delivery
- fix for SEV-ES buffer overflow
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SEV-ES: go over the sev_pio_data buffer in multiple passes if needed
KVM: SEV-ES: keep INS functions together
KVM: x86: remove unnecessary arguments from complete_emulator_pio_in
KVM: x86: split the two parts of emulator_pio_in
KVM: SEV-ES: clean up kvm_sev_es_ins/outs
KVM: x86: leave vcpu->arch.pio.count alone in emulator_pio_in_out
KVM: SEV-ES: rename guest_ins_data to sev_pio_data
KVM: SEV: Flush cache on non-coherent systems before RECEIVE_UPDATE_DATA
KVM: MMU: Reset mmu->pkru_mask to avoid stale data
KVM: nVMX: promptly process interrupts delivered while in guest mode
KVM: x86: check for interrupts before deciding whether to exit the fast path
Compile kretprobe related stacktrace entry recovery code and
unwind_state::kr_cur field only when CONFIG_KRETPROBES=y.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that
vCPUs responding to requests can take the lock for read and run in
parallel. Using a mutex forces serialization of vCPUs even though
kvm_vcpu_update_apicv() only touches data local to that vCPU or is
protected by a different lock, e.g. SVM's ir_list_lock.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The PIO scratch buffer is larger than a single page, and therefore
it is not possible to copy it in a single step to vcpu->arch/pio_data.
Bound each call to emulator_pio_in/out to a single page; keep
track of how many I/O operations are left in vcpu->arch.sev_pio_count,
so that the operation can be restarted in the complete_userspace_io
callback.
For OUT, this means that the previous kvm_sev_es_outs implementation
becomes an iterator of the loop, and we can consume the sev_pio_data
buffer before leaving to userspace.
For IN, instead, consuming the buffer and decreasing sev_pio_count
is always done in the complete_userspace_io callback, because that
is when the memcpy is done into sev_pio_data.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will be using this field for OUTS emulation as well, in case the
data that is pushed via OUTS spans more than one page. In that case,
there will be a need to save the data pointer across exits to userspace.
So, change the name to something that refers to any kind of PIO.
Also spell out what it is used for, namely SEV-ES.
No functional change intended.
Cc: stable@vger.kernel.org
Fixes: 7ed9abfe8e ("KVM: SVM: Support string IO operations for an SEV-ES guest")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The microcode loader has been looping through __start_builtin_fw down to
__end_builtin_fw to look for possibly built-in firmware for microcode
updates.
Now that the firmware loader code has exported an API for looping
through the kernel's built-in firmware section, use it and drop the x86
implementation in favor.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211021155843.1969401-4-mcgrof@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Paul pointed out the error messages when KVM fails to load are unhelpful
in understanding exactly what went wrong if userspace probes the "wrong"
module.
Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel
and kvm_amd, and use the name for relevant error message when KVM fails
to load so that the user knows which module failed to load.
Opportunistically tweak the "disabled by bios" error message to clarify
that _support_ was disabled, not that the module itself was magically
disabled by BIOS.
Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211018183929.897461-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SDM section 18.2.3 mentioned that:
"IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
any general-purpose or fixed-function counters via a single WRMSR."
It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing,
the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop
global_ovf_ctrl variable.
Tested-by: Like Xu <likexu@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify the flags for rmaps and page tracking data, using a
single flag in struct kvm_arch and a single loop to go
over all the address spaces and memslots. This avoids
code duplication between alloc_all_memslots_rmaps and
kvm_page_track_enable_mmu_write_tracking.
Signed-off-by: David Stevens <stevensd@chromium.org>
[This patch is the delta between David's v2 and v3, with conflicts
fixed and my own commit message. - Paolo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that everything is mopped up, move all the helpers and prototypes into
the core header. They are not required by the outside.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.514095101@linutronix.de
xfeatures_mask_fpstate() is no longer valid when dynamically enabled
features come into play.
Rework restore_regs_from_fpstate() so it takes a constant mask which will
then be applied against the maximum feature set so that the restore
operation brings all features which are not in the xsave buffer xfeature
bitmap into init state.
This ensures that if the previous task used a dynamically enabled feature
that the task which restores has all unused components properly initialized.
Cleanup the last user of xfeatures_mask_fpstate() as well and remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.461348278@linutronix.de
Use the new fpu_user_cfg to retrieve the information instead of
xfeatures_mask_uabi() which will be no longer correct when dynamically
enabled features become available.
Using fpu_user_cfg is appropriate when setting XCOMP_BV in the
init_fpstate since it has space allocated for "max_features". But,
normal fpstates might only have space for default xfeatures. Since
XRSTOR* derives the format of the XSAVE buffer from XCOMP_BV, this can
lead to XRSTOR reading out of bounds.
So when copying actively used fpstate, simply read the XCOMP_BV features
bits directly out of the fpstate instead.
This correction courtesy of Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.408879849@linutronix.de
Move the feature mask storage to the kernel and user config
structs. Default and maximum feature set are the same for now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.352041752@linutronix.de
Provide a struct to store information about the maximum supported and the
default feature set and buffer sizes for both user and kernel space.
This allows quick retrieval of this information for the upcoming support
for dynamically enabled features.
[ bp: Add vertical spacing between the struct members. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211014230739.126107370@linutronix.de
DM&P devices were not being properly identified, which resulted in
unneeded Spectre/Meltdown mitigations being applied.
The manufacturer states that these devices execute always in-order and
don't support either speculative execution or branch prediction, so
they are not vulnerable to this class of attack. [1]
This is something I've personally tested by a simple timing analysis
on my Vortex86MX CPU, and can confirm it is true.
Add identification for some devices that lack the CPUID product name
call, so they appear properly on /proc/cpuinfo.
¹https://www.ssv-embedded.de/doks/infos/DMP_Ann_180108_Meltdown.pdf
[ bp: Massage commit message. ]
Signed-off-by: Marcos Del Sol Vives <marcos@orca.pet>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211017094408.1512158-1-marcos@orca.pet
Prepare for dynamically enabled states per task. The function needs to
retrieve the features and sizes which are valid in a fpstate
context. Retrieve them from fpstate.
Move the function declarations to the core header as they are not
required anywhere else.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145323.233529986@linutronix.de
Add state size and feature mask information to the fpstate container. This
will be used for runtime checks with the upcoming support for dynamically
enabled features and dynamically sized buffers. That avoids conditionals
all over the place as the required information is accessible for both
default and extended buffers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.921388806@linutronix.de
In preparation for dynamically enabled FPU features move the function
out of line as the goal is to expose less and not more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.869001791@linutronix.de
We don't need special hook for graph tracer entry point,
but instead we can use graph_ops::func function to install
the return_hooker.
This moves the graph tracing setup _before_ the direct
trampoline prepares the stack, so the return_hooker will
be called when the direct trampoline is finished.
This simplifies the code, because we don't need to take into
account the direct trampoline setup when preparing the graph
tracer hooker and we can allow function graph tracer on entries
registered with direct trampoline.
Link: https://lkml.kernel.org/r/20211008091336.33616-4-jolsa@kernel.org
[fixed compile error reported by kernel test robot <lkp@intel.com>]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Convert the rest of the core code to the new register storage mechanism in
preparation for dynamically sized buffers.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.659456185@linutronix.de
Convert FPU tracing code to the new register storage mechanism in
preparation for dynamically sized buffers.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.503327333@linutronix.de
In order to prepare for the support of dynamically enabled FPU features,
move the clearing of xstate components to the FPU core code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211013145322.399567049@linutronix.de
Convert restore_fpregs_from_fpstate() and related code to the new
register storage mechanism in preparation for dynamically sized buffers.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.347395546@linutronix.de
New xfeatures will not longer be automatically stored in the regular XSAVE
buffer in thread_struct::fpu.
The kernel will provide the default sized buffer for storing the regular
features up to AVX512 in thread_struct::fpu and if a task requests to use
one of the new features then the register storage has to be extended.
The state will be accessed via a pointer in thread_struct::fpu which
defaults to the builtin storage and can be switched when extended storage
is required.
To avoid conditionals all over the code, create a new container for the
register storage which will gain other information, e.g. size, feature
masks etc., later. For now it just contains the register storage, which
gives it exactly the same layout as the exiting fpu::state.
Stick fpu::state and the new fpu::__fpstate into an anonymous union and
initialize the pointer. Add build time checks to validate that both are
at the same place and have the same size.
This allows step by step conversion of all users.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211013145322.234458659@linutronix.de
Similar to the copy from user function the FPU core has this already
implemented with all bells and whistles.
Get rid of the duplicated code and use the core functionality.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.244101845@linutronix.de
To make upcoming changes for support of dynamically enabled features
simpler, provide a proper function for the exception handler which removes
exposure of FPU internals.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.053515012@linutronix.de
Now that the file is empty, fixup all references with the proper includes
and delete the former kitchen sink.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011540.001197214@linutronix.de
In order to remove internal.h make signal.h independent of it.
Include asm/fpu/xstate.h to fix a missing update_regset_xstate_info()
prototype, which is
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.844565975@linutronix.de
Move function declarations which need to be globally available to api.h
where they belong.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.792363754@linutronix.de
Only used internally in the FPU core code.
While at it, convert to the percpu accessors which verify preemption is
disabled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.686806639@linutronix.de
Further disintegration of internal.h:
Move the CPU feature tests to a core header and remove the unused one.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.401510559@linutronix.de
internal.h is a kitchen sink which needs to get out of the way to prepare
for the upcoming changes.
Move the context switch and exit to user inlines into a separate header,
which is all that code needs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011539.349132461@linutronix.de
Copying a user space buffer to the memory buffer is already available in
the FPU core. The copy mechanism in KVM lacks sanity checks and needs to
use cpuid() to lookup the offset of each component, while the FPU core has
this information cached.
Make the FPU core variant accessible for KVM and replace the home brewed
mechanism.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.134065207@linutronix.de
Swapping the host/guest FPU is directly fiddling with FPU internals which
requires 5 exports. The upcoming support of dynamically enabled states
would even need more.
Implement a swap function in the FPU core code and export that instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/20211015011539.076072399@linutronix.de
No point in having this duplicated all over the place with needlessly
different defines.
Provide a proper initialization function which initializes user buffers
properly and make KVM use it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.897664678@linutronix.de
PKRU code does not need anything from FPU headers. Include cpufeature.h
instead and fixup the resulting fallout in perf.
This is a preparation for FPU changes in order to prevent recursive include
hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20211015011538.551522694@linutronix.de
To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.
A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-8-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Protect the reference point for kvmclock with a seqcount, so that
kvmclock updates for all vCPUs can proceed in parallel. Xen runstate
updates will also run in parallel and not bounce the kvmclock cacheline.
Of the variables that were protected by pvclock_gtod_sync_lock,
nr_vcpus_matched_tsc is different because it is updated outside
pvclock_update_vm_gtod_copy and read inside it. Therefore, we
need to keep it protected by a spinlock. In fact it must now
be a raw spinlock, because pvclock_update_vm_gtod_copy, being the
write-side of a seqcount, is non-preemptible. Since we already
have tsc_write_lock which is a raw spinlock, we can just use
tsc_write_lock as the lock that protects the write-side of the
seqcount.
Co-developed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling the migration of TSCs correctly is difficult, in part because
Linux does not provide userspace with the ability to retrieve a (TSC,
realtime) clock pair for a single instant in time. In lieu of a more
convenient facility, KVM can report similar information in the kvm_clock
structure.
Provide userspace with a host TSC & realtime pair iff the realtime clock
is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
realtime value, advance the KVM clock by the amount of elapsed time. Do
not step the KVM clock backwards, though, as it is a monotonic
oscillator.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
update_pasid() and its call chain are currently unused in the tree because
Thomas disabled the ENQCMD feature. The feature will be re-enabled shortly
using a different approach and update_pasid() and its call chain will not
be used in the new approach.
Remove the useless functions.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210920192349.2602141-1-fenghua.yu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211014053839.727419-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Resolve the conflict between these commits:
x86/fpu: 1193f408cd ("x86/fpu/signal: Change return type of __fpu_restore_sig() to boolean")
x86/urgent: d298b03506 ("x86/fpu: Restore the masking out of reserved MXCSR bits")
b2381acd3f ("x86/fpu: Mask out the invalid MXCSR bits properly")
Conflicts:
arch/x86/kernel/fpu/signal.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
shared among a cluster of cores instead of being exclusive to one
single core.
To prevent oversubscription of L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmark such as SPECrate mcf test, this change provides a boost
to performance especially on medium load system on Jacobsville. on a
Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
cores each, the benchmark number is as follow:
Improvement over baseline kernel for mcf_r
copies run time base rate
1 -0.1% -0.2%
6 25.1% 25.1%
12 18.8% 19.0%
24 0.3% 0.3%
So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain for the 6 and 12 copies case. With the extra
domain for cluster, the load does get evened out between the clusters.
Note this patch isn't an universal win as spreading isn't necessarily
a win, particually for those workload who can benefit from packing.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
Having a stable wchan means the process must be blocked and for it to
stay that way while performing stack unwinding.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
out due to histerical reasons and 64-bit kernels reject them
- A fix to clear X86_FEATURE_SMAP when support for is not config-enabled
- Three fixes correcting misspelled Kconfig symbols used in code
- Two resctrl object cleanup fixes
- Yet another attempt at fixing the neverending saga of botched x86
timers, this time because some incredibly smart hardware decides to turn
off the HPET timer in a low power state - who cares if the OS is relying
on it...
- Check the full return value range of an SEV VMGEXIT call to determine
whether it returned an error
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmFiqboACgkQEsHwGGHe
VUpIFA//cLWfa1vvamCcLjW0lruQVzHrZesO4Cbti3Fyp2at/Dtwt9w/uZPu9NAa
+sreJBdrkZfo9lmKW6/E1MmLT/YlLg8YHsylKn9d+XSdcy0qWXLYdVVm7bb4teJf
XxRQfYNQrwfpjFNnt+7NUcaqte2zUo7K16CctJF5+E6SGUn+hlu6zK15tf6MMAM1
TFHsQWEuRW5Mgc7eD734cNGDOJgzvb4IACn5BNfKR1+jD1ANfutytXjGqcveJ/sg
lBoWMCU47vo5/uoW516oBK6PfQ/+s1OvYAx2G4DMQSC7WpEWpxnJUoszj9umu+jE
VndS8jQ4WGXcVmfkkwUHbVxcJzsPEzZ/5m+nER9hrGOykKWTajzi2MirBHju5EKv
xfYLqEJHNG9YulxKy2wIW0VRmXDE3wFZfaPAmQbLXud1KfzlC/EpEaloZSJSgqyG
L4uOKk8CBumYJzgVCfTFAqqr1HhmeylYSxHmOUEzTm0sEJX2HuodGcl+sPI/LDPW
MkjVYXq2sOUEVLmk5xyJIkbAUcK2X/Fzt3rKS4CVsjfzWRW67o1oopMy6ZrQ0o/h
Dt/fHub/+Pke5sbB2+RiRsvq3aDftRkvaZK05pTiqlE9gFlKaCVwxDQqvmTnY0oa
PkPzauXRC4qjNsdDMGHaiclm/fk/nlLM9vxXGJ+oTXP6snC4OhQ=
=kKOw
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.15_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- A FPU fix to properly handle invalid MXCSR values: 32-bit masks them
out due to historical reasons and 64-bit kernels reject them
- A fix to clear X86_FEATURE_SMAP when support for is not
config-enabled
- Three fixes correcting misspelled Kconfig symbols used in code
- Two resctrl object cleanup fixes
- Yet another attempt at fixing the neverending saga of botched x86
timers, this time because some incredibly smart hardware decides to
turn off the HPET timer in a low power state - who cares if the OS is
relying on it...
- Check the full return value range of an SEV VMGEXIT call to determine
whether it returned an error
* tag 'x86_urgent_for_v5.15_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Restore the masking out of reserved MXCSR bits
x86/Kconfig: Correct reference to MWINCHIP3D
x86/platform/olpc: Correct ifdef symbol to intended CONFIG_OLPC_XO15_SCI
x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=n
x86/entry: Correct reference to intended CONFIG_64_BIT
x86/resctrl: Fix kfree() of the wrong type in domain_add_cpu()
x86/resctrl: Free the ctrlval arrays when domain_setup_mon_state() fails
x86/hpet: Use another crystalball to evaluate HPET usability
x86/sev: Return an error on a returned non-zero SW_EXITINFO1[31:0]
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYWBSIwAKCRCAXGG7T9hj
vrXxAP9na1EqRJ+SpWyvxHY1jMaIrbg1bgnOc+GsnWxU5liW5AEA4h1HjHtVtrzL
3vweIS6u2fanrWlYML/daQ3r6EuLPQc=
=iXsP
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.15b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- fix two minor issues in the Xen privcmd driver plus a cleanup patch
for that driver
- fix multiple issues related to running as PVH guest and some related
earlyprintk fixes for other Xen guest types
- fix an issue introduced in 5.15 the Xen balloon driver
* tag 'for-linus-5.15b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: fix cancelled balloon action
xen/x86: adjust data placement
x86/PVH: adjust function/data placement
xen/x86: hook up xen_banner() also for PVH
xen/x86: generalize preferred console model from PV to PVH Dom0
xen/x86: make "earlyprintk=xen" work for HVM/PVH DomU
xen/x86: allow "earlyprintk=xen" to work for PV Dom0
xen/x86: make "earlyprintk=xen" work better for PVH Dom0
xen/x86: allow PVH Dom0 without XEN_PV=y
xen/x86: prevent PVH type from getting clobbered
xen/privcmd: drop "pages" parameter from xen_remap_pfn()
xen/privcmd: fix error handling in mmap-resource processing
xen/privcmd: replace kcalloc() by kvcalloc() when allocating empty pages
Export smca_get_bank_type for use in the AMD GPU
driver to determine MCA bank while handling correctable
and uncorrectable errors in GPU UMC.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
[ bp: Steal and adapt Tom's commit message. ]
Fixes: 7fae4c24a2 ("x86: Increase exception stack sizes")
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic
Commit in Fixes adds a condition with IS_ENABLED(CONFIG_64_BIT),
but the intended config item is called CONFIG_64BIT, as defined in
arch/x86/Kconfig.
Fortunately, scripts/checkkconfigsymbols.py warns:
64_BIT
Referencing files: arch/x86/include/asm/entry-common.h
Correct the reference to the intended config symbol.
Fixes: 662a022189 ("x86/entry: Fix AC assertion")
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20210803113531.30720-2-lukas.bulwahn@gmail.com
Decouple XEN_DOM0 from XEN_PV, converting some existing uses of XEN_DOM0
to a new XEN_PV_DOM0. (I'm not convinced all are really / should really
be PV-specific, but for starters I've tried to be conservative.)
For PVH Dom0 the hypervisor populates MADT with only x2APIC entries, so
without x2APIC support enabled in the kernel things aren't going to work
very well. (As opposed, DomU-s would only ever see LAPIC entries in MADT
as of now.) Note that this then requires PVH Dom0 to be 64-bit, as
X86_X2APIC depends on X86_64.
In the course of this xen_running_on_version_or_later() needs to be
available more broadly. Move it from a PV-specific to a generic file,
considering that what it does isn't really PV-specific at all anyway.
Note that xen/interface/version.h cannot be included on its own; in
enlighten.c, which uses SCHEDOP_* anyway, include xen/interface/sched.h
first to resolve the apparently sole missing type (xen_ulong_t).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/983bb72f-53df-b6af-14bd-5e088bd06a08@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Replace uses of mem_encrypt_active() with calls to cc_platform_has() with
the CC_ATTR_MEM_ENCRYPT attribute.
Remove the implementation of mem_encrypt_active() across all arches.
For s390, since the default implementation of the cc_platform_has()
matches the s390 implementation of mem_encrypt_active(), cc_platform_has()
does not need to be implemented in s390 (the config option
ARCH_HAS_CC_PLATFORM is not set).
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-9-bp@alien8.de
Replace uses of sev_es_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_STATE_ENCRYPT. If future support is added for other
memory encyrption techonologies, the use of CC_ATTR_GUEST_STATE_ENCRYPT
can be updated, as required.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-8-bp@alien8.de
Replace uses of sev_active() with the more generic cc_platform_has()
using CC_ATTR_GUEST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_GUEST_MEM_ENCRYPT
can be updated, as required.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-7-bp@alien8.de
Replace uses of sme_active() with the more generic cc_platform_has()
using CC_ATTR_HOST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_HOST_MEM_ENCRYPT
can be updated, as required.
This also replaces two usages of sev_active() that are really geared
towards detecting if SME is active.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-6-bp@alien8.de
Introduce an x86 version of the cc_platform_has() function. This will be
used to replace vendor specific calls like sme_active(), sev_active(),
etc.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de
In preparation for other uses of the cc_platform_has() function
besides AMD's memory encryption support, selectively build the
AMD memory encryption architecture override functions only when
CONFIG_AMD_MEM_ENCRYPT=y. These functions are:
- early_memremap_pgprot_adjust()
- arch_memremap_can_ram_remap()
Additionally, routines that are only invoked by these architecture
override functions can also be conditionally built. These functions are:
- memremap_should_map_decrypted()
- memremap_is_efi_data()
- memremap_is_setup_data()
- early_memremap_is_setup_data()
And finally, phys_mem_access_encrypted() is conditionally built as well,
but requires a static inline version of it when CONFIG_AMD_MEM_ENCRYPT is
not set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-2-bp@alien8.de
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFXQUoUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMglgf/egh3zb9/+BUQWe0xWfhcINNzpsVk
PJtiBmJc3nQLbZbTSLp63rouy1lNgR0s2DiMwP7G1u39OwW8W3LHMrBUSqF1F01+
gntb4GGiRTiTPJI64K4z6ytORd3tuRarHq8TUIa2zvki9ZW5Obgkm1i1RsNMOo+s
AOA7whhpS8e/a5fBbtbS9bTZb30PKTZmbW4oMjvO9Sw4Eb76IauqPSEtRPSuCAc7
r7z62RTlm10Qk0JR3tW1iXMxTJHZk+tYPJ8pclUAWVX5bZqWa/9k8R0Z5i/miFiZ
glW/y3R4+aUwIQV2v7V3Jx9MOKDhZxniMtnqZG/Hp9NVDtWIz37V/U37vw==
=zQQ1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
"Small x86 fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: Ensure all migrations are performed when test is affined
KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
x86/kvmclock: Move this_cpu_pvti into kvmclock.h
selftests: KVM: Don't clobber XMM register when read
KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
Avoid allocating the gfn_track arrays if nothing needs them. If there
are no external to KVM users of the API (i.e. no GVT-g), then page
tracking is only needed for shadow page tables. This means that when tdp
is enabled and there are no external users, then the gfn_track arrays
can be lazily allocated when the shadow MMU is actually used. This avoid
allocations equal to .05% of guest memory when nested virtualization is
not used, if the kernel is compiled without GVT-g.
Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210922045859.2011227-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
By switching from kfree() to kvfree() in kvm_arch_free_vm() Arm64 can
use the common variant. This can be accomplished by adding another
macro __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now.
Further simplification can be achieved by adding __kvm_arch_free_vm()
doing the common part.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-Id: <20210903130808.30142-5-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_try_to_unsync_pages checks if page tracking is active for the given
gfn, which requires knowing the memslot. We can pass down the memslot
via make_spte to avoid this lookup.
The memslot is also handy for make_spte's marking of the gfn as dirty:
we can test whether dirty page tracking is enabled, and if so ensure that
pages are mapped as writable with 4K granularity. Apart from the warning,
no functional change is intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-7-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_page_fault has a pointer to the memslot it can be passed
down to the page tracking code to avoid a redundant slot lookup.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20210813203504.2742757-5-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass struct kvm_page_fault to mmu->page_fault() instead of
extracting the arguments from the struct. FNAME(page_fault) can use
the precomputed bools from the error code.
Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Updates to the kvmclock parameters needs to do a complicated dance of
KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking
pvclock_gtod_sync_lock. Place that in two functions that can be called
on all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the kretprobe replaces the function return address with
the kretprobe_trampoline on the stack, x86 unwinders can not
continue the stack unwinding at that point, or record
kretprobe_trampoline instead of correct return address.
To fix this issue, find the correct return address from task's
kretprobe_instances as like as function-graph tracer does.
With this fix, the unwinder can correctly unwind the stack
from kretprobe event on x86, as below.
<...>-135 [003] ...1 6.722338: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read)
<...>-135 [003] ...1 6.722377: <stack trace>
=> kretprobe_trace_func+0x209/0x2f0
=> kretprobe_dispatcher+0x4a/0x70
=> __kretprobe_trampoline_handler+0xca/0x150
=> trampoline_handler+0x44/0x70
=> kretprobe_trampoline+0x2a/0x50
=> vfs_read+0xab/0x1a0
=> ksys_read+0x5f/0xe0
=> do_syscall_64+0x33/0x40
=> entry_SYSCALL_64_after_hwframe+0x44/0xae
Link: https://lkml.kernel.org/r/163163055130.489837.5161749078833497255.stgit@devnote2
Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add UNWIND_HINT_FUNC on __kretprobe_trampoline() code so that ORC
information is generated on the __kretprobe_trampoline() correctly.
Also, this uses STACK_FRAME_NON_STANDARD_FP(), CONFIG_FRAME_POINTER-
-specific version of STACK_FRAME_NON_STANDARD().
Link: https://lkml.kernel.org/r/163163049242.489837.11970969750993364293.stgit@devnote2
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:
dereference_kernel_function_descriptor(kretprobe_trampoline)
In fact we had bugs in arch code where it wasn't replicated correctly.
So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.
Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The CPU field will be moved back into thread_info even when
THREAD_INFO_IN_TASK is enabled, so add it back to x86's definition of
struct thread_info.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
incorrect use. Reading guest memory straight from an L2 GPA is extremely
rare (as evidenced by the lack of users), as very few constructs in x86
specify physical addresses, even fewer are virtualized by KVM, and even
fewer yet require emulation of L2 by L0 KVM.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the
number of allowed vcpu-ids. This has already led to confusion, so
rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more
clear
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-3-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_make_vcpus_request_mask() already disables preemption so just like
kvm_make_all_cpus_request_except() it can be switched to using
pre-allocated per-cpu cpumasks. This allows for improvements for both
users of the function: in Hyper-V emulation code 'tlb_flush' can now be
dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
gets rid of dynamic allocation.
cpumask_available() checks in kvm_make_vcpu_request() and
kvm_kick_many_cpus() can now be dropped as they checks for an impossible
condition: kvm_init() makes sure per-cpu masks are allocated.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There're other modules might use hv_clock_per_cpu variable like ptp_kvm,
so move it into kvmclock.h and export the symbol to make it visiable to
other modules.
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- missing TLB flush
- nested virtualization fixes for SMM (secure boot on nested hypervisor)
and other nested SVM fixes
- syscall fuzzing fixes
- live migration fix for AMD SEV
- mirror VMs now work for SEV-ES too
- fixes for reset
- possible out-of-bounds access in IOAPIC emulation
- fix enlightened VMCS on Windows 2022
ARM:
- Add missing FORCE target when building the EL2 object
- Fix a PMU probe regression on some platforms
Generic:
- KCSAN fixes
selftests:
- random fixes, mostly for clang compilation
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmFN0EwUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNqaQf/Vx7ePFTqwWpo+8wKapnc6JN9SLjC
hM4jipxfc1WyQWcfCt8ZuPhCnhF7o8mG/mrqTm+JB+oGqIsydHW19DiUT8ekv09F
dQ+XYSiR4B547wUH5XLQc4xG9imwYlXGEOHqrE7eJvGH3LOqVFX2fLRBnFefZbO8
GKhRJrGXwG3/JSAP6A0c22iVU+pLbfV9gpKwrAj0V7o8nzT2b3Wmh74WBNb47BzE
a4+AwKpWO4rqJGOwdYwy67pdFHh1YmrlZ59cFZc7fzlXE+o0D0bitaJyioZALpOl
4mRGdzoYkNB++ZjDzVFnAClCYQV/oNxCNGFaFF2mh/gzXG1TLmN7B8zGDg==
=7oVh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"A bit late... I got sidetracked by back-from-vacation routines and
conferences. But most of these patches are already a few weeks old and
things look more calm on the mailing list than what this pull request
would suggest.
x86:
- missing TLB flush
- nested virtualization fixes for SMM (secure boot on nested
hypervisor) and other nested SVM fixes
- syscall fuzzing fixes
- live migration fix for AMD SEV
- mirror VMs now work for SEV-ES too
- fixes for reset
- possible out-of-bounds access in IOAPIC emulation
- fix enlightened VMCS on Windows 2022
ARM:
- Add missing FORCE target when building the EL2 object
- Fix a PMU probe regression on some platforms
Generic:
- KCSAN fixes
selftests:
- random fixes, mostly for clang compilation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
selftests: KVM: Explicitly use movq to read xmm registers
selftests: KVM: Call ucall_init when setting up in rseq_test
KVM: Remove tlbs_dirty
KVM: X86: Synchronize the shadow pagetable before link it
KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
KVM: x86: nSVM: don't copy virt_ext from vmcb12
KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
KVM: x86: nSVM: restore int_vector in svm_clear_vintr
kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
KVM: x86: VMX: synthesize invalid VM exit when emulating invalid guest state
KVM: x86: nSVM: refactor svm_leave_smm and smm_enter_smm
KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode
KVM: x86: reset pdptrs_from_userspace when exiting smm
KVM: x86: nSVM: restore the L1 host state prior to resuming nested guest on SMM exit
KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUs
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
...
PREEMPT_RT preempts softirqs and the current implementation avoids
do_softirq_own_stack() and only uses __do_softirq().
Disable the unused softirqs stacks on PREEMPT_RT to safe some memory and
ensure that do_softirq_own_stack() is not used which is not expected.
[ bigeasy: commit description. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210924161245.2357247-1-bigeasy@linutronix.de
- Prevent sending the wrong signal when protection keys are enabled and
the kernel handles a fault in the vsyscall emulation.
- Invoke early_reserve_memory() before invoking e820_memory_setup() which
is required to make the Xen dom0 e820 hooks work correctly.
- Use the correct data type for the SETZ operand in the EMQCMDS
instruction wrapper.
- Prevent undefined behaviour to the potential unaligned accesss in the
instroction decoder library.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmFQQaITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaZjD/0TF0mE8QUhI4tyGELdNgwvje5iZ9vg
Nd9KJpR4hUALHgfUD44NVl9JWawFY2d8FXyIPoAFEcvmy6o4f1w0ia8US3hQWA0Y
EdLSigXi/eYSstkONaJUEBCxlLbwy7JDzaazA9DeKOEuRc7NWSyZURYvzTAkPK1Y
mbE9kjKhjFa5NGnSB8HbSF2yEzFsKaTo4nreWP/OkzDjnEMshLR1/FUOUvZmlsgA
CWjMxAVYFqeJN3QhDgR/vRKPoz1sOjDL1s4AsU+xdy63WyFJZ7Z1b8t6bOBoYh6w
UztkuOkzZ6pIdzz4O1WGoFx4/FJ74qNx0vO/hOB+cKH6rgJs6AkHAvwlnjI/fE2C
Y+IsuE4PBXMRpkaayTCsAq/enabwgKsmLSUu916APrhVvuUtb3GJgyhedLE3mEBw
yZXezzRDhNpYop2yQSRXDeKebpoQgl+zqEP5g1O8pAFnud8FGHnz64eJV7Su7Y7C
BCac0hmv+drlqb/jOSYqjsfo6QfhvR60WwDIgTplOMMLa3plEJFx/rIuU2xVg5g9
w0m2QUsZboyT2yBnl8gRrqrcQmv2t4iX6TAj9Wm23Lx41h94JQMRtZyJT9bcNqY9
jMJu27BcNSveciZA7W2DVUlFf/gTF3bwpF7ZDWRt/VSrHPtkI9WKlERhQaywo1L0
rF8SGCEuNU2ktw==
=h7v1
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for X86:
- Prevent sending the wrong signal when protection keys are enabled
and the kernel handles a fault in the vsyscall emulation.
- Invoke early_reserve_memory() before invoking e820_memory_setup()
which is required to make the Xen dom0 e820 hooks work correctly.
- Use the correct data type for the SETZ operand in the EMQCMDS
instruction wrapper.
- Prevent undefined behaviour to the potential unaligned accesss in
the instruction decoder library"
* tag 'x86-urgent-2021-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn, tools/x86: Fix undefined behavior due to potential unaligned accesses
x86/asm: Fix SETZ size enqcmds() build failure
x86/setup: Call early_reserve_memory() earlier
x86/fault: Fix wrong signal when vsyscall fails with pkey
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYU8xegAKCRCAXGG7T9hj
vvk2APwM85dFSMNHQo5+S35X+h+M8uKuqqvPYxVtKqQEwu3LXAEAg0cVgr1lWegI
X98f07/+M0rPJl24kgW/EIxAD9fsWAw=
=JILz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.15b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Some minor cleanups and fixes of some theoretical bugs, as well as a
fix of a bug introduced in 5.15-rc1"
* tag 'for-linus-5.15b-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/x86: fix PV trap handling on secondary processors
xen/balloon: fix balloon kthread freezing
swiotlb-xen: this is PV-only on x86
xen/pci-swiotlb: reduce visibility of symbols
PCI: only build xen-pcifront in PV-enabled environments
swiotlb-xen: ensure to issue well-formed XENMEM_exchange requests
Xen/gntdev: don't ignore kernel unmapping error
xen/x86: drop redundant zeroing from cpu_initialize_context()
Get rid of the indirect function pointer and use flags settings instead
to steer execution.
Now that it is not an indirect call any longer, drop the instrumentation
annotation for objtool too.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-3-bp@alien8.de
Turn it into a normal function which calls an AMD- or Intel-specific
variant depending on the CPU it runs on.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-2-bp@alien8.de
When building under GCC 4.9 and 5.5:
arch/x86/include/asm/special_insns.h: Assembler messages:
arch/x86/include/asm/special_insns.h:286: Error: operand size mismatch for `setz'
Change the type to "bool" for condition code arguments, as documented.
Fixes: 7f5933f81b ("x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction")
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910223332.3224851-1-keescook@chromium.org
Check the return of init_srcu_struct(), which can fail due to OOM, when
initializing the page track mechanism. Lack of checking leads to a NULL
pointer deref found by a modified syzkaller.
Reported-by: TCS Robot <tcs_robot@tencent.com>
Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
[Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It turns out that a single page of stack is trivial to overflow with
all the tracing gunk enabled. Raise the exception stacks to 2 pages,
which is still half the interrupt stacks, which are at 4 pages.
Reported-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YUIO9Ye98S5Eb68w@hirez.programming.kicks-ass.net
Current code has an explicit check for hitting the task stack guard;
but overflowing any of the other stacks will get you a non-descript
general #DF warning.
Improve matters by using get_stack_info_noinstr() to detetrmine if and
which stack guard page got hit, enabling a better stack warning.
In specific, Michael Wang reported what turned out to be an NMI
exception stack overflow, which is now clearly reported as such:
[] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)
Reported-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Wang <yun.wang@linux.alibaba.com>
Link: https://lkml.kernel.org/r/YUTE/NuqnaWbST8n@hirez.programming.kicks-ass.net
Since commit c8137ace56 ("x86/iopl: Restrict iopl() permission
scope") it's possible to emulate iopl(3) using ioperm(), except for
the CLI/STI usage.
Userspace CLI/STI usage is very dubious (read broken), since any
exception taken during that window can lead to rescheduling anyway (or
worse). The IOPL(2) manpage even states that usage of CLI/STI is highly
discouraged and might even crash the system.
Of course, that won't stop people and HP has the dubious honour of
being the first vendor to be found using this in their hp-health
package.
In order to enable this 'software' to still 'work', have the #GP treat
the CLI/STI instructions as NOPs when iopl(3). Warn the user that
their program is doing dubious things.
Fixes: a24ca99768 ("x86/iopl: Remove legacy IOPL option")
Reported-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org # v5.5+
Link: https://lkml.kernel.org/r/20210918090641.GD5106@worktop.programming.kicks-ass.net
The function __bad_area_nosemaphore() calls kernelmode_fixup_or_oops()
with the parameter @signal being actually @pkey, which will send a
signal numbered with the argument in @pkey.
This bug can be triggered when the kernel fails to access user-given
memory pages that are protected by a pkey, so it can go down the
do_user_addr_fault() path and pass the !user_mode() check in
__bad_area_nosemaphore().
Most cases will simply run the kernel fixup code to make an -EFAULT. But
when another condition current->thread.sig_on_uaccess_err is met, which
is only used to emulate vsyscall, the kernel will generate the wrong
signal.
Add a new parameter @pkey to kernelmode_fixup_or_oops() to fix this.
[ bp: Massage commit message, fix build error as reported by the 0day
bot: https://lkml.kernel.org/r/202109202245.APvuT8BX-lkp@intel.com ]
Fixes: 5042d40a26 ("x86/fault: Bypass no_context() for implicit kernel faults from usermode")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jiashuo Liang <liangjs@pku.edu.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20210730030152.249106-1-liangjs@pku.edu.cn
xen_swiotlb and pci_xen_swiotlb_init() are only used within the file
defining them, so make them static and remove the stubs. Otoh
pci_xen_swiotlb_detect() has a use (as function pointer) from the main
pci-swiotlb.c file - convert its stub to a #define to NULL.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/aef5fc33-9c02-4df0-906a-5c813142e13c@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
None of the call sites cares about the error code. All they need to know is
whether the function succeeded or not.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.909065931@linutronix.de
None of the call sites cares about the actual return code. Change the
return type to boolean and return 'true' on success.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.736773588@linutronix.de
There is no reason to have the header zeroing in the pagefault disabled
region. Do it upfront once.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.621674721@linutronix.de
gcc will sometimes manifest the address of boot_cpu_data in a register
as part of constant propagation. When multiple static_cpu_has() are used
this may foul the mainline code with a register load which will only be
used on the fallback path, which is unused after initialization.
Explicitly force gcc to use immediate (rip-relative) addressing for
the fallback path, thus removing any possible register use from
static_cpu_has().
While making changes, modernize the code to use
.pushsection...popsection instead of .section...previous.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910195910.2542662-4-hpa@zytor.com
Add a macro _ASM_RIP() to add a (%rip) suffix on 64 bits only. This is
useful for immediate memory references where one doesn't want gcc
to possibly use a register indirection as it may in the case of an "m"
constraint.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210910195910.2542662-3-hpa@zytor.com
Commit 865c50e1d2 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
added an optimised version of __get_user_asm() for x86 using 'asm goto'.
Like the non-optimised code, the 32-bit implementation of 64-bit
get_user() expands to a pair of 32-bit accesses. Unlike the
non-optimised code, the _original_ pointer is incremented to copy the
high word instead of loading through a new pointer explicitly
constructed to point at a 32-bit type. Consequently, if the pointer
points at a 64-bit type then we end up loading the wrong data for the
upper 32-bits.
This was observed as a mount() failure in Android targeting i686 after
b0cfcdd9b9 ("d_path: make 'prepend()' fill up the buffer exactly on
overflow") because the call to copy_from_kernel_nofault() from
prepend_copy() ends up in __get_kernel_nofault() and casts the source
pointer to a 'u64 __user *'. An attempt to mount at "/debug_ramdisk"
therefore ends up failing trying to mount "/debumdismdisk".
Use the existing '__gu_ptr' source pointer to unsigned int for 32-bit
__get_user_asm_u64() instead of the original pointer.
Cc: Bill Wendling <morbo@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 865c50e1d2 ("x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT")
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
FPU restore from a signal frame can trigger various exceptions. The
exceptions are caught with an exception table entry. The handler of this
entry stores the trap number in EAX. The FPU specific fixup negates that
trap number to convert it into an negative error code.
Any other exception than #PF is fatal and recovery is not possible. This
relies on the fact that the #PF exception number is the same as EFAULT, but
that's not really obvious.
Remove the negation from the exception fixup as it really has no value and
check for X86_TRAP_PF at the call site.
There is still confusion due to the return code conversion for the error
case which will be cleaned up separately.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.506192488@linutronix.de
The macros used for restoring FPU state from a user space buffer can handle
all exceptions including #MC. They need to return the trap number in the
error case as the code which invokes them needs to distinguish the cause of
the failure. It aborts the operation for anything except #PF.
Use the new EX_TYPE_FAULT_MCE_SAFE exception table fixup type to document
the nature of the fixup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.387464538@linutronix.de
Provide exception fixup types which can be used to identify fixups which
allow in kernel #MC recovery and make them invoke the existing handlers.
These will be used at places where #MC recovery is handled correctly by the
caller.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.269689153@linutronix.de
The exception table entries contain the instruction address, the fixup
address and the handler address. All addresses are relative. Storing the
handler address has a few downsides:
1) Most handlers need to be exported
2) Handlers can be defined everywhere and there is no overview about the
handler types
3) MCE needs to check the handler type to decide whether an in kernel #MC
can be recovered. The functionality of the handler itself is not in any
way special, but for these checks there need to be separate functions
which in the worst case have to be exported.
Some of these 'recoverable' exception fixups are pretty obscure and
just reuse some other handler to spare code. That obfuscates e.g. the
#MC safe copy functions. Cleaning that up would require more handlers
and exports
Rework the exception fixup mechanics by storing a fixup type number instead
of the handler address and invoke the proper handler for each fixup
type. Also teach the extable sort to leave the type field alone.
This makes most handlers static except for special cases like the MCE
MSR fixup and the BPF fixup. This allows to add more types for cleaning up
the obscure places without adding more handler code and exports.
There is a marginal code size reduction for a production config and it
removes _eight_ exported symbols.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
No point in defining the identical macros twice depending on C or assembly
mode. They are still identical.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.023659534@linutronix.de
All users of compat_alloc_user_space() and copy_in_user() have been
removed from the kernel, only a few functions in sparc remain that can be
changed to calling arch_copy_in_user() instead.
Link: https://lkml.kernel.org/r/20210727144859.4150043-7-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as they
do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is disabled
- bugfixes and cleanups, especially with respect to 1) vCPU reset and
2) choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmE2CIAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMyqwf+Ky2WoThuQ9Ra0r/m8pUTAx5+gsAf
MmG24rNLE+26X0xuBT9Q5+etYYRLrRTWJvo5cgHooz7muAYW6scR+ho5xzvLTAxi
DAuoijkXsSdGoFCp0OMUHiwG3cgY5N7feTEwLPAb2i6xr/l6SZyCP4zcwiiQbJ2s
UUD0i3rEoNQ02/hOEveud/ENxzUli9cmmgHKXR3kNgsJClSf1fcuLnhg+7EGMhK9
+c2V+hde5y0gmEairQWm22MLMRolNZ5NL4kjykiNh2M5q9YvbHe5+f/JmENlNZMT
bsUQT6Ry1ukuJ0V59rZvUw71KknPFzZ3d6HgW4pwytMq6EJKiISHzRbVnQ==
=FCAB
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual
PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as
they do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized
LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is
disabled
- bugfixes and cleanups, especially with respect to vCPU reset and
choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless
necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
KVM: Drop unused kvm_dirty_gfn_invalid()
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
KVM: MMU: mark role_regs and role accessors as maybe unused
KVM: MIPS: Remove a "set but not used" variable
x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
KVM: stats: Add VM stat for remote tlb flush requests
KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
kvm: x86: Increase MAX_VCPUS to 1024
kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
KVM: s390: index kvm->arch.idle_mask by vcpu_idx
KVM: s390: Enable specification exception interpretation
KVM: arm64: Trim guest debug exception handling
KVM: SVM: Add 5-level page table support for SVM
...
Support for 710 VCPUs was tested by Red Hat since RHEL-8.4,
so increase KVM_SOFT_MAX_VCPUS to 710.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-4-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs.
I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it
might involve complicated questions around the meaning of
"supported" and "recommended" in the upstream tree.
KVM_SOFT_MAX_VCPUS will be changed in a separate patch.
For reference, visible effects of this change are:
- KVM_CAP_MAX_VCPUS will now return 1024 (of course)
- Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (00x40000005)].EAX
will now be 1024
- KVM_MAX_VCPU_ID will change from 1151 to 4096
- Size of struct kvm will increase from 19328 to 22272 bytes
(in x86_64)
- Size of struct kvm_ioapic will increase from 1780 to 5084 bytes
(in x86_64)
- Bitmap stack variables that will grow:
- At kvm_hv_flush_tlb() kvm_hv_send_ipi(),
vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long
- vcpu_bitmap at bioapic_write_indirect() will be 128 bytes long
once patch "KVM: x86: Fix stack-out-of-bounds memory access
from ioapic_write_indirect()" is applied
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-3-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of requiring KVM_MAX_VCPU_ID to be manually increased
every time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS.
This should be enough for CPU topologies where Cores-per-Package
and Packages-per-Socket are not powers of 2.
In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152.
The only side effect of this change is making some fields in
struct kvm_ioapic larger, increasing the struct size from 1628 to
1780 bytes (in x86_64).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20210903211600.2002377-2-ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmEuJwwTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXp0ICACsx9NtQh1f9xGMClYrbobJfGmiwHVV
uKut/44Vg39tSyZB4mQt3A8YcaQj1Nibo6HVmxJtbKbrKwlTGAiQh5fiOmBOd7Re
/rII/S+CGtAyChI1adHTSL2xdk6WY0c7XQw+IPaERBikG5rO81Y6NLjFZNOv494k
JnG9uGGjAcJWFYylPcLxt4sR/hEfE4KDzsWjWOb5azYgo/RwOan6zYDdkUgocp4A
J+zmgCiME8LLmEV19gn7p4gpX7X9m5mcNgn53eICYPhrBqI0PTWocm6DepCEnrQ+
pEobIagWIMx5Dr7euEJwLxFSN7bdzleVOa4FSfM0zUsEjdbiPH47VQFM
=Vae6
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- make Hyper-V code arch-agnostic (Michael Kelley)
- fix sched_clock behaviour on Hyper-V (Ani Sinha)
- fix a fault when Linux runs as the root partition on MSHV (Praveen
Kumar)
- fix VSS driver (Vitaly Kuznetsov)
- cleanup (Sonia Sharma)
* tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer
Drivers: hv: Enable Hyper-V code to be built on ARM64
arm64: efi: Export screen_info
arm64: hyperv: Initialize hypervisor on boot
arm64: hyperv: Add panic handler
arm64: hyperv: Add Hyper-V hypercall and register access utilities
x86/hyperv: fix root partition faults when writing to VP assist page MSR
hv: hyperv.h: Remove unused inline functions
drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers
x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0
Drivers: hv: Move Hyper-V misc functionality to arch-neutral code
Drivers: hv: Add arch independent default functions for some Hyper-V handlers
Drivers: hv: Make portions of Hyper-V init code be arch neutral
x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable
asm-generic/hyperv: Add missing #include of nmi.h
core:
- extract i915 eDP backlight into core
- DP aux bus support
- drm_device.irq_enabled removed
- port drivers to native irq interfaces
- export gem shadow plane handling for vgem
- print proper driver name in framebuffer registration
- driver fixes for implicit fencing rules
- ARM fixed rate compression modifier added
- updated fb damage handling
- rmfb ioctl logging/docs
- drop drm_gem_object_put_locked
- define DRM_FORMAT_MAX_PLANES
- add gem fb vmap/vunmap helpers
- add lockdep_assert(once) helpers
- mark drm irq midlayer as legacy
- use offset adjusted bo mapping conversion
vgaarb:
- cleanups
fbdev:
- extend efifb handling to all arches
- div by 0 fixes for multiple drivers
udmabuf:
- add hugepage mapping support
dma-buf:
- non-dynamic exporter fixups
- document implicit fencing rules
amdgpu:
- Initial Cyan Skillfish support
- switch virtual DCE over to vkms based atomic
- VCN/JPEG power down fixes
- NAVI PCIE link handling fixes
- AMD HDMI freesync fixes
- Yellow Carp + Beige Goby fixes
- Clockgating/S0ix/SMU/EEPROM fixes
- embed hw fence in job
- rework dma-resv handling
- ensure eviction to system ram
amdkfd:
- uapi: SVM address range query added
- sysfs leak fix
- GPUVM TLB optimizations
- vmfault/migration counters
i915:
- Enable JSL and EHL by default
- preliminary XeHP/DG2 support
- remove all CNL support (never shipped)
- move to TTM for discrete memory support
- allow mixed object mmap handling
- GEM uAPI spring cleaning
- add I915_MMAP_OBJECT_FIXED
- reinstate ADL-P mmap ioctls
- drop a bunch of unused by userspace features
- disable and remove GPU relocations
- revert some i915 misfeatures
- major refactoring of GuC for Gen11+
- execbuffer object locking separate step
- reject caching/set-domain on discrete
- Enable pipe DMC loading on XE-LPD and ADL-P
- add PSF GV point support
- Refactor and fix DDI buffer translations
- Clean up FBC CFB allocation code
- Finish INTEL_GEN() and friends macro conversions
nouveau:
- add eDP backlight support
- implicit fence fix
msm:
- a680/7c3 support
- drm/scheduler conversion
panfrost:
- rework GPU reset
virtio:
- fix fencing for planes
ast:
- add detect support
bochs:
- move to tiny GPU driver
vc4:
- use hotplug irqs
- HDMI codec support
vmwgfx:
- use internal vmware device headers
ingenic:
- demidlayering irq
rcar-du:
- shutdown fixes
- convert to bridge connector helpers
zynqmp-dsub:
- misc fixes
mgag200:
- convert PLL handling to atomic
mediatek:
- MT8133 AAL support
- gem mmap object support
- MT8167 support
etnaviv:
- NXP Layerscape LS1028A SoC support
- GEM mmap cleanups
tegra:
- new user API
exynos:
- missing unlock fix
- build warning fix
- use refcount_t
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmEtvn8ACgkQDHTzWXnE
hr7aqw//WfcIyGdPLjAz59cW8jm+FgihD5colHtOUYRHRO4GeX/bNNufquR8+N3y
HESsyZdpihFHms/wURMq41ibmHg0EuHA01HZzjZuGBesG4F9I8sP/HnDOxDuYuAx
N7Lg4PlUNlfFHmw7Y84owQ6s/XWmNp5iZ8e/mTK5hcraJFQKS4QO74n9RbG/F1vC
Hc3P6AnpqGac2AEGXt0NjIRxVVCTUIBGx+XOhj+1AMyAGzt9VcO1DS9PVCS0zsEy
zKMj9tZAPNg0wYsXAi4kA1lK7uVY8KoXSVDYLpsI5Or2/e7mfq2b4EWrezbtp6UA
H+w86axuwJq7NaYHYH6HqyrLTOmvcHgIl2LoZN91KaNt61xfJT3XZkyQoYViGIrJ
oZy6X/+s+WPoW98bHZrr6vbcxtWKfEeQyUFEAaDMmraKNJwROjtwgFC9DP8MDctq
PUSM+XkwbGRRxQfv9dNKufeWfV5blVfzEJO8EfTU1YET3WTDaUHe/FoIcLZt2DZG
JAJgZkIlU8egthPdakUjQz/KoyLMyovcN5zcjgzgjA9PyNEq74uElN9l446kSSxu
jEVErOdd+aG3Zzk7/ZZL/RmpNQpPfpQ2RaPUkgeUsW01myNzUNuU3KUDaSlVa+Oi
1n7eKoaQ2to/+LjhYApVriri4hIZckNNn5FnnhkgwGi8mpHQIVQ=
=vZkA
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- i915 has seen a lot of refactoring and uAPI cleanups due to a
change in the upstream direction going forward
This has all been audited with known userspace, but there may be
some pitfalls that were missed.
- i915 now uses common TTM to enable discrete memory on DG1/2 GPUs
- i915 enables Jasper and Elkhart Lake by default and has preliminary
XeHP/DG2 support
- amdgpu adds support for Cyan Skillfish
- lots of implicit fencing rules documented and fixed up in drivers
- msm now uses the core scheduler
- the irq midlayer has been removed for non-legacy drivers
- the sysfb code now works on more than x86.
Otherwise the usual smattering of stuff everywhere, panels, bridges,
refactorings.
Detailed summary:
core:
- extract i915 eDP backlight into core
- DP aux bus support
- drm_device.irq_enabled removed
- port drivers to native irq interfaces
- export gem shadow plane handling for vgem
- print proper driver name in framebuffer registration
- driver fixes for implicit fencing rules
- ARM fixed rate compression modifier added
- updated fb damage handling
- rmfb ioctl logging/docs
- drop drm_gem_object_put_locked
- define DRM_FORMAT_MAX_PLANES
- add gem fb vmap/vunmap helpers
- add lockdep_assert(once) helpers
- mark drm irq midlayer as legacy
- use offset adjusted bo mapping conversion
vgaarb:
- cleanups
fbdev:
- extend efifb handling to all arches
- div by 0 fixes for multiple drivers
udmabuf:
- add hugepage mapping support
dma-buf:
- non-dynamic exporter fixups
- document implicit fencing rules
amdgpu:
- Initial Cyan Skillfish support
- switch virtual DCE over to vkms based atomic
- VCN/JPEG power down fixes
- NAVI PCIE link handling fixes
- AMD HDMI freesync fixes
- Yellow Carp + Beige Goby fixes
- Clockgating/S0ix/SMU/EEPROM fixes
- embed hw fence in job
- rework dma-resv handling
- ensure eviction to system ram
amdkfd:
- uapi: SVM address range query added
- sysfs leak fix
- GPUVM TLB optimizations
- vmfault/migration counters
i915:
- Enable JSL and EHL by default
- preliminary XeHP/DG2 support
- remove all CNL support (never shipped)
- move to TTM for discrete memory support
- allow mixed object mmap handling
- GEM uAPI spring cleaning
- add I915_MMAP_OBJECT_FIXED
- reinstate ADL-P mmap ioctls
- drop a bunch of unused by userspace features
- disable and remove GPU relocations
- revert some i915 misfeatures
- major refactoring of GuC for Gen11+
- execbuffer object locking separate step
- reject caching/set-domain on discrete
- Enable pipe DMC loading on XE-LPD and ADL-P
- add PSF GV point support
- Refactor and fix DDI buffer translations
- Clean up FBC CFB allocation code
- Finish INTEL_GEN() and friends macro conversions
nouveau:
- add eDP backlight support
- implicit fence fix
msm:
- a680/7c3 support
- drm/scheduler conversion
panfrost:
- rework GPU reset
virtio:
- fix fencing for planes
ast:
- add detect support
bochs:
- move to tiny GPU driver
vc4:
- use hotplug irqs
- HDMI codec support
vmwgfx:
- use internal vmware device headers
ingenic:
- demidlayering irq
rcar-du:
- shutdown fixes
- convert to bridge connector helpers
zynqmp-dsub:
- misc fixes
mgag200:
- convert PLL handling to atomic
mediatek:
- MT8133 AAL support
- gem mmap object support
- MT8167 support
etnaviv:
- NXP Layerscape LS1028A SoC support
- GEM mmap cleanups
tegra:
- new user API
exynos:
- missing unlock fix
- build warning fix
- use refcount_t"
* tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm: (1318 commits)
drm/amd/display: Move AllowDRAMSelfRefreshOrDRAMClockChangeInVblank to bounding box
drm/amd/display: Remove duplicate dml init
drm/amd/display: Update bounding box states (v2)
drm/amd/display: Update number of DCN3 clock states
drm/amdgpu: disable GFX CGCG in aldebaran
drm/amdgpu: Clear RAS interrupt status on aldebaran
drm/amdgpu: Add support for RAS XGMI err query
drm/amdkfd: Account for SH/SE count when setting up cu masks.
drm/amdgpu: rename amdgpu_bo_get_preferred_pin_domain
drm/amdgpu: drop redundant cancel_delayed_work_sync call
drm/amdgpu: add missing cleanups for more ASICs on UVD/VCE suspend
drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend
drm/amdkfd: map SVM range with correct access permission
drm/amdkfd: check access permisson to restore retry fault
drm/amdgpu: Update RAS XGMI Error Query
drm/amdgpu: Add driver infrastructure for MCA RAS
drm/amd/display: Add Logging for HDMI color depth information
drm/amd/amdgpu: consolidate PSP TA init shared buf functions
drm/amd/amdgpu: add name field back to ras_common_if
drm/amdgpu: Fix build with missing pm_suspend_target_state module export
...
- Enable memcg accounting for various networking objects.
BPF:
- Introduce bpf timers.
- Add perf link and opaque bpf_cookie which the program can read
out again, to be used in libbpf-based USDT library.
- Add bpf_task_pt_regs() helper to access user space pt_regs
in kprobes, to help user space stack unwinding.
- Add support for UNIX sockets for BPF sockmap.
- Extend BPF iterator support for UNIX domain sockets.
- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.
Protocols:
- Support IOAM Pre-allocated Trace with IPv6.
- Support Management Component Transport Protocol.
- bridge: multicast: add vlan support.
- netfilter: add hooks for the SRv6 lightweight tunnel driver.
- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP
- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing
- af_unix: add OOB notification support.
- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by
the router.
- mac80211: Target Wake Time support in AP mode.
- can: j1939: extend UAPI to notify about RX status.
Driver APIs:
- Add page frag support in page pool API.
- Many improvements to the DSA (distributed switch) APIs.
- ethtool: extend IRQ coalesce uAPI with timer reset modes.
- devlink: control which auxiliary devices are created.
- Support CAN PHYs via the generic PHY subsystem.
- Proper cross-chip support for tag_8021q.
- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.
Drivers:
- veth: more flexible channels number configuration.
- openvswitch: introduce per-cpu upcall dispatch.
- Add internet mix (IMIX) mode to pktgen.
- Transparently handle XDP operations in the bonding driver.
- Add LiteETH network driver.
- Renesas (ravb):
- support Gigabit Ethernet IP
- NXP Ethernet switch (sja1105)
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge
- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation
- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings
- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels
- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state
- Netronome Ethernet (nfp):
- add conntrack offload support
- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device
- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)
- Xen pv driver:
- harden netfront against malicious backends
- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces
Refactor:
- Ambient BPF run context and cgroup storage cleanup.
- Compat rework for ndo_ioctl.
Old code removal:
- prism54 remove the obsoleted driver, deprecated by the p54 driver.
- wan: remove sbni/granch driver.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmEukBYACgkQMUZtbf5S
IrsyHA//TO8dw18NYts4n9LmlJT2naJ7yBUUSSXK/M+DtW0MQ9nnHhqzPm5uJdRl
IgQTNJrW3dYzRwgqaWZqEwO1t5/FI+f87ND1Nsekg7x9tF66a6ov5WxU26TwwSba
U+si/inQ/4chuQ+LxMQobqCDxaLE46I2dIoRl+YfndJ24DRzYSwAEYIPPbSdfyU+
+/l+3s4GaxO4k/hLciPAiOniyxLoUNiGUTNh+2yqRBXelSRJRKVnl+V22ANFrxRW
nTEiplfVKhlPU1e4iLuRtaxDDiePHhw9I3j/lMHhfeFU2P/gKJIvz4QpGV0CAZg2
1VvDU32WEx1GQLXJbKm0KwoNRUq1QSjOyyFti+BO7ugGaYAR4gKhShOqlSYLzUtB
tbtzQhSNLWOGqgmSJOztZb5kFDm2EdRSll5/lP2uyFlPkIsIp0QbscJVzNTnS74b
Xz15ZOw41Z4TfWPEMWgfrx6Zkm7pPWkly+7WfUkPcHa1gftNz6tzXXxSXcXIBPdi
yQ5JCzzxrM5573YHuk5YedwZpn6PiAt4A/muFGk9C6aXP60TQAOS/ppaUzZdnk4D
NfOk9mj06WEULjYjPcKEuT3GGWE6kmjb8Pu0QZWKOchv7vr6oZly1EkVZqYlXELP
AfhcrFeuufie8mqm0jdb4LnYaAnqyLzlb1J4Zxh9F+/IX7G3yoc=
=JDGD
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Enable memcg accounting for various networking objects.
BPF:
- Introduce bpf timers.
- Add perf link and opaque bpf_cookie which the program can read out
again, to be used in libbpf-based USDT library.
- Add bpf_task_pt_regs() helper to access user space pt_regs in
kprobes, to help user space stack unwinding.
- Add support for UNIX sockets for BPF sockmap.
- Extend BPF iterator support for UNIX domain sockets.
- Allow BPF TCP congestion control progs and bpf iterators to call
bpf_setsockopt(), e.g. to switch to another congestion control
algorithm.
Protocols:
- Support IOAM Pre-allocated Trace with IPv6.
- Support Management Component Transport Protocol.
- bridge: multicast: add vlan support.
- netfilter: add hooks for the SRv6 lightweight tunnel driver.
- tcp:
- enable mid-stream window clamping (by user space or BPF)
- allow data-less, empty-cookie SYN with TFO_SERVER_COOKIE_NOT_REQD
- more accurate DSACK processing for RACK-TLP
- mptcp:
- add full mesh path manager option
- add partial support for MP_FAIL
- improve use of backup subflows
- optimize option processing
- af_unix: add OOB notification support.
- ipv6: add IFLA_INET6_RA_MTU to expose MTU value advertised by the
router.
- mac80211: Target Wake Time support in AP mode.
- can: j1939: extend UAPI to notify about RX status.
Driver APIs:
- Add page frag support in page pool API.
- Many improvements to the DSA (distributed switch) APIs.
- ethtool: extend IRQ coalesce uAPI with timer reset modes.
- devlink: control which auxiliary devices are created.
- Support CAN PHYs via the generic PHY subsystem.
- Proper cross-chip support for tag_8021q.
- Allow TX forwarding for the software bridge data path to be
offloaded to capable devices.
Drivers:
- veth: more flexible channels number configuration.
- openvswitch: introduce per-cpu upcall dispatch.
- Add internet mix (IMIX) mode to pktgen.
- Transparently handle XDP operations in the bonding driver.
- Add LiteETH network driver.
- Renesas (ravb):
- support Gigabit Ethernet IP
- NXP Ethernet switch (sja1105):
- fast aging support
- support for "H" switch topologies
- traffic termination for ports under VLAN-aware bridge
- Intel 1G Ethernet
- support getcrosststamp() with PCIe PTM (Precision Time
Measurement) for better time sync
- support Credit-Based Shaper (CBS) offload, enabling HW traffic
prioritization and bandwidth reservation
- Broadcom Ethernet (bnxt)
- support pulse-per-second output
- support larger Rx rings
- Mellanox Ethernet (mlx5)
- support ethtool RSS contexts and MQPRIO channel mode
- support LAG offload with bridging
- support devlink rate limit API
- support packet sampling on tunnels
- Huawei Ethernet (hns3):
- basic devlink support
- add extended IRQ coalescing support
- report extended link state
- Netronome Ethernet (nfp):
- add conntrack offload support
- Broadcom WiFi (brcmfmac):
- add WPA3 Personal with FT to supported cipher suites
- support 43752 SDIO device
- Intel WiFi (iwlwifi):
- support scanning hidden 6GHz networks
- support for a new hardware family (Bz)
- Xen pv driver:
- harden netfront against malicious backends
- Qualcomm mobile
- ipa: refactor power management and enable automatic suspend
- mhi: move MBIM to WWAN subsystem interfaces
Refactor:
- Ambient BPF run context and cgroup storage cleanup.
- Compat rework for ndo_ioctl.
Old code removal:
- prism54 remove the obsoleted driver, deprecated by the p54 driver.
- wan: remove sbni/granch driver"
* tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1715 commits)
net: Add depends on OF_NET for LiteX's LiteETH
ipv6: seg6: remove duplicated include
net: hns3: remove unnecessary spaces
net: hns3: add some required spaces
net: hns3: clean up a type mismatch warning
net: hns3: refine function hns3_set_default_feature()
ipv6: remove duplicated 'net/lwtunnel.h' include
net: w5100: check return value after calling platform_get_resource()
net/mlxbf_gige: Make use of devm_platform_ioremap_resourcexxx()
net: mdio: mscc-miim: Make use of the helper function devm_platform_ioremap_resource()
net: mdio-ipq4019: Make use of devm_platform_ioremap_resource()
fou: remove sparse errors
ipv4: fix endianness issue in inet_rtm_getroute_build_skb()
octeontx2-af: Set proper errorcode for IPv4 checksum errors
octeontx2-af: Fix static code analyzer reported issues
octeontx2-af: Fix mailbox errors in nix_rss_flowkey_cfg
octeontx2-af: Fix loop in free and unmap counter
af_unix: fix potential NULL deref in unix_dgram_connect()
dpaa2-eth: Replace strlcpy with strscpy
octeontx2-af: Use NDC TX for transmit packet data
...
which can be found on various ALi chipsets and is also available on older
Intel systems which expose a PIRQ router. While the Intel support is more
or less nostalgia, the ALi chips are still in use on popular embedded
boards used for routers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn2QTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQT1EACIvzRbycwclASIV6rBK5FMcVa2VuXR
GVqrfERPCUHQxnLshUJxnLk0NvZcQrLHjYl/QMCHBFOeEh3XrzU7JkKDW0Q8Dnov
QGFRtandKDwY4TwnCPKVdz/HeWMxNRT7OF4d08Q3iKCN5l39RLxraMixSrFL8soO
wgGcRTjbTa6HaMlqacFN7DwwiHxbIGJNepi0yqLZBV2dQOnZPd+ujV1FRSNXkv9p
vFPfuazk/psiSXy3x/+YVPUw+6h8DRDkflc9+wvSR+1cVl8eyrjkLgLH43ihddEN
Dl1SG5vKyCOtvQm+TEYdB5qjb/Zd4BjlbvKPJ+94OTtsjIIwxzInizkeTXiLHXnl
SDHX9Sc8L4sYP5+tAew1WMj8K2/p6FzdHm+sBJHd2JFSsMpeErI7p0y0Nz58E7pG
0cRqeWlq7rbGFPq544A8cgx/LjPkZT4LgutGpJ6f3NTZeLfj09xbFRqxNOHqAx+h
fp+36RNb1/j70Yz+4r7lLeDOVswbK+YxPIZGdnNfINTHeGllthDI5vaUL0L2jZnI
CnnKjss2a1WkDC8gczr/3QYcQRKrKDHL0hn0nUh+9laAaTSwNv3oRrkUWvMqwaT8
qSMMm5Eb84B4fZLyvPIcAwyC++JU/cVCgWEP37EzhYcvp6tq8GmR1cdi1lo2/K4O
qhg1d7loNh0eCg==
=R+c1
-----END PGP SIGNATURE-----
Merge tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PIRQ updates from Thomas Gleixner:
"A set of updates to support port 0x22/0x23 based PCI configuration
space which can be found on various ALi chipsets and is also available
on older Intel systems which expose a PIRQ router.
While the Intel support is more or less nostalgia, the ALi chips are
still in use on popular embedded boards used for routers"
* tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix typo s/ECLR/ELCR/ for the PIC register
x86: Avoid magic number with ELCR register accesses
x86/PCI: Add support for the Intel 82426EX PIRQ router
x86/PCI: Add support for the Intel 82374EB/82374SB (ESC) PIRQ router
x86/PCI: Add support for the ALi M1487 (IBC) PIRQ router
x86: Add support for 0x22/0x23 port I/O configuration space
A stop gap for potential future speculation related hardware
vulnerabilities and a mechanism for truly security paranoid
applications.
It allows a task to request that the L1D cache is flushed when the kernel
switches to a different mm. This can be requested via prctl().
Changes vs. the previous versions:
- Get rid of the software flush fallback
- Make the handling consistent with other mitigations
- Kill the task when it ends up on a SMT enabled core which defeats the
purpose of L1D flushing obviously
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn0oTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoa5fD/47vHGtjAtDr/DaXR1C6F9AvVbKEl8p
oNHn8IukE6ts6G4dFH9wUvo/Ut0K3kxX54I+BATew0LTy6tsQeUYh/xjwXMupgNV
oKOc9waoqdFvju3ayLFWJmuACLdXpyrGC1j35Aji61zSbR/GdtZ4oDxbuN2YJDAT
BTcgKrBM5nQm94JNa083RQSCU5LJxbC7ETkIh6NR73RSPCjUC1Wpxy1sAQAa2MPD
8EzcJ/DjVGaHCI7adX10sz3xdUcyOz7qYz16HpoMGx+oSiq7pGEBtUiK97EYMcrB
s+ADFUjYmx/pbEWv2r4c9zxNh7ZV3aLBsWwi7bScHIsv8GjrsA/mYLWskuwOV6BB
22qZjfd0c4raiJwd+nmSx+D2Szv6lZ20gP+krtP2VNC6hUv7ft0VPLySiaFMmUHj
quooDZis/W5n+4C9Q8Rk9uUtKzzJOngqW+duftiixHiNQ/ECP/QCAHhZYck/NOkL
tZkNj6lJj9+2iR7mhbYROZ+wrYQzRvqNb2pJJQoi/wA0q7wPSKBi3m+51lPsht5W
tn94CpaDDZ4IB7Fe1NtcA0UpYJSWpDQGlau4qp92HMCCIcRFfQEm+m9x8axwcj7m
ECblHJYBPHuNcCHvPA8kHvr1nd6UUXrGPIo8TK8YhUUbK6pO0OjdNzZX496ia/2g
pLzaW2ENTPLbXg==
=27wH
-----END PGP SIGNATURE-----
Merge tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache flush updates from Thomas Gleixner:
"A reworked version of the opt-in L1D flush mechanism.
This is a stop gap for potential future speculation related hardware
vulnerabilities and a mechanism for truly security paranoid
applications.
It allows a task to request that the L1D cache is flushed when the
kernel switches to a different mm. This can be requested via prctl().
Changes vs the previous versions:
- Get rid of the software flush fallback
- Make the handling consistent with other mitigations
- Kill the task when it ends up on a SMT enabled core which defeats
the purpose of L1D flushing obviously"
* tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: Add L1D flushing Documentation
x86, prctl: Hook L1D flushing in via prctl
x86/mm: Prepare for opt-in based L1D flush in switch_mm()
x86/process: Make room for TIF_SPEC_L1D_FLUSH
sched: Add task_work callback for paranoid L1D flush
x86/mm: Refactor cond_ibpb() to support other use cases
x86/smp: Add a per-cpu view of SMT state
- Add support for Intel Sapphire Rapids server CPU uncore events
- Allow the AMD uncore driver to be built as a module
- Misc cleanups and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrWgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iPbg/+LTi1ki2kswWF5Fmoo0IWZ1GiCkWD0rSm
SqCJR3u/MHd7QxsfPXFuhABt7mqaC50epfim0MNC1J11wRoNy6GzM/9nICdez9CK
c2c3jAwoEfWE7o5PTSjXolH0FFXVYwQ9WBRJTBCaZjuXUW7TOk7h9o8fXoF8SssU
VkP9z0YKanhN480v4k7/KR20ktY6GPFFu5cxhC0wyygZXIoG6ku+nmDjvN1ipo54
JQYQbCTz+dj/C0uZehbEoTZWx7cajAxlbq+Iyor3ND30YHLeRxWNEcq/Jzn9Szqa
LXnIi3mg4/mHc8mtZjDHXazpMxGYI02hkBACftPb2gizR4DOwtWlw1A2oHjbfmvM
oF29kcZmFU/Na2O5JTf0IpV4LLkt/OlrTsTcd5ROYWgmri3UV18MhaKz0R0vQNhI
8Xx6TJVA9fFHCIONg+4yxzDkYlxHJ9Tg8yFb06M6NWOnud03xYv4FGOzi0laoums
8XbYZUnMh8aVkXz05CYUadknu/ajMOSqAZZAstng3unazrutSCMkZ+Pzs4X+yqq5
Zz2Tb26oi6KepLD0eQNTmo1pRwuWC/IBqVF5aKH4e4qgyta/VxWob3Qd+I7ONaCl
6HpbaWz01Nw8U2wE4dQB0wKwIGlLE8bRQTS2QFuqaEHu0yZsQGt8zMwVWB+f93Gb
viglPyeA/y8=
=QrPT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event updates from Ingo Molnar:
- Add support for Intel Sapphire Rapids server CPU uncore events
- Allow the AMD uncore driver to be built as a module
- Misc cleanups and fixes
* tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
perf/amd/uncore: Allow the driver to be built as a module
x86/cpu: Add get_llc_id() helper function
perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/
perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL
perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
perf/x86/intel: Replace deprecated CPU-hotplug functions
perf/x86: Remove unused assignment to pointer 'e'
perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX
perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server
perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server
perf/x86/intel/uncore: Factor out snr_uncore_mmio_map()
perf/x86/intel/uncore: Add alias PMU name
perf/x86/intel/uncore: Add Sapphire Rapids server MDF support
perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support
perf/x86/intel/uncore: Add Sapphire Rapids server UPI support
perf/x86/intel/uncore: Add Sapphire Rapids server M2M support
perf/x86/intel/uncore: Add Sapphire Rapids server IMC support
perf/x86/intel/uncore: Add Sapphire Rapids server PCU support
perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support
...
is not up yet - delay that processing until everything is ready
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsp9MACgkQEsHwGGHe
VUp1ARAAtlYWtIQXUV+YiGGnueO3WZlbME/O/I6uz3KSIgZg59qaXPpscL7vCjS1
pFav1F3GKpplvkWAPBQf290yLYn934oLDanof1ENcAQxE46RD/CTr4V1J3xSzb31
+XEcwHoDFMdSHp219pinNWEDFGlYvb9/Q8AxCMFlUomRrPE/1tjo2l8gmZ7Wi1r5
uUwImmiYKnKoin3HpaY5n0NW+3madWhVfPhwv7Tsh0Hp0Qh5KRra+0OoLSGeKNDk
EDpKE/XKAjCNjcBLNIAWtLCHvPC1bWh2jC+Qmu8a9UGn4e5KmWw+GCUoOuBRuA6X
KOAoXtZXARsDDXadQKjNnQ4P3CvFcmNpCzjoFAarsflA3vKSHttDamk14bxE/mRx
AkNdw7oQYfOLAP2PVpPGgBEiU5WIoPwXdudDi6R8Tu107Z0mhiJIgbjPajJOvFpj
EnTO4k/A+Lk4uhPYYFhlhphDT/B8KdVlnEv0VJds0nPEE2uC3JiXxu7dHbBPYcXm
KtvJ5A+jDEMvFoPzoqwDBr7EzeHUUjyyI1Gym5EsjCNoH4WG8SyXYngCRrFOaIFe
e047UnJ4a7ibsFHAPrB9LtWLxvhPl13D4D+Ejb1GvnzDwSaAJAP9gVBGC0VpicId
NGEYAjsRsLGP2D3I4+Bhh1MN6mL/wlcmzO4izXXhMyc9vuTrXws=
=1ROV
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS update from Borislav Petkov:
"A single RAS change for 5.15:
- Do not start processing MCEs logged early because the decoding
chain is not up yet - delay that processing until everything is
ready"
* tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Defer processing of early errors
- Improve ftrace code patching so that stop_machine is not required anymore.
This requires a small common code patch acked by Steven Rostedt:
https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/
- Enable KCSAN for s390. This comes with a small common code change to fix a
compile warning. Acked by Marco Elver:
https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
- Add KFENCE support for s390. This also comes with a minimal x86 patch from
Marco Elver who said also this can be carried via the s390 tree:
https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/
- More changes to prepare the decompressor for relocation.
- Enable DAT also for CPU restart path.
- Final set of register asm removal patches; leaving only three locations where
needed and sane.
- Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO support to
hwcaps flags.
- Cleanup hwcaps implementation.
- Add new instructions to in-kernel disassembler.
- Various QDIO cleanups.
- Add SCLP debug feature.
- Various other cleanups and improvements all over the place.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmEs1GsACgkQIg7DeRsp
bsJ9+A/9FApCECNPgu6jOX4Ee+no+LxpCPUF8rvt56TFTLv7+Dhm7fJl0xQ9utsZ
FyLMDAr1/FKdm2wBW23QZH4vEIt1bd6e/03DwwK+6IjHKZHRIfB8eGJMsLj/TDzm
K6/+FI7qXjvpNXxgkCqXf5yESi/y5Dgr+16kTBhPZj5awRiwe5puPamji3uiQ45V
r4MdGCCC9BnTZvtPpUrr8ImnUqHJ4/TMo1YYdykLbZFuAvvYUyZ5YG5kh0pMa8JZ
DGJpfLQfy7ZNscIzdVhZtfzzESVtS6/AOeBzDMO1pbM1CGXtvpJJP0Wjlr/PGwoW
fvuMHpqTlDi+TfNZiPP5lwsFC89xSd6gtZH7vAuI8kFCXgW3RMjABF6h/mzpH1WO
jXyaSEZROc/83gxPMYyOYiqrKyAFPbpZ/Rnav2bvGQGneqx7RvmpF3GgA9WEo1PW
rMDoEbLstJuHk0E2uEV+OnQd5F7MHNonzpYfp/7pyQ+PL8w2GExV9yngVc/f3TqG
HYLC9rc3K6DkxZappcJm0qTb7lDTMFI7xK3g9RiqPQBJE1v1MYE/rai48nW69ypE
bRNL76AjyXKo+zKR2wlhJVMY1I1+DarMopHhZj6fzQT5te1LLsv8OuTU2gkt6dIq
YoSYOYvModf3HbKnJul2tszQG9yl+vpE9MiCyBQSsxIYXCriq/c=
=WDRh
-----END PGP SIGNATURE-----
Merge tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Improve ftrace code patching so that stop_machine is not required
anymore. This requires a small common code patch acked by Steven
Rostedt:
https://lore.kernel.org/linux-s390/20210730220741.4da6fdf6@oasis.local.home/
- Enable KCSAN for s390. This comes with a small common code change to
fix a compile warning. Acked by Marco Elver:
https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
- Add KFENCE support for s390. This also comes with a minimal x86 patch
from Marco Elver who said also this can be carried via the s390 tree:
https://lore.kernel.org/linux-s390/YQJdarx6XSUQ1tFZ@elver.google.com/
- More changes to prepare the decompressor for relocation.
- Enable DAT also for CPU restart path.
- Final set of register asm removal patches; leaving only three
locations where needed and sane.
- Add NNPA, Vector-Packed-Decimal-Enhancement Facility 2, PCI MIO
support to hwcaps flags.
- Cleanup hwcaps implementation.
- Add new instructions to in-kernel disassembler.
- Various QDIO cleanups.
- Add SCLP debug feature.
- Various other cleanups and improvements all over the place.
* tag 's390-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (105 commits)
s390: remove SCHED_CORE from defconfigs
s390/smp: do not use nodat_stack for secondary CPU start
s390/smp: enable DAT before CPU restart callback is called
s390: update defconfigs
s390/ap: fix state machine hang after failure to enable irq
KVM: s390: generate kvm hypercall functions
s390/sclp: add tracing of SCLP interactions
s390/debug: add early tracing support
s390/debug: fix debug area life cycle
s390/debug: keep debug data on resize
s390/diag: make restart_part2 a local label
s390/mm,pageattr: fix walk_pte_level() early exit
s390: fix typo in linker script
s390: remove do_signal() prototype and do_notify_resume() function
s390/crypto: fix all kernel-doc warnings in vfio_ap_ops.c
s390/pci: improve DMA translation init and exit
s390/pci: simplify CLP List PCI handling
s390/pci: handle FH state mismatch only on disable
s390/pci: fix misleading rc in clp_set_pci_fn()
s390/boot: factor out offset_vmlinux_info() function
...
Add <asm/amd-ibs.h> with bitfield definitions for IBS MSRs,
and demonstrate usage within the driver.
Also move 'struct perf_ibs_data' where it can be shared with
the perf tool that will soon be using it.
No functional changes.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-9-kim.phillips@amd.com
Factor out a helper function rather than export cpu_llc_id, which is
needed in order to be able to build the AMD uncore driver as a module.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.
During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.
But there is a problem. In:
5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.
This was partially fixed by:
cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")
which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.
Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.
[ Based on an original patch, commit message by and completely
productized by Tony Luck. ]
Fixes: 5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
When the 5-level page table CPU flag is set in the host, but the guest
has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
the shadow NPT page tables will be fixed, consisting of one pointer to
a lower-level table and 511 non-present entries. Extend the existing
code that creates the fixed PML4 or PDP table, to provide a fixed PML5
table if needed.
This is not needed on EPT because the number of layers in the tables
is specified in the EPTP instead of depending on the host CR4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD future CPUs will require a 5-level NPT if host CR4.LA57 is set.
To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
a fixed TDP level.
Signed-off-by: Wei Huang <wei.huang2@amd.com>
Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
while running.
This change is mostly intended for more robust single stepping
of the guest and it has the following benefits when enabled:
* Resuming from a breakpoint is much more reliable.
When resuming execution from a breakpoint, with interrupts enabled,
more often than not, KVM would inject an interrupt and make the CPU
jump immediately to the interrupt handler and eventually return to
the breakpoint, to trigger it again.
From the user point of view it looks like the CPU never executed a
single instruction and in some cases that can even prevent forward
progress, for example, when the breakpoint is placed by an automated
script (e.g lx-symbols), which does something in response to the
breakpoint and then continues the guest automatically.
If the script execution takes enough time for another interrupt to
arrive, the guest will be stuck on the same breakpoint RIP forever.
* Normal single stepping is much more predictable, since it won't
land the debugger into an interrupt handler.
* RFLAGS.TF has less chance to be leaked to the guest:
We set that flag behind the guest's back to do single stepping
but if single step lands us into an interrupt/exception handler
it will be leaked to the guest in the form of being pushed
to the stack.
This doesn't completely eliminate this problem as exceptions
can still happen, but at least this reduces the chances
of this happening.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Existing KVM code tracks the number of large pages regardless of their
sizes. Therefore, when large page of 1GB (or larger) is adopted, the
information becomes less useful because lpages counts a mix of 1G and 2M
pages.
So remove the lpages since it is easy for user space to aggregate the info.
Instead, provide a comprehensive page stats of all sizes from 4K to 512G.
Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20210803044607.599629-4-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
however, possible to track whether the feature was actually used by the
guest and only inhibit APICv/AVIC when needed.
TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
Windows know that AutoEOI feature should be avoided. While it's up to
KVM userspace to set the flag, KVM can help a bit by exposing global
APICv/AVIC enablement.
Maxim:
- always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
since this feature can be used regardless of AVIC
Paolo:
- use arch.apicv_update_lock to protect the hv->synic_auto_eoi_used
instead of atomic ops
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-12-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently on SVM, the kvm_request_apicv_update toggles the APICv
memslot without doing any synchronization.
If there is a mismatch between that memslot state and the AVIC state,
on one of the vCPUs, an APIC mmio access can be lost:
For example:
VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC mmio register.
Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause MMIO fault, but rather it will just be
read/written from/to the dummy page mapped into the
APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.
Fix that by adding a lock guarding the AVIC state changes, and carefully
order the operations of kvm_request_apicv_update to avoid this race:
1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock
This ensures that at (2) all vCPUs are kicked out of the guest mode,
but don't yet see the new avic state.
Then only after (4) all other vCPUs can update their AVIC state and resume.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thanks to the former patches, it is now possible to keep the APICv
memslot always enabled, and it will be invisible to the guest
when it is inhibited
This code is based on a suggestion from Sean Christopherson:
https://lkml.org/lkml/2021/7/19/2970
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce kvm_mmu_slot_lpages() to calculcate lpage_info and rmap array size.
The other __kvm_mmu_slot_lpages() can take an extra parameter of npages rather
than fetching from the memslot pointer. Start to use the latter one in
kvm_alloc_memslot_metadata().
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Invert the mask of bits that we pick from L2 in
nested_vmcb02_prepare_control
* Invert and explicitly use VIRQ related bits bitmask in svm_clear_vintr
This fixes a security issue that allowed a malicious L1 to run L2 with
AVIC enabled, which allowed the L2 to exploit the uninitialized and enabled
AVIC to read/write the host physical memory at some offsets.
Fixes: 3d6368ef58 ("KVM: SVM: Add VMRUN handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic). Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit ae561edeb4 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.
But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.
KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better
to be removed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define PIC_ELCR1 and PIC_ELCR2 macros for accesses to the ELCR registers
implemented by many chipsets in their embedded 8259A PIC cores, avoiding
magic numbers that are difficult to handle, and complementing the macros
we already have for registers originally defined with discrete 8259A PIC
implementations. No functional change.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200237300.9461@angie.orcam.me.uk
Define macros and accessors for the configuration space addressed
indirectly with an index register and a data register at the port I/O
locations of 0x22 and 0x23 respectively.
This space is defined by the Intel MultiProcessor Specification for the
IMCR register used to switch between the PIC and the APIC mode[1], by
Cyrix processors for their configuration[2][3], and also some chipsets.
Given the lack of atomicity with the indirect addressing a spinlock is
required to protect accesses, although for Cyrix processors it is enough
if accesses are executed with interrupts locally disabled, because the
registers are local to the accessing CPU, and IMCR is only ever poked at
by the BSP and early enough for interrupts not to have been configured
yet. Therefore existing code does not have to change or use the new
spinlock and neither it does.
Put the spinlock in a library file then, so that it does not get pulled
unnecessarily for configurations that do not refer it.
Convert Cyrix accessors to wrappers so as to retain the brevity and
clarity of the `getCx86' and `setCx86' calls.
References:
[1] "MultiProcessor Specification", Version 1.4, Intel Corporation,
Order Number: 242016-006, May 1997, Section 3.6.2.1 "PIC Mode", pp.
3-7, 3-8
[2] "5x86 Microprocessor", Cyrix Corporation, Order Number: 94192-00,
July 1995, Section 2.3.2.4 "Configuration Registers", p. 2-23
[3] "6x86 Processor", Cyrix Corporation, Order Number: 94175-01, March
1996, Section 2.4.4 "6x86 Configuration Registers", p. 2-23
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107182353140.9461@angie.orcam.me.uk
gfn_to_hva_cache is not thread-safe, so it is usually used only within
a vCPU (whose code is protected by vcpu->mutex). The Xen interface
implementation has such a cache in kvm->arch, but it is not really
used except to store the location of the shared info page. Replace
shinfo_set and shinfo_cache with just the value that is passed via
KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
initialization value is not zero anymore and therefore kvm_xen_init_vm
needs to be introduced.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For root partition the VP assist pages are pre-determined by the
hypervisor. The root kernel is not allowed to change them to
different locations. And thus, we are getting below stack as in
current implementation root is trying to perform write to specific
MSR.
[ 2.778197] unchecked MSR access error: WRMSR to 0x40000073 (tried to write 0x0000000145ac5001) at rIP: 0xffffffff810c1084 (native_write_msr+0x4/0x30)
[ 2.784867] Call Trace:
[ 2.791507] hv_cpu_init+0xf1/0x1c0
[ 2.798144] ? hyperv_report_panic+0xd0/0xd0
[ 2.804806] cpuhp_invoke_callback+0x11a/0x440
[ 2.811465] ? hv_resume+0x90/0x90
[ 2.818137] cpuhp_issue_call+0x126/0x130
[ 2.824782] __cpuhp_setup_state_cpuslocked+0x102/0x2b0
[ 2.831427] ? hyperv_report_panic+0xd0/0xd0
[ 2.838075] ? hyperv_report_panic+0xd0/0xd0
[ 2.844723] ? hv_resume+0x90/0x90
[ 2.851375] __cpuhp_setup_state+0x3d/0x90
[ 2.858030] hyperv_init+0x14e/0x410
[ 2.864689] ? enable_IR_x2apic+0x190/0x1a0
[ 2.871349] apic_intr_mode_init+0x8b/0x100
[ 2.878017] x86_late_time_init+0x20/0x30
[ 2.884675] start_kernel+0x459/0x4fb
[ 2.891329] secondary_startup_64_no_verify+0xb0/0xbb
Since the hypervisor already provides the VP assist pages for root
partition, we need to memremap the memory from hypervisor for root
kernel to use. The mapping is done in hv_cpu_init during bringup and is
unmapped in hv_cpu_die during teardown.
Signed-off-by: Praveen Kumar <kumarpraveen@linux.microsoft.com>
Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Link: https://lore.kernel.org/r/20210731120519.17154-1-kumarpraveen@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Based on our observations, after any vm-exit associated with vPMU, there
are at least two or more perf interfaces to be called for guest counter
emulation, such as perf_event_{pause, read_value, period}(), and each one
will {lock, unlock} the same perf_event_ctx. The frequency of calls becomes
more severe when guest use counters in a multiplexed manner.
Holding a lock once and completing the KVM request operations in the perf
context would introduce a set of impractical new interfaces. So we can
further optimize the vPMU implementation by avoiding repeated calls to
these interfaces in the KVM context for at least one pattern:
After we call perf_event_pause() once, the event will be disabled and its
internal count will be reset to 0. So there is no need to pause it again
or read its value. Once the event is paused, event period will not be
updated until the next time it's resumed or reprogrammed. And there is
also no need to call perf_event_period twice for a non-running counter,
considering the perf_event for a running counter is never paused.
Based on this implementation, for the following common usage of
sampling 4 events using perf on a 4u8g guest:
echo 0 > /proc/sys/kernel/watchdog
echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
for i in `seq 1 1 10`
do
taskset -c 0 perf record \
-e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
/root/br_instr a
done
the average latency of the guest NMI handler is reduced from
37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
Also, in addition to collecting more samples, no loss of sampling
accuracy was observed compared to before the optimization.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210728120705.6855-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
As alluded to in commit f36f3f2846 ("KVM: add "new" argument to
kvm_arch_commit_memory_region"), a bunch of other places where struct
kvm_memory_slot is used, needs to be refactored to preserve the
"const"ness of struct kvm_memory_slot across-the-board.
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
[Do not touch body of slot_rmap_walk_init. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the EDX initialization at vCPU RESET, which is now identical between
VMX and SVM, into common code.
No functional change intended.
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
for the vm.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210625153214.43106-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86's <asm/tlbflush.h> only declares non-module accessible functions
(such as flush_tlb_one_kernel) if !MODULE.
In preparation of including <asm/kfence.h> from the KFENCE test module,
only define the helpers if !MODULE to avoid breaking the build with
CONFIG_KFENCE_KUNIT_TEST=m.
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/YQJdarx6XSUQ1tFZ@elver.google.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The goal of this is to allow tasks that want to protect sensitive
information, against e.g. the recently found snoop assisted data sampling
vulnerabilites, to flush their L1D on being switched out. This protects
their data from being snooped or leaked via side channels after the task
has context switched out.
This could also be used to wipe L1D when an untrusted task is switched in,
but that's not a really well defined scenario while the opt-in variant is
clearly defined.
The mechanism is default disabled and can be enabled on the kernel command
line.
Prepare for the actual prctl based opt-in:
1) Provide the necessary setup functionality similar to the other
mitigations and enable the static branch when the command line option
is set and the CPU provides support for hardware assisted L1D
flushing. Software based L1D flush is not supported because it's CPU
model specific and not really well defined.
This does not come with a sysfs file like the other mitigations
because it is not bound to any specific vulnerability.
Support has to be queried via the prctl(2) interface.
2) Add TIF_SPEC_L1D_FLUSH next to L1D_SPEC_IB so the two bits can be
mangled into the mm pointer in one go which allows to reuse the
existing mechanism in switch_mm() for the conditional IBPB speculation
barrier efficiently.
3) Add the L1D flush specific functionality which flushes L1D when the
outgoing task opted in.
Also check whether the incoming task has requested L1D flush and if so
validate that it is not accidentaly running on an SMT sibling as this
makes the whole excercise moot because SMT siblings share L1D which
opens tons of other attack vectors. If that happens schedule task work
which signals the incoming task on return to user/guest with SIGBUS as
this is part of the paranoid L1D flush contract.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
The upcoming support for paranoid L1D flush in switch_mm() requires that
TIF_SPEC_IB and the new TIF_SPEC_L1D_FLUSH are two consecutive bits in
thread_info::flags.
Move TIF_SPEC_FORCE_UPDATE to a spare bit to make room for the new one.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
cond_ibpb() has the necessary bits required to track the previous mm in
switch_mm_irqs_off(). This can be reused for other use cases like L1D
flushing on context switch.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-3-sblbir@amazon.com
A new field smt_active in cpuinfo_x86 identifies if the current core/cpu
is in SMT mode or not.
This is helpful when the system has some of its cores with threads offlined
and can be used for cases where action is taken based on the state of SMT.
The upcoming support for paranoid L1D flush will make use of this information.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-2-sblbir@amazon.com
Parts of linux/compat.h are under an #ifdef, but we end up
using more of those over time, moving things around bit by
bit.
To get it over with once and for all, make all of this file
uncondititonal now so it can be accessed everywhere. There
are only a few types left that are in asm/compat.h but not
yet in the asm-generic version, so add those in the process.
This requires providing a few more types in asm-generic/compat.h
that were not already there. The only tricky one is
compat_sigset_t, which needs a little help on 32-bit architectures
and for x86.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmD5TGAACgkQ/lWMcqZw
E8PNgxAApjTYQSfjIBbOZnNraxW6w7/bPea35E9A47EdBQsNGnYftNsFjbrn/mCJ
D+0eRLjCMlg4FF1SHdh9cPJ35py+ygbDeupogboLITfU99eGBth3fM2Xdg9LPcBh
dbni/JLG9R7gIvSlqdJuweN21trfVrV/9FQEilG5DvQcl27Wx5g8VMRZke1EqGKX
7Id09Uq50ky18vhDjQRCveYhRqJAxV+XozBatzHyxpDVzjLQvRhlAAYdvrSMHZ5R
jreGzOfR8awc6Om+w7wx3Jn1oEGmXVZB/VqxEqGtMOr3lpARPucxrqfHsqpam3rv
yIoEKPrkG+k6fsU7Tbg59jNqe/PbCUW3AlpyuBxf55EbnVGgjLDbq4sRRMkehPfA
fhC31ujOXQQnAgaxyeQAaAJFKNFJzA8Cq5ZPfG+zztzuomHCiUVQBRowP65hJMzR
+ZlEDnhUD3STLz39zuO1reZR1ZoPIvKbsokHAA+ZrIwUd6U3D3ia8V51pq+lL5aS
TGDkyMN9jyZ+SO8Z7+2FnJAv9FAOPU/WCLU/fWW46jAvuezwMIwVcjfSqDU2XbZD
e7KgHpHhx3BGxI8TThHKlY7mf6IL2Bm7X1Cv1pdZs/eEn3Udh2ax942uTQZu/YOO
0AT1XchpvYCBNRw05bVI3OlJ+w3I8uV+h+11jHOKeY6cbwdHeKE=
=BUya
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2021-07-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.15-rc1:
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/2d0d2fe8-01fc-e216-c3fd-38db9e69944e@linux.intel.com
The x86 architecture has generic support to register a system framebuffer
platform device. It either registers a "simple-framebuffer" if the config
option CONFIG_X86_SYSFB is enabled, or a legacy VGA/VBE/EFI FB device.
But the code is generic enough to be reused by other architectures and can
be moved out of the arch/x86 directory.
This will allow to also support the simple{fb,drm} drivers on non-x86 EFI
platforms, such as aarch64 where these drivers are only supported with DT.
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625130947.1803678-2-javierm@redhat.com
The code to allocate and initialize the hv_vp_index array is
architecture neutral. Similarly, the code to allocate and
populate the hypercall input and output arg pages is architecture
neutral. Move both sets of code out from arch/x86 and into
utility functions in drivers/hv/hv_common.c that can be shared
by Hyper-V initialization on ARM64.
No functional changes. However, the allocation of the hypercall
input and output arg pages is done differently so that the
size is always the Hyper-V page size, even if not the same as
the guest page size (such as with ARM64's 64K page size).
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
- Prevent sigaltstack out of bounds writes. The kernel unconditionally
writes the FPU state to the alternate stack without checking whether
the stack is large enough to accomodate it.
Check the alternate stack size before doing so and in case it's too
small force a SIGSEGV instead of silently corrupting user space data.
- MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been
updated despite the fact that the FPU state which is stored on the
signal stack has grown over time which causes trouble in the field
when AVX512 is available on a CPU. The kernel does not expose the
minimum requirements for the alternate stack size depending on the
available and enabled CPU features.
ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
Add it to x86 as well
- A major cleanup of the x86 FPU code. The recent discoveries of XSTATE
related issues unearthed quite some inconsistencies, duplicated code
and other issues.
The fine granular overhaul addresses this, makes the code more robust
and maintainable, which allows to integrate upcoming XSTATE related
features in sane ways.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDlcpETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoeP5D/4i+AgYYeiMLgGb+NS7iaKPfoWo6LIz
y3qdTSA0DQaIYbYivWwRO/g0GYdDMXDWeZalFi7eGnVI8O3eOog+22Zrf/y0UINB
KJHdYd4ApWHhs401022y5hexrWQvnV8w1yQCuj/zLm6eC+AVhdwt2AY+IBoRrdUj
wqY97B/4rJNsBvvqTDn9EeDrJA2y0y0Suc7AhIp2BGMI+dpIdxys8RJDamXNWyDL
gJf0YRgUoiIn3AHKb+fgv60AoxfC175NSg/5/y/scFNXqVlW0Up4YCb7pqG9o2Ga
f3XvtWfbw1N5PmUYjFkALwEkzGUbM3v0RA3xLY2j2WlWm9fBPPy59dt+i/h/VKyA
GrA7i7lcIqX8dfVH6XkrReZBkRDSB6t9SZTvV54jAz5fcIZO2Rg++UFUvI/R6GKK
XCcxukYaArwo+IG62iqDszS3gfLGhcor/cviOeULRC5zMUIO4Jah+IhDnifmShtC
M5s9QzrwIRD/XMewGRQmvkiN4kBfE7jFoBQr1J9leCXJKrM+2JQmMzVInuubTQIq
SdlKOaAIn7xtekz+6XdFG9Gmhck0PCLMJMOLNvQkKWI3KqGLRZ+dAWKK0vsCizAx
0BA7ZeB9w9lFT+D8mQCX77JvW9+VNwyfwIOLIrJRHk3VqVpS5qvoiFTLGJJBdZx4
/TbbRZu7nXDN2w==
=Mq1m
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
"Fixes and improvements for FPU handling on x86:
- Prevent sigaltstack out of bounds writes.
The kernel unconditionally writes the FPU state to the alternate
stack without checking whether the stack is large enough to
accomodate it.
Check the alternate stack size before doing so and in case it's too
small force a SIGSEGV instead of silently corrupting user space
data.
- MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never
been updated despite the fact that the FPU state which is stored on
the signal stack has grown over time which causes trouble in the
field when AVX512 is available on a CPU. The kernel does not expose
the minimum requirements for the alternate stack size depending on
the available and enabled CPU features.
ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
Add it to x86 as well.
- A major cleanup of the x86 FPU code. The recent discoveries of
XSTATE related issues unearthed quite some inconsistencies,
duplicated code and other issues.
The fine granular overhaul addresses this, makes the code more
robust and maintainable, which allows to integrate upcoming XSTATE
related features in sane ways"
* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
x86/fpu/signal: Let xrstor handle the features to init
x86/fpu/signal: Handle #PF in the direct restore path
x86/fpu: Return proper error codes from user access functions
x86/fpu/signal: Split out the direct restore code
x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
x86/fpu/signal: Sanitize the xstate check on sigframe
x86/fpu/signal: Remove the legacy alignment check
x86/fpu/signal: Move initial checks into fpu__restore_sig()
x86/fpu: Mark init_fpstate __ro_after_init
x86/pkru: Remove xstate fiddling from write_pkru()
x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
x86/fpu: Remove PKRU handling from switch_fpu_finish()
x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
x86/fpu: Hook up PKRU into ptrace()
x86/fpu: Add PKRU storage outside of task XSAVE buffer
x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
...
The get_unaligned()/put_unaligned() helpers are traditionally architecture
specific, with the two main variants being the "access-ok.h" version
that assumes unaligned pointer accesses always work on a particular
architecture, and the "le-struct.h" version that casts the data to a
byte aligned type before dereferencing, for architectures that cannot
always do unaligned accesses in hardware.
Based on the discussion linked below, it appears that the access-ok
version is not realiable on any architecture, but the struct version
probably has no downsides. This series changes the code to use the
same implementation on all architectures, addressing the few exceptions
separately.
Link: https://lore.kernel.org/lkml/75d07691-1e4f-741f-9852-38c0b4f520bc@synopsys.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
Link: https://lore.kernel.org/lkml/20210507220813.365382-14-arnd@kernel.org/
Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2
Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmDfFx4ACgkQmmx57+YA
GNkqzRAAjdlIr8M+xI2CyT0/A9tswYfLMeWejmYopq3zlxI6RnvPiJJDIdY2I8US
1npIiDo55w061CnXL9rV65ocL3XmGu1mabOvgM6ATsec+8t4WaXBV9tysxTJ9ea0
ltLTa2P5DXWALvWiVMTME7hFaf1cW+8Uqt3LmXxDp2l5zasXajCHAH6YokON2PfM
CsaRhwSxIu8Sbnu/IQGBI9JW5UXsBfKSyUwtM0OwP7jFOuIeZ4WBVA+j6UxONnFC
wouKmAM/ThoOsaV9aP4EZLIfBx8d4/hfYQjZ958kYXurerruYkJeEqdIRbV0QqTy
2O6ZrJ6uqPlzfWz9h458me2dt98YEtALHV/3DCWUcBfHmUQtxElyJYEhG0YjVF3H
5RYtjw8Q2LS/QR5ask1Xn0JfT89rRnLi2migAtsA4Ce70JP4Us6wGobkj4SHlgDt
P7+eVq2Mkhqw/kmV8N4p+ZS5lpkK0JniDN+ONDhkZqHL/zXG/HQzx9wLV69jlvo2
ASevKxITdi+bKHWs5ANungkBOnBUQZacq46mVyi4HPDwMAFyWvVYTbFumy9koagQ
o9NEgX3RsZcxxi7bU1xuFPFMLMlUQT3Nb30+84B4fKe9FmvHC1hizTiCnp7q4bZr
z6a6AMHke7YLqKZOqzTJGRR3lPoZZDCb775SAd70LQp6XPZXOHs=
=IY5U
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm/unaligned.h unification from Arnd Bergmann:
"Unify asm/unaligned.h around struct helper
The get_unaligned()/put_unaligned() helpers are traditionally
architecture specific, with the two main variants being the
"access-ok.h" version that assumes unaligned pointer accesses always
work on a particular architecture, and the "le-struct.h" version that
casts the data to a byte aligned type before dereferencing, for
architectures that cannot always do unaligned accesses in hardware.
Based on the discussion linked below, it appears that the access-ok
version is not realiable on any architecture, but the struct version
probably has no downsides. This series changes the code to use the
same implementation on all architectures, addressing the few
exceptions separately"
Link: https://lore.kernel.org/lkml/75d07691-1e4f-741f-9852-38c0b4f520bc@synopsys.com/
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
Link: https://lore.kernel.org/lkml/20210507220813.365382-14-arnd@kernel.org/
Link: git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-rework-v2
Link: https://lore.kernel.org/lkml/CAHk-=whGObOKruA_bU3aPGZfoDqZM1_9wBkwREp0H0FgR-90uQ@mail.gmail.com/
* tag 'asm-generic-unaligned-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: simplify asm/unaligned.h
asm-generic: uaccess: 1-byte access is always aligned
netpoll: avoid put_unaligned() on single character
mwifiex: re-fix for unaligned accesses
apparmor: use get_unaligned() only for multi-byte words
partitions: msdos: fix one-byte get_unaligned()
asm-generic: unaligned always use struct helpers
asm-generic: unaligned: remove byteshift helpers
powerpc: use linux/unaligned/le_struct.h on LE power7
m68k: select CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
sh: remove unaligned access for sh4a
openrisc: always use unaligned-struct header
asm-generic: use asm-generic/unaligned.h for most architectures
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently most platforms define pmd_pgtable() as pmd_page() duplicating
the same code all over. Instead just define a default value i.e
pmd_page() for pmd_pgtable() and let platforms override when required via
<asm/pgtable.h>. All the existing platform that override pmd_pgtable()
have been moved into their respective <asm/pgtable.h> header in order to
precede before the new generic definition. This makes it much cleaner
with reduced code.
Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the
same code all over. Instead just define a generic default value (i.e 0UL)
for FIRST_USER_ADDRESS and let the platforms override when required. This
makes it much cleaner with reduced code.
The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Guo Ren <guoren@kernel.org> [csky]
Acked-by: Stafford Horne <shorne@gmail.com> [openrisc]
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V]
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Consolidate the macros for .byte ... opcode sequences
- Deduplicate register offset defines in include files
- Simplify the ia32,x32 compat handling of the related syscall tables to
get rid of #ifdeffery.
- Clear all EFLAGS which are not required for syscall handling
- Consolidate the syscall tables and switch the generation over to the
generic shell script and remove the CFLAGS tweaks which are not longer
required.
- Use 'int' type for system call numbers to match the generic code.
- Add more selftests for syscalls
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbKzMTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoae8D/9+pksdf8lE5dRLtngSeTDLiyIV+qq4
vSks7XfrTTAhOV2nRwtIulc2CO6H7jcvn6ehmiC/X0Tn9JK5brwSJJYryNEjA3cp
3p9jPrB1w1SDhx35JzILN4DDaJfI3jobLSLDq0KQzuEL0+c0R4l3WBplpCzbLjqj
NaFQgslf8RSnjha9NLTKzlzSaNNNo9Ioo6DyrsBDEdcRBtAPlFfdVtT3oJE73ANH
dK5POoVWysmAnDAwEW17j9bBJLtxeWsrhM9CrtqvcKr3HhK9WjWUFAr+diQf5GKf
BAD2A+5y8wZQXvFOuC9WZxfQwUFSLExt8BfcXblOUbf2CdlvoYVzOlvI141kA++4
q4wQ1vl6MbLCp6wLysc3bnwKUEmnf2E4Iyj5JR2aFrw096pAoZ3ZbAQi7s3Vhb16
aSbGxIw3rHRuB0f8VmOA0iEHiXlkRmE/K+nH1/uDTUZLaDpktPvpKQJsp0+9qXFk
eVtEw4bVKJ7q5ozjMzpm9aPxPp1v8MGxUOJOy80W7Ti+vBp2KmMKc1gy8QsYrTvW
Vzvpp3U+/WFh2X7AG0zlP/JEnOuJmMwMK5QhzMC2rEbaHJ66ht7SABvtSbOHHw5Z
zugxTE0lx3n7izCxW1RLEu//xtWY0FbU2L5oE2Ace27myUPeBQCDJzynUn93dMM9
9nq2TtgTCF6XvA==
=+sb9
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry code related updates from Thomas Gleixner:
- Consolidate the macros for .byte ... opcode sequences
- Deduplicate register offset defines in include files
- Simplify the ia32,x32 compat handling of the related syscall tables
to get rid of #ifdeffery.
- Clear all EFLAGS which are not required for syscall handling
- Consolidate the syscall tables and switch the generation over to the
generic shell script and remove the CFLAGS tweaks which are not
longer required.
- Use 'int' type for system call numbers to match the generic code.
- Add more selftests for syscalls
* tag 'x86-entry-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/syscalls: Don't adjust CFLAGS for syscall tables
x86/syscalls: Remove -Wno-override-init for syscall tables
x86/uml/syscalls: Remove array index from syscall initializers
x86/syscalls: Clear 'offset' and 'prefix' in case they are set in env
x86/entry: Use int everywhere for system call numbers
x86/entry: Treat out of range and gap system calls the same
x86/entry/64: Sign-extend system calls on entry to int
selftests/x86/syscall: Add tests under ptrace to syscall_numbering_64
selftests/x86/syscall: Simplify message reporting in syscall_numbering
selftests/x86/syscall: Update and extend syscall_numbering_64
x86/syscalls: Switch to generic syscallhdr.sh
x86/syscalls: Use __NR_syscalls instead of __NR_syscall_max
x86/unistd: Define X32_NR_syscalls only for 64-bit kernel
x86/syscalls: Stop filling syscall arrays with *_sys_ni_syscall
x86/syscalls: Switch to generic syscalltbl.sh
x86/entry/x32: Rename __x32_compat_sys_* to __x64_compat_sys_*
- Consolidate the VECTOR defines and the usage sites.
- Cleanup GDT/IDT related code and replace open coded ASM with proper
native helfper functions.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbLAUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTiXEACiuisDJ2fYFqU1dmYRbWIDtWbgsJ3k
CVABRjgCbGfviKaaJuMoHf5tbnXWWu7y8jd8Z+h9cwOlyQOzNBsZjplzPS0h8zME
KAekAkO2VGf5G7VdWLrfMvjIY/NDuAgxj+7w01LvnyWROePGRkbeP3iH41qo+auM
5Cj4lu333+rO4kzmdXzwJ7CHQXOa/OT0MrBL14saYFaM3qSSkCzeIXnE6/ZNapsE
zZYOCDF19MpPm6GZT1i4qRxirhw1TLNycsYavlOxZ/Hyp0BO0t2TiNRwZtdIVz+a
1sedm+pD9E+1qHQfB+P03P65OixxN0hArNlKgGou5LDMRF45pvfqQXEBbTsqHSxh
vWlL/tK7Z7U5dsK7ZA0HvlZYdrunWn/cNMqWb08WDyuPLxJ0QxJjsdOB2teVEus+
kNYsP0ZxRvPNHKtqVfTXGS8ksrNS/57lUz6UJmBA3UYhYg33UgPCfF/gQzTnpfSo
4TzhWIeLlCOId9FPxXpXa4NjjsqXvNEOPGrTx4BY8SYHYln4HoSyffRIZQ8xl0lA
Qfetod+Hajt+5JXGndb906kexY7i14ZOrkHEjkUtq0asNmbwJ+hVs2VaYcq/ghuS
BmhlnarYuWw9t11yD9Ln5stoVgRJ2KEX5T9fOCtCsJZyHo+Eta/p14ocU0eLQQdh
HbsRKB+pE+al2A==
=eAPe
-----END PGP SIGNATURE-----
Merge tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 interrupt related updates from Thomas Gleixner:
- Consolidate the VECTOR defines and the usage sites.
- Cleanup GDT/IDT related code and replace open coded ASM with proper
native helper functions.
* tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c
x86: Add native_[ig]dt_invalidate()
x86/idt: Remove address argument from idt_invalidate()
x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
x86/irq: Remove unused vectors defines
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of live migration
- Support for initializing the PDPTRs without loading them from memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this has
been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on
AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV9UYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIRgf/XX8fKLh24RnTOs2ldIu2AfRGVrT4
QMrr8MxhmtukBAszk2xKvBt8/6gkUjdaIC3xqEnVjxaDaUvZaEtP7CQlF5JV45rn
iv1zyxUKucXrnIOr+gCioIT7qBlh207zV35ArKioP9Y83cWx9uAs22pfr6g+7RxO
h8bJZlJbSG6IGr3voANCIb9UyjU1V/l8iEHqRwhmr/A5rARPfD7g8lfMEQeGkzX6
+/UydX2fumB3tl8e2iMQj6vLVdSOsCkehvpHK+Z33EpkKhan7GwZ2sZ05WmXV/nY
QLAYfD10KegoNWl5Ay4GTp4hEAIYVrRJCLC+wnLdc0U8udbfCuTC31LK4w==
=NcRh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This covers all architectures (except MIPS) so I don't expect any
other feature pull requests this merge window.
ARM:
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration and
apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical
address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of
live migration
- Support for initializing the PDPTRs without loading them from
memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this
has been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM
registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
on AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
kvm: x86: disable the narrow guest module parameter on unload
selftests: kvm: Allows userspace to handle emulation errors.
kvm: x86: Allow userspace to handle emulation errors
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
KVM: x86: Enhance comments for MMU roles and nested transition trickiness
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
KVM: x86/mmu: Use MMU's role to determine PTTYPE
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
...
- Optimise SVE switching for CPUs with 128-bit implementations.
- Fix output format from SVE selftest.
- Add support for versions v1.2 and 1.3 of the SMC calling convention.
- Allow Pointer Authentication to be configured independently for
kernel and userspace.
- PMU driver cleanups for managing IRQ affinity and exposing event
attributes via sysfs.
- KASAN optimisations for both hardware tagging (MTE) and out-of-line
software tagging implementations.
- Relax frame record alignment requirements to facilitate 8-byte
alignment with KASAN and Clang.
- Cleanup of page-table definitions and removal of unused memory types.
- Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
- Refactoring of our instruction decoding routines and addition of some
missing encodings.
- Move entry code moved into C and hardened against harmful compiler
instrumentation.
- Update booting requirements for the FEAT_HCX feature, added to v8.7
of the architecture.
- Fix resume from idle when pNMI is being used.
- Additional CPU sanity checks for MTE and preparatory changes for
systems where not all of the CPUs support 32-bit EL0.
- Update our kernel string routines to the latest Cortex Strings
implementation.
- Big cleanup of our cache maintenance routines, which were confusingly
named and inconsistent in their implementations.
- Tweak linker flags so that GDB can understand vmlinux when using RELR
relocations.
- Boot path cleanups to enable early initialisation of per-cpu
operations needed by KCSAN.
- Non-critical fixes and miscellaneous cleanup.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmDUh1YQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNDaUCAC+2Jy2Yopd94uBPYajGybM0rqCUgE7b5n1
A7UzmQ6fia2hwqCPmxGG+sRabovwN7C1bKrUCc03RIbErIa7wum1edeyqmF/Aw44
DUDY1MAOSZaFmX8L62QCvxG1hfdLPtGmHMd1hdXvxYK7PCaigEFnzbLRWTtgE+Ok
JhdvNfsoeITJObHnvYPF3rV3NAbyYni9aNJ5AC/qb3dlf6XigEraXaMj29XHKfwc
+vmn+25oqFkLHyFeguqIoK+vUQAy/8TjFfjX83eN3LZknNhDJgWS1Iq1Nm+Vxt62
RvDUUecWJjAooCWgmil6pt0enI+q6E8LcX3A3cWWrM6psbxnYzkU
=I6KS
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"There's a reasonable amount here and the juicy details are all below.
It's worth noting that the MTE/KASAN changes strayed outside of our
usual directories due to core mm changes and some associated changes
to some other architectures; Andrew asked for us to carry these [1]
rather that take them via the -mm tree.
Summary:
- Optimise SVE switching for CPUs with 128-bit implementations.
- Fix output format from SVE selftest.
- Add support for versions v1.2 and 1.3 of the SMC calling
convention.
- Allow Pointer Authentication to be configured independently for
kernel and userspace.
- PMU driver cleanups for managing IRQ affinity and exposing event
attributes via sysfs.
- KASAN optimisations for both hardware tagging (MTE) and out-of-line
software tagging implementations.
- Relax frame record alignment requirements to facilitate 8-byte
alignment with KASAN and Clang.
- Cleanup of page-table definitions and removal of unused memory
types.
- Reduction of ARCH_DMA_MINALIGN back to 64 bytes.
- Refactoring of our instruction decoding routines and addition of
some missing encodings.
- Move entry code moved into C and hardened against harmful compiler
instrumentation.
- Update booting requirements for the FEAT_HCX feature, added to v8.7
of the architecture.
- Fix resume from idle when pNMI is being used.
- Additional CPU sanity checks for MTE and preparatory changes for
systems where not all of the CPUs support 32-bit EL0.
- Update our kernel string routines to the latest Cortex Strings
implementation.
- Big cleanup of our cache maintenance routines, which were
confusingly named and inconsistent in their implementations.
- Tweak linker flags so that GDB can understand vmlinux when using
RELR relocations.
- Boot path cleanups to enable early initialisation of per-cpu
operations needed by KCSAN.
- Non-critical fixes and miscellaneous cleanup"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (150 commits)
arm64: tlb: fix the TTL value of tlb_get_level
arm64: Restrict undef hook for cpufeature registers
arm64/mm: Rename ARM64_SWAPPER_USES_SECTION_MAPS
arm64: insn: avoid circular include dependency
arm64: smp: Bump debugging information print down to KERN_DEBUG
drivers/perf: fix the missed ida_simple_remove() in ddr_perf_probe()
perf/arm-cmn: Fix invalid pointer when access dtc object sharing the same IRQ number
arm64: suspend: Use cpuidle context helpers in cpu_suspend()
PSCI: Use cpuidle context helpers in psci_cpu_suspend_enter()
arm64: Convert cpu_do_idle() to using cpuidle context helpers
arm64: Add cpuidle context save/restore helpers
arm64: head: fix code comments in set_cpu_boot_mode_flag
arm64: mm: drop unused __pa(__idmap_text_start)
arm64: mm: fix the count comments in compute_indices
arm64/mm: Fix ttbr0 values stored in struct thread_info for software-pan
arm64: mm: Pass original fault address to handle_mm_fault()
arm64/mm: Drop SECTION_[SHIFT|SIZE|MASK]
arm64/mm: Use CONT_PMD_SHIFT for ARM64_MEMSTART_SHIFT
arm64/mm: Drop SWAPPER_INIT_MAP_SIZE
arm64: Conditionally configure PTR_AUTH key of the kernel.
...
- Micro-optimize and standardize the do_syscall_64() calling convention
- Make syscall entry flags clearing more conservative
- Clean up syscall table handling
- Clean up & standardize assembly macros, in preparation of FRED
- Misc cleanups and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZeG8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gHQw//fI9MAIVQbB6tVMH6GtFkQZIJLMt/bik5
AWelEXoBUbbLFGKpugC+oWGJjsvZ026f65hfQEswuqD4n0Xx8FFPRi51LP88lLya
XQV8nssJYUKYZAVA0EJd7NmnJchbnRc4KQmu6ekEQdP6+Nht8k7U9O2QetgQgcE5
IYhXctoYpr/FnBpV5PmVNAakOt0cZh6mXAtpzjHfdU8lUHZ13zPIpniSXCPd4vUB
u/a3x3l1fP+Gg8d1vpfGCBvNKRBEh5pJsjaObMlLM/qhHupsDi5Ji6y6pcJSgkcv
2nBtRGYDjYIQ0qXx6ILhNuqGFT76i/j2p8YfwMnH4NmYk908RlT0quu7fI8wBO9E
cKd3m9BG8wP67xbOrG/0ckdl3+y/1iW8kPY6SeO03Vvfm6ryqHdZs4oi4CmcX9lP
bFXi5AiYdHm0vqbwQG8P9LerWotgz4yFC9z7yC1KXJDXJxSwVxDFiXvyvxepRi6E
NZxe4RSnDp7sijEvZJa/2EA+rDVDIokfzTLgnRSMkaUuxwNsVjeNsV0b5727kiVC
DwVkxC7NZKG9UBr6WFs9hxRPE0g6xz3EJEBXaWpk2ggBmQxTfBRTjV0Pe3ii7dqQ
z7O3Gv8pojki3ttG4wExLepPHRxTBzjdsoV6/BHZpraYTP11bpQlgx/K7IYJZYa5
Tt9IZ4vNd10=
=mbmH
-----END PGP SIGNATURE-----
Merge tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
- Micro-optimize and standardize the do_syscall_64() calling convention
- Make syscall entry flags clearing more conservative
- Clean up syscall table handling
- Clean up & standardize assembly macros, in preparation of FRED
- Misc cleanups and fixes
* tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Make <asm/asm.h> valid on cross-builds as well
x86/regs: Syscall_get_nr() returns -1 for a non-system call
x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros
x86/syscall: Maximize MSR_SYSCALL_MASK
x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
x86/entry: Reverse arguments to do_syscall_64()
x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>
x86/asm: Use _ASM_BYTES() in <asm/nops.h>
x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence
x86/asm: Have the __ASM_FORM macros handle commas in arguments
- Clean up & simplify AP exception handling setup.
- Consolidate the disjoint IDT setup code living in
idt_setup_traps() and idt_setup_ist_traps() into
a single idt_setup_traps() initialization function
and call it before cpu_init().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a
BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF
u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC
qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY
u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz
otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms
bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE
AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht
eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP
z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra
Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq
5WFho50hNIk=
=BPbv
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 exception handling updates from Ingo Molnar:
- Clean up & simplify AP exception handling setup.
- Consolidate the disjoint IDT setup code living in idt_setup_traps()
and idt_setup_ist_traps() into a single idt_setup_traps()
initialization function and call it before cpu_init().
* tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/idt: Rework IDT setup for boot CPU
x86/cpu: Init AP exception handling from cpu_init_secondary()
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow
the flexible utilization of SMT siblings, without exposing
untrusted domains to information leaks & side channels, plus
to ensure more deterministic computing performance on SMT
systems used by heterogenous workloads.
There's new prctls to set core scheduling groups, which
allows more flexible management of workloads that can share
siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve
'memcache'-like workloads.
- "Age" (decay) average idle time, to better track & improve workloads
such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked
via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable
it at runtime if tooling needs it. Use static keys and
other optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
+U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
UmG7bt94Trk=
=3VDr
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler udpates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogenous workloads.
There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.
- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework assymetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogenous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaxMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hPgw//f9SnGzFoP1uR5TBqM8j/QHulMewew/iD
dM5lh2emdmqHWYPBeRxUHgag38K2Golr3Y+NxLA3R+RMx+OZQe8Mz/wYvPQcBvsV
k1HHImU3GRMn4GM7GwxH3vPIottDUx3mNS2J6pzlw3kwRUVqrxUdj/0/pSY/4eJ7
ZT4uq4yLV83Jd3qioU7o7e/u6MrdNIIcAXRpVDdE9Mm1+kWXSVN7/h3Vsiz4tj5E
iS+UXEtSc1a2mnmekv63pYkJHHNUb6guD8jgI/wrm1KIFGjDRifM+3TV6R/kB96/
TfD2LhCcTShfSp8KI191pgV7/NQbB/PmLdSYmff3rTBiii4cqXuCygJCHInZ09z0
4fTSSqM6aHg7kfTQyOCp+DUQ+9vNVXWo8mxt9c6B8xA0GyCI3zhjQ4UIiSUWRpjs
Be5ZyF0kNNuPxYrKFnGnBf8+51DURpCz3sDdYRuK4KNkj1+4ZvJo/KzGTMUUIE4B
IDQG6wDP5Kb388eRDtKrG5X7IXg+L5F/kezin60j0QF5MwDgxirT217teN8H1lNn
YgWMjRK8Tw0flUJsbCxa51/nl93UtByB+fIRIc88MSeLxcI6/ORW+TxBBEqkYm5Z
6BLFtmHSuAqAXUuyZXSGLcW7XLJvIaDoHgvbDn6l4g7FMWHqPOIq6nJQY3L8ben2
e+fQrGh4noI=
=20Vc
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogenous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix task context PMU for Hetero
perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
perf/x86/intel: Fix fixed counter check warning for some Alder Lake
perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
kprobes: Do not increment probe miss count in the fault handler
x86,kprobes: WARN if kprobes tries to handle a fault
kprobes: Remove kprobe::fault_handler
uprobes: Update uprobe_write_opcode() kernel-doc comment
perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
perf/core: Fix DocBook warnings
perf/core: Make local function perf_pmu_snapshot_aux() static
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
- Core locking & atomics:
- Convert all architectures to ARCH_ATOMIC: move every
architecture to ARCH_ATOMIC, then get rid of ARCH_ATOMIC
and all the transitory facilities and #ifdefs.
Much reduction in complexity from that series:
63 files changed, 756 insertions(+), 4094 deletions(-)
- Self-test enhancements
- Futexes:
- Add the new FUTEX_LOCK_PI2 ABI, which is a variant that
doesn't set FLAGS_CLOCKRT (.e. uses CLOCK_MONOTONIC).
[ The temptation to repurpose FUTEX_LOCK_PI's implicit
setting of FLAGS_CLOCKRT & invert the flag's meaning
to avoid having to introduce a new variant was
resisted successfully. ]
- Enhance futex self-tests
- Lockdep:
- Fix dependency path printouts
- Optimize trace saving
- Broaden & fix wait-context checks
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaEYRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hPdxAAiNCsxL6X1cZ8zqbWsvLefT9Zqhzgs5u6
gdZele7PNibvbYdON26b5RUzuKfOW/hgyX6LKqr+AiNYTT9PGhcY+tycUr2PGk5R
LMyhJWmmX5cUVPU92ky+z5hEHB2gr4XPJcvgpKKUL0XB1tBaSvy2DtgwPuhXOoT1
1sCQfy63t71snt2RfEnibVW6xovwaA2lsqL81lLHJN4iRFWvqO498/m4+PWkylsm
ig/+VT1Oz7t4wqu3NhTqNNZv+4K4W2asniyo53Dg2BnRm/NjhJtgg4jRibrb0ssb
67Xdq6y8+xNBmEAKj+Re8VpMcu4aj346Ctk7d4gst2ah/Rc0TvqfH6mezH7oq7RL
hmOrMBWtwQfKhEE/fDkng30nrVxc/98YXP0n2rCCa0ySsaF6b6T185mTcYDRDxFs
BVNS58ub+zxrF9Zd4nhIHKaEHiL2ZdDimqAicXN0RpywjIzTQ/y11uU7I1WBsKkq
WkPYs+FPHnX7aBv1MsuxHhb8sUXjG924K4JeqnjF45jC3sC1crX+N0jv4wHw+89V
h4k20s2Tw6m5XGXlgGwMJh0PCcD6X22Vd9Uyw8zb+IJfvNTGR9Rp1Ec+1gMRSll+
xsn6G6Uy9bcNU0SqKlBSfelweGKn4ZxbEPn76Jc8KWLiepuZ6vv5PBoOuaujWht9
KAeOC5XdjMk=
=tH//
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Core locking & atomics:
- Convert all architectures to ARCH_ATOMIC: move every architecture
to ARCH_ATOMIC, then get rid of ARCH_ATOMIC and all the
transitory facilities and #ifdefs.
Much reduction in complexity from that series:
63 files changed, 756 insertions(+), 4094 deletions(-)
- Self-test enhancements
- Futexes:
- Add the new FUTEX_LOCK_PI2 ABI, which is a variant that doesn't
set FLAGS_CLOCKRT (.e. uses CLOCK_MONOTONIC).
[ The temptation to repurpose FUTEX_LOCK_PI's implicit setting of
FLAGS_CLOCKRT & invert the flag's meaning to avoid having to
introduce a new variant was resisted successfully. ]
- Enhance futex self-tests
- Lockdep:
- Fix dependency path printouts
- Optimize trace saving
- Broaden & fix wait-context checks
- Misc cleanups and fixes.
* tag 'locking-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
locking/lockdep: Correct the description error for check_redundant()
futex: Provide FUTEX_LOCK_PI2 to support clock selection
futex: Prepare futex_lock_pi() for runtime clock selection
lockdep/selftest: Remove wait-type RCU_CALLBACK tests
lockdep/selftests: Fix selftests vs PROVE_RAW_LOCK_NESTING
lockdep: Fix wait-type for empty stack
locking/selftests: Add a selftest for check_irq_usage()
lockding/lockdep: Avoid to find wrong lock dep path in check_irq_usage()
locking/lockdep: Remove the unnecessary trace saving
locking/lockdep: Fix the dep path printing for backwards BFS
selftests: futex: Add futex compare requeue test
selftests: futex: Add futex wait test
seqlock: Remove trailing semicolon in macros
locking/lockdep: Reduce LOCKDEP dependency list
locking/lockdep,doc: Improve readability of the block matrix
locking/atomics: atomic-instrumented: simplify ifdeffery
locking/atomic: delete !ARCH_ATOMIC remnants
locking/atomic: xtensa: move to ARCH_ATOMIC
locking/atomic: sparc: move to ARCH_ATOMIC
locking/atomic: sh: move to ARCH_ATOMIC
...
kernel tooling such as kpatch-build.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZYv4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ipeBAAhJPS/kCQ17Y5zGyMB0/6yfCWIifODoS7
9J+6/mqKHPDdV07yzPtOXuTTmpKV4OHPi8Yj8kaXs5L5fOmQ1uAwITwZNF5hU0a5
CiFIsubUCJmglf9b6L9EH5pBEQ72Cq4u8zIhJ9LmZ4t625AHJAm2ikZgascc4U67
RvVoGr5sYTo0YEsc1IDM1wUtnUhXBNjS1VwkXNnCFFTXYHju47MeY1sPHq2hvkzO
iJGC9A+hxfM1eQt9/qC/2L/6F/XECN61gcR9Get8TkWeEGHmPG+FthmPLd4oO9Ho
03J4JfMbmXumWosAeilYBNUkfii/M5Em78Wpv/cB94iSt67rq7Eb+8gm4D5svmfN
l+utsPY/HYB+uWV0hy2cV/ORRiwcJnon54dEWL6912YkKz+OIb3DK/7l9ex5lW+D
r3o8NP0s6S+RgUkOFxz5VaYK1giu6fiaFysWdKeflvwlvY/64owMepQ1QfPBbeB7
3DTzvuYZ4Cb1x/vR6WBbFqGcuJKZ1CsZIBLCblveUs+G0wlu147K5E1qlXg/Wvq7
5Vzznc4fmRng8np5hxAw8ieLkatWg7szyryUV/4H2Ubs/jWGcH628ZYbapaCb7EM
Eson65xzbVfhnz16z8sN13XIF1lGe8sb0+qiFSclEfyDUnZDuhwMn6d9Ubqxrg5J
uTULEzmY/rI=
=MvPd
-----END PGP SIGNATURE-----
mergetag object d33b9035e1
type commit
tag objtool-core-2021-06-28
tagger Ingo Molnar <mingo@kernel.org> 1624859477 +0200
The biggest change in this cycle is the new code to handle
and rewrite variable sized jump labels - which results in
slightly tighter code generation in hot paths, through the
use of short(er) NOPs.
Also a number of cleanups and fixes, and a change to the
generic include/linux/compiler.h to handle a s390 GCC quirk.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZZGcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1goYg/7BxUIJXP0F5wbrMbAvJIDRgR/j3TA+ztk
uNU1yabBGluMxCqJ87HadJ+A5d010G+GRUn/birVr7w1UuwWv8HOda78dnyG7tme
xm78/1FlOnstuOTQxhK6rjbb2cp+QOmdsAQkq1TF4SOxArBQiwtjiOvytHjb5yNx
7LrlbtuZ7Dtc0qd2evkG4ma4QkGoDhBS1dRogrItc27ZLuFIQoNnEd2K2QNMgczw
a/Jx8fgNmdoJSq+vkBn9TnS/cJYUW/PAlPNtO3ac8yE857aDIVnjXFRzveAP/nTh
rwFD6aCGnJAqyqP7A8ElNjySos5O+ebYApxe7rEx0TNLbrc55qSP9lpdIO+vgytV
Xzy4O7z6o+lailQ4EoF8Qf+rlPeue0kLF23SsNbZY1uT0vjX1Uv70xgKbkuyPygp
GNXAy6dOXK0AfaZYL/Wa50yVnJnkYDjes/hHr+HEam5Oad566pqIyQNP8yWSPqaf
KHkL//1pb5C2RKwot4IYv/ftHfZB5QftoFq6bhGBc1GXUd/FiqivvGHPW/6g7rxi
ZIrXs+Fqm/5KP9mssNONfyz5XEvbcUTD1CbeqX9eyVbiYZbLp1oWSgtogiRW9ya+
HR7t0Dt/UFzFWbilb6EZff/Hdr1NZBZLdrfpvVDoMf5tR9J0BIOyjddTu89g/FIO
KcfJ5yyjJBU=
=+HAB
-----END PGP SIGNATURE-----
Merge tags 'objtool-urgent-2021-06-28' and 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix and updates from Ingo Molnar:
"An ELF format fix for a section flags mismatch bug that breaks kernel
tooling such as kpatch-build.
The biggest change in this cycle is the new code to handle and rewrite
variable sized jump labels - which results in slightly tighter code
generation in hot paths, through the use of short(er) NOPs.
Also a number of cleanups and fixes, and a change to the generic
include/linux/compiler.h to handle a s390 GCC quirk"
* tag 'objtool-urgent-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Don't make .altinstructions writable
* tag 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Improve reloc hash size guestimate
instrumentation.h: Avoid using inline asm operand modifiers
compiler.h: Avoid using inline asm operand modifiers
kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
objtool: Reflow handle_jump_alt()
jump_label/x86: Remove unused JUMP_LABEL_NOP_SIZE
jump_label, x86: Allow short NOPs
objtool: Provide stats for jump_labels
objtool: Rewrite jump_label instructions
objtool: Decode jump_entry::key addend
jump_label, x86: Emit short JMP
jump_label: Free jump_entry::key bit1 for build use
jump_label, x86: Add variable length patching support
jump_label, x86: Introduce jump_entry_size()
jump_label, x86: Improve error when we fail expected text
jump_label, x86: Factor out the __jump_table generation
jump_label, x86: Strip ASM jump_label support
x86, objtool: Dont exclude arch/x86/realmode/
objtool: Rewrite hashtable sizing
on code executed in the guest and handle the case where failure to
get the RIP would result in a #GP, as it should, instead of in a #PF
- Disable interrupts while the per-CPU GHCB is held
- Split the #VC handler depending on where the #VC exception has
happened and therefore provide for precise context tracking like the
rest of the exception handlers deal with noinstr regions now
- Add defines for the GHCB version 2 protocol so that further shared
development with KVM can happen without merge conflicts
- The usual small cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZij8ACgkQEsHwGGHe
VUpwIQ/8CzFbGm2k2RdmO0H/VPwfF1HFSWpM9YFGSs++yOqfiyCFbyIcTcRbK4IO
+BUIRoHSgCWPb+5pJli1Wf0J/sIdYr9D4MDWt1oRQG6e/4NE2SL3EOnYJWW5VtOT
u1AVk01ooPOFDKIoh4OIZ7tCKAeNWBv+oe5dmP46spiEZbHHCzHIEaBuOQRzvX9C
jSKulDHjA4iaNl/BQMF7dJL1+aPWj2NXjSj86fhMAa+m5MspDXbIaM5wMZfPzc1k
Rj/m89JThp+mFwik46o/7g/5Q8SYtTE+Hqi1TX/65/dbyizLqbH5W3g0zwrD8TYf
B7kHguqkoE1j1avLwOYK1yJB8ZTjtf+OXjUAR4UPzxkG7Xhelu5Qb7RD/WCJ3YqO
KEFIFq+hsiAqvb6RkmX0aVecIJ49aqGX+onsMpLWq9pz2R4BRcH7jo81TIBcosg5
2Kfx2aPcMec7u7RMBHqwiaC4Adp7/vmHhukawfI8xCWLd7wEjvAMP3eeePxR+C0l
SSnn0O9COj8pctvq4eOGJAUXzPa4YtsaX+kILBs+hUdQXmQGVSxyTpakyhhUpGQ8
YyblbHybS8JeYdGqPVS/tn0Rc2DqOSQJetjmXAGhlkEkkGY8i1Ddwe0MaamJozol
g/wHNYcok/OQWglvVThv6EAY2pTSeWelmjUkZi1dnkYNH1VUxxE=
=iyX+
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Differentiate the type of exception the #VC handler raises depending
on code executed in the guest and handle the case where failure to
get the RIP would result in a #GP, as it should, instead of in a #PF
- Disable interrupts while the per-CPU GHCB is held
- Split the #VC handler depending on where the #VC exception has
happened and therefore provide for precise context tracking like the
rest of the exception handlers deal with noinstr regions now
- Add defines for the GHCB version 2 protocol so that further shared
development with KVM can happen without merge conflicts
- The usual small cleanups
* tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Use "SEV: " prefix for messages from sev.c
x86/sev: Add defines for GHCB version 2 MSR protocol requests
x86/sev: Split up runtime #VC handler for correct state tracking
x86/sev: Make sure IRQs are disabled while GHCB is active
x86/sev: Propagate #GP if getting linear instruction address failed
x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]()
x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip()
x86/sev: Fix error message in runtime #VC handler
- Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too
- Use the special RAPL CPUID bit to detect the functionality on AMD and
Hygon instead of doing family matching.
- Add support for new Intel microcode deprecating TSX on some models and
do not enable kernel workarounds for those CPUs when TSX transactions
always abort, as a result of that microcode update.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZhzEACgkQEsHwGGHe
VUo5ow//eRwlb1OL/D3jzLT4nTYX8+XdufaJF1HBr1Cf3mdNkiEgyu2bvsXNTpN/
ZP7CFCHibgYeHJ7qTTkhoK1DCe4YHjj450oCgg7pv40Mv9E29Rpszie8y8e/ngkc
g9OiAeEd4A32v8bRMAOOX0UZN4afismXBW0k4iwOAguNFiZ/usrrVYTZpJe3wG65
/YM9FdDZ+Mt7BavJdVyGh03PpzoSMrKyEQ673CHhERQyy5oEublrDSmtt5hQJv1W
4tgNOWpw57Gi7Vs7UYd7VvBQKwQZKeQeHJWu1TXUB6pw0lKYvULH6m0dasvc6cGb
WtCBvbQU9MRP0LvdvYOdgmSgn400z7mEwlUWmAFJLIUlDsuRpZmVQ4C1/OUnOSdx
amb7I3bp1z6Rqjs9ADW5h87qDA+q5OmbIZeIDvuRypQOB3yEktAEdUvWb65b1Fgm
9CpzebxyaOUM9YRxDzDd2joZYKnfI3stF6UCrVXaZwYei+Jmzn5gc8ZOoOX9g6gO
eX/sLW2RWRx6XxilaWZijOHJTjokVUpEnD12aGtKO6ou5QbFTwldc2Metpua42cL
5p8wRxEYeKT/EE/GKy/qIEp624QaInSEmfyq8RFKU4em7GSaSUmoQF5151LfnoRY
ARHkEdz+T8s5RI5xSvUZLRMNYjig9tZas3blYfbJHnU7V2+bspQ=
=wW+k
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- New AMD models support
- Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too
- Use the special RAPL CPUID bit to detect the functionality on AMD and
Hygon instead of doing family matching.
- Add support for new Intel microcode deprecating TSX on some models
and do not enable kernel workarounds for those CPUs when TSX
transactions always abort, as a result of that microcode update.
* tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Clear CPUID bits when TSX always force aborts
x86/events/intel: Do not deploy TSX force abort workaround when TSX is deprecated
x86/msr: Define new bits in TSX_FORCE_ABORT MSR
perf/x86/rapl: Use CPUID bit on AMD and Hygon parts
x86/cstate: Allow ACPI C1 FFH MWAIT use on Hygon systems
x86/amd_nb: Add AMD family 19h model 50h PCI ids
x86/cpu: Fix core name for Sapphire Rapids
that the kernel properly attempts to offline the corresponding page, as
it does for kernel-detected correctable errors.
- Fix a typo in AMD's error descriptions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZg1cACgkQEsHwGGHe
VUpW8BAAlnwit5Vg4UVocY7mwTi0GvP36Fz2u81kppMROpWgQhhmX35ZxoxgQoSC
0ojKnOJTgGpOdKknmK/vom4ysxNRZxjz0zat9n+cqcfqVwP14KzhjaX1FPXnEQfE
mPkn3v8fsML87glPTzmpELYSOZTpu6OYdiFZAzKL8Gp8aytyh4FamTV2eTxn5ClG
+dejrN0NFiSALarliNttPnpfC5JvQ0KUJFxapYaMd27ssqL/2XMvJmBSpGC+OaZg
lvvv7XuRrIPRZ7lU3Zipz7Rv5r8tTfPUMr33DcUuAZxpXW3zRpds153HktTYSqsv
pZHTTLZ73GbAFVlkjqP6wcAtW2ygKW3lxsPuBSR8aIj8yU7rrrkG4wm2XsvCtrXP
4KrTZLgqGHFQaXbp1BzJzrnLyb6dxZXkEaAnX/7ZygDz+L5aMlG/7XEk4c/R9YbS
bg6NO/Dh1E547cf+bN6/yYNwPjNaO1lGOMU9N2IwjCiHFERzTsFGyFNjqMSGa7Ul
34FZAB11aklqbj+0amu5IeMd8vM3unqTGnYEQCcyG09mdsa9/bjEvEgCirq5FXf3
szjUmGpdtAsxCRZ7SzhsDu1IMT0F2D8hwgJbFSLXmtpiq5WB/EHaYbiqg8F6V36J
bENGE3rLj3HkgWHsLpgEMX2OXh7Pzo3UqwwbtOuYEiwwhvh7CZk=
=1Azq
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Add the required information to the faked APEI-reported mem error so
that the kernel properly attempts to offline the corresponding page,
as it does for kernel-detected correctable errors.
- Fix a typo in AMD's error descriptions.
* tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Fix typo "FIfo" -> "Fifo"
x86/mce: Include a MCi_MISC value in faked mce logs
x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
- prevent unprivileged userspace from reinitializing supervisor states
- Prepare init_fpstate, which is the buffer used when initializing
FPU state, properly in case the skip-writing-state-components XSAVE*
variants are used.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDVltYACgkQEsHwGGHe
VUqY9Q//c4MhJP2E15cqTWupxYk41k0UMjqPIwGmt6hRoDNKeFQm0xSgeOwe2mgk
bbzGDJOfAi2Hxza2fw6No4wIiaB3sZIqK451aI1SM9HTDB/B/dMGBPXAp9qRlnbT
kU/rDqQVqi7wlwunSunFoSLTwmQw0Lispmzwz9yirdQ+jVsnuTLWtPbUZM8RL/j8
XAhVwhDNc+Wuw0OBvRsyP5Mp6k9+2ic6z2ObIgSfgp4GeDG2F/+ZQ5W5ZeHVGQda
5QqKIdWCmAinzdz3N0iksthT3RJwLmYZ0K/qvLMrYNCvZiuUBdgrUn1Yrjo1c3lx
W+SUMtgehlylfyBbyGn5zBbJtZJtflx+kYLHLzw58lWC+ekRfxqx2F+e7S4facXr
Xn9IpnIAhru1/SAItSvScxXzjVW4DwZKO3tLr+/KsrRsTnS15pD6rx6OK88HHP/y
ofjCeS0P8STb7/Gzzqj7c+7bJvSZo/h7jmF+H2y5tRhUXZogSoh1z/QGYpvcFrwP
GOZeACREBv+D1PQNp/DN/ZiZHg6+csEg+3abtRaZSbdnfsCSpU/imXcX9GPco5vu
XS+Gxle2aqvRmQNuJEbNr7YDfocZWWXmXnkPSKCtvqSgNdxjFjZ2v3TRTAgvHEoS
Otpsv5Hk9g0FCep4oHG3zv8cb+Nk7Ycl2ZLZXQwE2Egane6U4K8=
=uqQE
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"Two more urgent FPU fixes:
- prevent unprivileged userspace from reinitializing supervisor
states
- prepare init_fpstate, which is the buffer used when initializing
FPU state, properly in case the skip-writing-state-components
XSAVE* variants are used"
* tag 'x86_urgent_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Make init_fpstate correct with optimized XSAVE
x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate()
This better reflects the purpose of this variable on AMD, since
on AMD the AVIC's memory slot can be enabled and disabled dynamically.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210623113002.111448-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a fallback mechanism to the in-kernel instruction emulator that
allows userspace the opportunity to process an instruction the emulator
was unable to. When the in-kernel instruction emulator fails to process
an instruction it will either inject a #UD into the guest or exit to
userspace with exit reason KVM_INTERNAL_ERROR. This is because it does
not know how to proceed in an appropriate manner. This feature lets
userspace get involved to see if it can figure out a better path
forward.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the pre-computed last_nonleaf_level, which is arguably wrong and at
best confusing. Per the comment:
Can have large pages at levels 2..last_nonleaf_level-1.
the intent of the variable would appear to be to track what levels can
_legally_ have large pages, but that intent doesn't align with reality.
The computed value will be wrong for 5-level paging, or if 1gb pages are
not supported.
The flawed code is not a problem in practice, because except for 32-bit
PSE paging, bit 7 is reserved if large pages aren't supported at the
level. Take advantage of this invariant and simply omit the level magic
math for 64-bit page tables (including PAE).
For 32-bit paging (non-PAE), the adjustments are needed purely because
bit 7 is ignored if PSE=0. Retain that logic as is, but make
is_last_gpte() unique per PTTYPE so that the PSE check is avoided for
PAE and EPT paging. In the spirit of avoiding branches, bump the "last
nonleaf level" for 32-bit PSE paging by adding the PSE bit itself.
Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite
FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are
not PTEs and will blow up all the other gpte helpers.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-51-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expand the comments for the MMU roles. The interactions with gfn_track
PGD reuse in particular are hairy.
Regarding PGD reuse, add comments in the nested virtualization flows to
call out why kvm_init_mmu() is unconditionally called even when nested
TDP is used.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-50-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_mmu.nx as there no consumers left.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-39-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
<reg>_<bit> for all CR0, CR4, and EFER bits that included in the role.
Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
not the NX bit in the corresponding PTE.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop MAXPHYADDR from mmu_role now that all MMUs have their role
invalidated after a CPUID update. Invalidating the role forces all MMUs
to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can
only be changed only through a CPUID update.
This reverts commit de3ccd26fa.
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
instability. Initialize last_vmentry_cpu to -1 and use it to detect if
the vCPU has been run at least once when its CPUID model is changed.
KVM does not correctly handle changes to paging related settings in the
guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM
could theoretically zap all shadow pages, but actually making that happen
is a mess due to lock inversion (vcpu->mutex is held). And even then,
updating paging settings on the fly would only work if all vCPUs are
stopped, updated in concert with identical settings, then restarted.
To support running vCPUs with different vCPU models (that affect paging),
KVM would need to track all relevant information in kvm_mmu_page_role.
Note, that's the _page_ role, not the full mmu_role. Updating mmu_role
isn't sufficient as a vCPU can reuse a shadow page translation that was
created by a vCPU with different settings and thus completely skip the
reserved bit checks (that are tied to CPUID).
Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
it would require doubling gfn_track from a u16 to a u32, i.e. would
increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
would all need to be tracked.
In practice, there is no remotely sane use case for changing any paging
related CPUID entries on the fly, so just sweep it under the rug (after
yelling at userspace).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invalidate all MMUs' roles after a CPUID update to force reinitizliation
of the MMU context/helpers. Despite the efforts of commit de3ccd26fa
("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
there are still a handful of CPUID-based properties that affect MMU
behavior but are not incorporated into mmu_role. E.g. 1gb hugepage
support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
factor into the guest's reserved PTE bits.
The obvious alternative would be to add all such properties to mmu_role,
but doing so provides no benefit over simply forcing a reinitialization
on every CPUID update, as setting guest CPUID is a rare operation.
Note, reinitializing all MMUs after a CPUID update does not fix all of
KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID
properties, which means that a vCPU can reuse shadow pages that should
not exist for the new vCPU model, e.g. that map GPAs that are now illegal
(due to MAXPHYADDR changes) or that set bits that are now reserved
(PAGE_SIZE for 1gb pages), etc...
Tracking the relevant CPUID properties in kvm_mmu_page_role would address
the majority of problems, but fully tracking that much state in the
shadow page role comes with an unpalatable cost as it would require a
non-trivial increase in KVM's memory footprint. The GBPAGES case is even
worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
support in the hardware page walker, i.e. it's a virtualization hole that
can't be closed when using TDP.
In other words, resetting the MMU after a CPUID update is largely a
superficial fix. But, it will allow reverting the tracking of MAXPHYADDR
in the mmu_role, and that case in particular needs to mostly work because
KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
is supported. For cases where KVM botches guest behavior, the damage is
limited to that guest. But for the shadow_root_level, a misconfigured
MMU can cause KVM to incorrectly access memory, e.g. due to walking off
the end of its shadow page tables.
Fixes: 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in
mmu_role.base.level because that tracks the shadow root level, i.e. TDP
level. Normally, this is not an issue because LA57 can't be toggled
while long mode is active, i.e. the guest has to first disable paging,
then toggle LA57, then re-enable paging, thus ensuring an MMU
reinitialization.
But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
without having to bounce through an unpaged section. L1 can also load a
new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a
single entry PML5 is sufficient. Such shenanigans are only problematic
if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.
Note, in the L2 case with nested TDP, even though L1 can switch between
L2s with different LA57 settings, thus bypassing the paging requirement,
in that case KVM's nested_mmu will track LA57 in base.level.
This reverts commit 8053f924ca.
Fixes: 8053f924ca ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Generic KVM stats are those collected in architecture independent code
or those supported by all architectures; put all generic statistics in
a separate structure. This ensures that they are defined the same way
in the statistics API which is being added, removing duplication among
different architectures in the declaration of the descriptors.
No functional change intended.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210618222709.1858088-2-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When *RSTOR from user memory raises an exception, there is no way to
differentiate them. That's bad because it forces the slow path even when
the failure was not a fault. If the operation raised eg. #GP then going
through the slow path is pointless.
Use _ASM_EXTABLE_FAULT() which stores the trap number and let the exception
fixup return the negated trap number as error.
This allows to separate the fast path and let it handle faults directly and
avoid the slow path for all other exceptions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.601480369@linutronix.de
The PKRU value of a task is stored in task->thread.pkru when the task is
scheduled out. PKRU is restored on schedule in from there. So keeping the
XSAVE buffer up to date is a pointless exercise.
Remove the xstate fiddling and cleanup all related functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de
PKRU is already updated and the xstate is not longer the proper source
of information.
[ bp: Use cpu_feature_enabled() ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.708180184@linutronix.de
As the PKRU state is managed separately restoring it from the xstate
buffer would be counterproductive as it might either restore a stale
value or reinit the PKRU state to 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.606745195@linutronix.de
One nice thing about having PKRU be XSAVE-managed is that it gets naturally
exposed into the XSAVE-using ABIs. Now that XSAVE will not be used to
manage PKRU, these ABIs need to be manually enabled to deal with PKRU.
ptrace() uses copy_uabi_xstate_to_kernel() to collect the tracee's
XSTATE. As PKRU is not in the task's XSTATE buffer, use task->thread.pkru
for filling in up the ptrace buffer.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.508770763@linutronix.de
PKRU is currently partly XSAVE-managed and partly not. It has space
in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR.
However, it is switched more eagerly than FPU because there may be a
need for PKRU to be up-to-date for things like copy_to/from_user() since
PKRU affects user-permission memory accesses, not just accesses from
userspace itself.
This leaves PKRU in a very odd position. XSAVE brings very little value
to the table for how Linux uses PKRU except for signal related XSTATE
handling.
Prepare to move PKRU away from being XSAVE-managed. Allocate space in
the thread_struct for it and save/restore it in the context-switch path
separately from the XSAVE-managed features. task->thread_struct.pkru
is only valid when the task is scheduled out. For the current task the
authoritative source is the hardware, i.e. it has to be retrieved via
rdpkru().
Leave the XSAVE code in place for now to ensure bisectability.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
switch_to() and flush_thread() write the task's PKRU value eagerly so
the PKRU value of current is always valid in the hardware.
That means there is no point in restoring PKRU on exit to user or when
reactivating the task's FPU registers in the signal frame setup path.
This allows to remove all the xstate buffer updates with PKRU values once
the PKRU state is stored in thread struct while a task is scheduled out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.303919033@linutronix.de
Rename it so it's clear that this is about user ABI features which can
differ from the feature set which the kernel saves and restores because the
kernel handles e.g. PKRU differently. But the user ABI (ptrace, signal
frame) expects it to be there.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.211585137@linutronix.de
copy_kernel_to_fpregs() restores all xfeatures but it is also the place
where the AMD FXSAVE_LEAK bug is handled.
That prevents fpregs_restore_userregs() to limit the restored features,
which is required to untangle PKRU and XSTATE handling and also for the
upcoming supervisor state management.
Move the FXSAVE_LEAK quirk into __copy_kernel_to_fpregs() and deinline that
function which has become rather fat.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.114271278@linutronix.de
Rename it so that it becomes entirely clear what this function is
about. It's purpose is to restore the FPU registers to the state which was
saved in the task's FPU memory state either at context switch or by an in
kernel FPU user.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.018867925@linutronix.de
There is no point in using copy_init_pkru_to_fpregs() which in turn calls
write_pkru(). write_pkru() tries to fiddle with the task's xstate buffer
for nothing because the XRSTOR[S](init_fpstate) just cleared the xfeature
flag in the xstate header which makes get_xsave_addr() fail.
It's a useless exercise anyway because the reinitialization activates the
FPU so before the task's xstate buffer can be used again a XRSTOR[S] must
happen which in turn dumps the PKRU value.
Get rid of the now unused copy_init_pkru_to_fpregs().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.732508792@linutronix.de
Provide a simple and trivial helper which just writes the PKRU default
value without trying to fiddle with the task's xsave buffer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.513729794@linutronix.de
When CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is disabled then the following
code fails to compile:
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
u32 pkru = READ_ONCE(init_pkru_value);
..
}
because init_pkru_value is defined as '0' which makes READ_ONCE() upset.
Provide an accessor macro to avoid #ifdeffery all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.404880646@linutronix.de
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is
set. Secondary CPUs have to enable CR4.PKE as well and set their per CPU
feature flag. That's ineffective because all call sites have checks for
boot_cpu_data.
Make it smarter and force the feature flag when PKU is enabled on the boot
cpu which allows then to use cpu_feature_enabled(X86_FEATURE_OSPKE) all
over the place. That either compiles the code out when PKEY support is
disabled in Kconfig or uses a static_cpu_has() for the feature check which
makes a significant difference in hotpaths, e.g. context switch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de
Both function names are a misnomer.
fpu__save() is actually about synchronizing the hardware register state
into the task's memory state so that either coredump or a math exception
handler can inspect the state at the time where the problem happens.
The function guarantees to preserve the register state, while "save" is a
common terminology for saving the current state so it can be modified and
restored later. This is clearly not the case here.
Rename it to fpu_sync_fpstate().
fpu__copy() is used to clone the current task's FPU state when duplicating
task_struct. While the register state is a copy the rest of the FPU state
is not.
Name it accordingly and remove the really pointless @src argument along
with the warning which comes along with it.
Nothing can ever copy the FPU state of a non-current task. It's clearly
just a consequence of arch_dup_task_struct(), but it makes no sense to
proliferate that further.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.196727450@linutronix.de
write_pkru() was originally used just to write to the PKRU register. It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.
But, later work included a requirement to also modify the task XSAVE
buffer when updating the register. This really is more related to the
XSAVE architecture than to paging.
Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
The copy functions for the independent features are horribly named and the
supervisor and independent part is just overengineered.
The point is that the supplied mask has either to be a subset of the
independent features or a subset of the task->fpu.xstate managed features.
Rewrite it so it checks for invalid overlaps of these areas in the caller
supplied feature mask. Rename it so it follows the new naming convention
for these operations. Mop up the function documentation.
This allows to use that function for other purposes as well.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20210623121455.004880675@linutronix.de
The salient feature of "dynamic" XSTATEs is that they are not part of the
main task XSTATE buffer. The fact that they are dynamically allocated is
irrelevant and will become quite confusing when user math XSTATEs start
being dynamically allocated. Rename them to "independent" because they
are independent of the main XSTATE code.
This is just a search-and-replace with some whitespace updates to keep
things aligned.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/1eecb0e4f3e07828ebe5d737ec77dc3b708fad2d.1623388344.git.luto@kernel.org
Link: https://lkml.kernel.org/r/20210623121454.911450390@linutronix.de
This is not a copy functionality. It restores the register state from the
supplied kernel buffer.
No functional changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de
The FNSAVE support requires conditionals in quite some call paths because
FNSAVE reinitializes the FPU hardware. If the save has to preserve the FPU
register state then the caller has to conditionally restore it from memory
when FNSAVE is in use.
This also requires a conditional in context switch because the restore
avoidance optimization cannot work with FNSAVE. As this only affects 20+
years old CPUs there is really no reason to keep this optimization
effective for FNSAVE. It's about time to not optimize for antiques anymore.
Just unconditionally FRSTOR the save content to the registers and clean up
the conditionals all over the place.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.617369268@linutronix.de
A copy is guaranteed to leave the source intact, which is not the case when
FNSAVE is used as that reinitilizes the registers.
Save does not make such guarantees and it matches what this is about,
i.e. to save the state for a later restore.
Rename it to save_fpregs_to_fpstate().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
Rename them to reflect that these functions deal with user space format
XSAVE buffers.
copy_kernel_to_xstate() -> copy_uabi_from_kernel_to_xstate()
copy_user_to_xstate() -> copy_sigframe_from_user_to_xstate()
Again a clear statement that these functions deal with user space ABI.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.318485015@linutronix.de
The function names for fnsave/fnrstor operations are horribly named and
a permanent source of confusion.
Rename:
copy_kernel_to_fregs() to frstor()
copy_fregs_to_user() to fnsave_to_user_sigframe()
copy_user_to_fregs() to frstor_from_user_sigframe()
so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.223594101@linutronix.de
The function names for fxsave/fxrstor operations are horribly named and
a permanent source of confusion.
Rename:
copy_fxregs_to_kernel() to fxsave()
copy_kernel_to_fxregs() to fxrstor()
copy_fxregs_to_user() to fxsave_to_user_sigframe()
copy_user_to_fxregs() to fxrstor_from_user_sigframe()
so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.
While at it, replace the static_cpu_has(X86_FEATURE_FXSR) with
use_fxsr() to be consistent with the rest of the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.017863494@linutronix.de
The function names for xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.
Rename:
copy_xregs_to_user() to xsave_to_user_sigframe()
copy_user_to_xregs() to xrstor_from_user_sigframe()
so it's entirely clear what this is about. This is also a clear indicator
of the potentially different storage format because this is user ABI and
cannot use compacted format.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.924266705@linutronix.de
The function names for xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.
Rename:
copy_xregs_to_kernel() to os_xsave()
copy_kernel_to_xregs() to os_xrstor()
These are truly low level wrappers around the actual instructions
XSAVE[OPT]/XRSTOR and XSAVES/XRSTORS with the twist that the selection
based on the available CPU features happens with an alternative to avoid
conditionals all over the place and to provide the best performance for hot
paths.
The os_ prefix tells that this is the OS selected mechanism.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.830239347@linutronix.de
If the fast path of restoring the FPU state on sigreturn fails or is not
taken and the current task's FPU is active then the FPU has to be
deactivated for the slow path to allow a safe update of the tasks FPU
memory state.
With supervisor states enabled, this requires to save the supervisor state
in the memory state first. Supervisor states require XSAVES so saving only
the supervisor state requires to reshuffle the memory buffer because XSAVES
uses the compacted format and therefore stores the supervisor states at the
beginning of the memory state. That's just an overengineered optimization.
Get rid of it and save the full state for this case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.734561971@linutronix.de
This function is pointlessly global and a complete misnomer because it's
usage is related to both supervisor state checks and compacted format
checks. Remove it and just make the conditions check the XSAVES feature.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.425493349@linutronix.de
The only usecase for fpu__write_begin is the set() callback of regset, so
the function is pointlessly global.
Move it to the regset code and rename it to fpu_force_restore() which is
exactly decribing what the function does.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.328652975@linutronix.de
The function can only be used from the regset get() callbacks safely. So
there is no reason to have it globally exposed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.234942936@linutronix.de
When xsave with init state optimization is used then a component's state
in the task's xsave buffer can be stale when the corresponding feature bit
is not set.
fpregs_get() and xfpregs_get() invoke fpstate_sanitize_xstate() to update
the task's xsave buffer before retrieving the FX or FP state. That's just
duplicated code as copy_xstate_to_kernel() already handles this correctly.
Add a copy mode argument to the function which allows to restrict the state
copy to the FP and SSE features.
Also rename the function to copy_xstate_to_uabi_buf() so the name reflects
what it is doing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.805327286@linutronix.de
xstateregs_set() operates on a stopped task and tries to copy the provided
buffer into the task's fpu.state.xsave buffer.
Any error while copying or invalid state detected after copying results in
wiping the target task's FPU state completely including supervisor states.
That's just wrong. The caller supplied invalid data or has a problem with
unmapped memory, so there is absolutely no justification to corrupt the
target state.
Fix this with the following modifications:
1) If data has to be copied from userspace, allocate a buffer and copy from
user first.
2) Use copy_kernel_to_xstate() unconditionally so that header checking
works correctly.
3) Return on error without corrupting the target state.
This prevents corrupting states and lets the caller deal with the problem
it caused in the first place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.214903673@linutronix.de
They are only used in fpstate_init() and there is no point to have them in
a header just to make reading the code harder.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.023118522@linutronix.de
This function is really not doing what the comment advertises:
"Find supported xfeatures based on cpu features and command-line input.
This must be called after fpu__init_parse_early_param() is called and
xfeatures_mask is enumerated."
fpu__init_parse_early_param() does not exist anymore and the function just
returns a constant.
Remove it and fix the caller and get rid of further references to
fpu__init_parse_early_param().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.816404717@linutronix.de
Pick up dependent changes which either went mainline (x86/urgent is
based on -rc7 and that contains them) as urgent fixes and the current
x86/urgent branch which contains two more urgent fixes, so that the
bigger FPU rework can base off ontop.
Signed-off-by: Borislav Petkov <bp@suse.de>
Add the necessary defines for supporting the GHCB version 2 protocol.
This includes defines for:
- MSR-based AP hlt request/response
- Hypervisor Feature request/response
This is the bare minimum of requests that need to be supported by a GHCB
version 2 implementation. There are more requests in the specification,
but those depend on Secure Nested Paging support being available.
These defines are shared between SEV host and guest support.
[ bp: Fold in https://lkml.kernel.org/r/20210622144825.27588-2-joro@8bytes.org too.
Simplify the brewing macro maze into readability. ]
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/YNLXQIZ5e1wjkshG@8bytes.org
Fix:
vmlinux.o: warning: objtool: handle_bug()+0x10: call to task_size_max() leaves .noinstr.text section
When #UD isn't a BUG, we shouldn't violate noinstr (we'll still
probably die, but that's another story).
Fixes: 025768a966 ("x86/cpu: Use alternative to generate the TASK_SIZE_MAX constant")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210621120120.682468274@infradead.org
The XSAVE init code initializes all enabled and supported components with
XRSTOR(S) to init state. Then it XSAVEs the state of the components back
into init_fpstate which is used in several places to fill in the init state
of components.
This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
those use the init optimization and skip writing state of components which
are in init state. So init_fpstate.xsave still contains all zeroes after
this operation.
There are two ways to solve that:
1) Use XSAVE unconditionally, but that requires to reshuffle the buffer when
XSAVES is enabled because XSAVES uses compacted format.
2) Save the components which are known to have a non-zero init state by other
means.
Looking deeper, #2 is the right thing to do because all components the
kernel supports have all-zeroes init state except the legacy features (FP,
SSE). Those cannot be hard coded because the states are not identical on all
CPUs, but they can be saved with FXSAVE which avoids all conditionals.
Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
a BUILD_BUG_ON() which reminds developers to validate that a newly added
component has all zeroes init state. As a bonus remove the now unused
copy_xregs_to_kernel_booting() crutch.
The XSAVE and reshuffle method can still be implemented in the unlikely
case that components are added which have a non-zero init state and no
other means to save them. For now, FXSAVE is just simple and good enough.
[ bp: Fix a typo or two in the text. ]
Fixes: 6bad06b768 ("x86, xsave: Use xsaveopt in context-switch path when supported")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
Split up the #VC handler code into a from-user and a from-kernel part.
This allows clean and correct state tracking, as the #VC handler needs
to enter NMI-state when raised from kernel mode and plain IRQ state when
raised from user-mode.
Fixes: 62441a1fb5 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d8: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This hypercall is used by the SEV guest to notify a change in the page
encryption status to the hypervisor. The hypercall should be invoked
only when the encryption attribute is changed from encrypted -> decrypted
and vice versa. By default all guest pages are considered encrypted.
The hypercall exits to userspace to manage the guest shared regions and
integrate with the userspace VMM's migration code.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots. From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not requied to invalidate such
entries. Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.
Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.
Note, there is no change the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.
Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which requested is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().
Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
if new KVM_*_SREGS2 ioctls are used, the PDPTRs are
a part of the migration state and are correctly
restored by those ioctls.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
for PAE guest, even if the new CR3 is the same as the current CR3. For
VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
unavailable after VM-Exit, i.e. the optimization is dead code except for
SVM without NPT.
In the unlikely scenario that anyone cares about SVM without NPT _and_ a
PAE guest, they've got bigger problems if their guest is loading the same
CR3 so frequently that the performance of kvm_set_cr3() is notable,
especially since KVM's fast PGD switching means reloading the same CR3
does not require a full rebuild. Given that PAE and PCID are mutually
exclusive, i.e. a sync and flush are guaranteed in any case, the actual
benefits of the pdptrs_changed() optimization are marginal at best.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Limiting exposed Hyper-V features requires a fast way to check if the
particular feature is exposed in guest visible CPUIDs or not. To aboid
looping through all CPUID entries on every hypercall/MSR access cache
the required leaves on CPUID update.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
for limiting Hyper-V features to those exposed to the guest in Hyper-V
CPUIDs (0x40000003, 0x40000004, ...).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM added support for certain reserved fields to be used by
software or hypervisor. Add the following reserved fields:
- VMCB offset 0x3e0 - 0x3ff
- Clean bit 31
- SVM intercept exit code 0xf0000000
Later patches will make use of this for supporting Hyper-V
nested virtualization enhancements.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <a1f17a43a8e9e751a1a9cc0281649d71bdbf721b.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bit 22 of HYPERV_CPUID_FEATURES.EDX is specific to SVM and specifies
support for enlightened TLB flush. With this enlightenment enabled,
ASID invalidations flushes only gva->hpa entries. To flush TLB entries
derived from NPT, hypercalls should be used
(HvFlushGuestPhysicalAddressSpace or HvFlushGuestPhysicalAddressList)
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Message-Id: <a060f872d0df1955e52e30b877b3300485edb27c.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the following per-VCPU statistic to KVM debugfs to show if a given
VCPU is in guest mode:
guest_mode
Also add this as a per-VM statistic to KVM debugfs to show the total number
of VCPUs that are in guest mode in a given VM.
Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers. The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.
Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.
Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of 01 and 12 together.
The merging is done using the following equations:
offset_02 = ((offset_01 * mult_12) >> shift_bits) + offset_12
mult_02 = (mult_01 * mult_12) >> shift_bits
Where shift_bits is kvm_tsc_scaling_ratio_frac_bits.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-8-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-7-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing the ratio as
a parameter.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-5-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the TDP MMU is in use, wait to allocate the rmaps until the shadow
MMU is actually used. (i.e. a nested VM is launched.) This saves memory
equal to 0.2% of guest memory in cases where the TDP MMU is used and
there are no nested guests involved.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-8-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a field to control whether new memslots should have rmaps allocated
for them. As of this change, it's not safe to skip allocating rmaps, so
the field is always set to allocate rmaps. Future changes will make it
safe to operate without rmaps, using the TDP MMU. Then further changes
will allow the rmaps to be allocated lazily when needed for nested
oprtation.
No functional change expected.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210518173414.450044-6-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_hv_flush_tlb() has been patched to support XMM hypercall
inputs, we can start advertising this feature to guests.
Cc: Alexander Graf <graf@amazon.com>
Cc: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <e63fc1c61dd2efecbefef239f4f0a598bd552750.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V supports the use of XMM registers to perform fast hypercalls.
This allows guests to take advantage of the improved performance of the
fast hypercall interface even though a hypercall may require more than
(the current maximum of) two input registers.
The XMM fast hypercall interface uses six additional XMM registers (XMM0
to XMM5) to allow the guest to pass an input parameter block of up to
112 bytes.
Add framework to read from XMM registers in kvm_hv_hypercall() and use
the additional hypercall inputs from XMM registers in kvm_hv_flush_tlb()
when possible.
Cc: Alexander Graf <graf@amazon.com>
Co-developed-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Evgeny Iakovlev <eyakovl@amazon.de>
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <fc62edad33f1920fe5c74dde47d7d0b4275a9012.1622019134.git.sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The counter value of a perf task may leak to another RDPMC task.
For example, a perf stat task as below is running on CPU 0.
perf stat -e 'branches,cycles' -- taskset -c 0 ./workload
In the meantime, an RDPMC task, which is also running on CPU 0, may read
the GP counters periodically. (The RDPMC task creates a fixed event,
but read four GP counters.)
$./rdpmc_read_all_counters
index 0x0 value 0x8001e5970f99
index 0x1 value 0x8005d750edb6
index 0x2 value 0x0
index 0x3 value 0x0
index 0x0 value 0x8002358e48a5
index 0x1 value 0x8006bd1e3bc9
index 0x2 value 0x0
index 0x3 value 0x0
It is a potential security issue. Once the attacker knows what the other
thread is counting. The PerfMon counter can be used as a side-channel to
attack cryptosystems.
The counter value of the perf stat task leaks to the RDPMC task because
perf never clears the counter when it's stopped.
Three methods were considered to address the issue.
- Unconditionally reset the counter in x86_pmu_del(). It can bring extra
overhead even when there is no RDPMC task running.
- Only reset the un-assigned dirty counters when the RDPMC task is
scheduled in via sched_task(). It fails for the below case.
Thread A Thread B
clone(CLONE_THREAD) --->
set_affine(0)
set_affine(1)
while (!event-enabled)
;
event = perf_event_open()
mmap(event)
ioctl(event, IOC_ENABLE); --->
RDPMC
Counters are still leaked to the thread B.
- Only reset the un-assigned dirty counters before updating the CR4.PCE
bit. The method is implemented here.
The dirty counter is a counter, on which the assigned event has been
deleted, but the counter is not reset. To track the dirty counters,
add a 'dirty' variable in the struct cpu_hw_events.
The security issue can only be found with an RDPMC task. To enable the
RDMPC, the CR4.PCE bit has to be updated. Add a
perf_clear_dirty_counters() right before updating the CR4.PCE bit to
clear the existing dirty counters. Only the current un-assigned dirty
counters are reset, because the RDPMC assigned dirty counters will be
updated soon.
After applying the patch,
$ ./rdpmc_read_all_counters
index 0x0 value 0x0
index 0x1 value 0x0
index 0x2 value 0x0
index 0x3 value 0x0
index 0x0 value 0x0
index 0x1 value 0x0
index 0x2 value 0x0
index 0x3 value 0x0
Performance
The performance of a context switch only be impacted when there are two
or more perf users and one of the users must be an RDPMC user. In other
cases, there is no performance impact.
The worst-case occurs when there are two users: the RDPMC user only
uses one counter; while the other user uses all available counters.
When the RDPMC task is scheduled in, all the counters, other than the
RDPMC assigned one, have to be reset.
Test results for the worst-case, using a modified lat_ctx as measured
on an Ice Lake platform, which has 8 GP and 3 FP counters (ignoring
SLOTS).
lat_ctx -s 128K -N 1000 processes 2
Without the patch:
The context switch time is 4.97 us
With the patch:
The context switch time is 5.16 us
There is ~4% performance drop for the context switching time in the
worst-case.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1623693582-187370-1-git-send-email-kan.liang@linux.intel.com
Intel client processors that support the IA32_TSX_FORCE_ABORT MSR
related to perf counter interaction [1] received a microcode update that
deprecates the Transactional Synchronization Extension (TSX) feature.
The bit FORCE_ABORT_RTM now defaults to 1, writes to this bit are
ignored. A new bit TSX_CPUID_CLEAR clears the TSX related CPUID bits.
The summary of changes to the IA32_TSX_FORCE_ABORT MSR are:
Bit 0: FORCE_ABORT_RTM (legacy bit, new default=1) Status bit that
indicates if RTM transactions are always aborted. This bit is
essentially !SDV_ENABLE_RTM(Bit 2). Writes to this bit are ignored.
Bit 1: TSX_CPUID_CLEAR (new bit, default=0) When set, CPUID.HLE = 0
and CPUID.RTM = 0.
Bit 2: SDV_ENABLE_RTM (new bit, default=0) When clear, XBEGIN will
always abort with EAX code 0. When set, XBEGIN will not be forced to
abort (but will always abort in SGX enclaves). This bit is intended to
be used on developer systems. If this bit is set, transactional
atomicity correctness is not certain. SDV = Software Development
Vehicle (SDV), i.e. developer systems.
Performance monitoring counter 3 is usable in all cases, regardless of
the value of above bits.
Add support for a new CPUID bit - CPUID.RTM_ALWAYS_ABORT (CPUID 7.EDX[11])
- to indicate the status of always abort behavior.
[1] [ bp: Look for document ID 604224, "Performance Monitoring Impact
of Intel Transactional Synchronization Extension Memory". Since
there's no way for us to have stable links to documents... ]
[ bp: Massage and extend commit message. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/9add61915b4a4eedad74fbd869107863a28b428e.1623704845.git-series.pawan.kumar.gupta@linux.intel.com
When user space brings PKRU into init state, then the kernel handling is
broken:
T1 user space
xsave(state)
state.header.xfeatures &= ~XFEATURE_MASK_PKRU;
xrstor(state)
T1 -> kernel
schedule()
XSAVE(S) -> T1->xsave.header.xfeatures[PKRU] == 0
T1->flags |= TIF_NEED_FPU_LOAD;
wrpkru();
schedule()
...
pk = get_xsave_addr(&T1->fpu->state.xsave, XFEATURE_PKRU);
if (pk)
wrpkru(pk->pkru);
else
wrpkru(DEFAULT_PKRU);
Because the xfeatures bit is 0 and therefore the value in the xsave
storage is not valid, get_xsave_addr() returns NULL and switch_to()
writes the default PKRU. -> FAIL #1!
So that wrecks any copy_to/from_user() on the way back to user space
which hits memory which is protected by the default PKRU value.
Assumed that this does not fail (pure luck) then T1 goes back to user
space and because TIF_NEED_FPU_LOAD is set it ends up in
switch_fpu_return()
__fpregs_load_activate()
if (!fpregs_state_valid()) {
load_XSTATE_from_task();
}
But if nothing touched the FPU between T1 scheduling out and back in,
then the fpregs_state is still valid which means switch_fpu_return()
does nothing and just clears TIF_NEED_FPU_LOAD. Back to user space with
DEFAULT_PKRU loaded. -> FAIL #2!
The fix is simple: if get_xsave_addr() returns NULL then set the
PKRU value to 0 instead of the restrictive default PKRU value in
init_pkru_value.
[ bp: Massage in minor nitpicks from folks. ]
Fixes: 0cecca9d03 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144346.045616965@linutronix.de
switch_fpu_finish() checks current->mm as indicator for kernel threads.
That's wrong because kernel threads can temporarily use a mm of a user
process via kthread_use_mm().
Check the task flags for PF_KTHREAD instead.
Fixes: 0cecca9d03 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210608144345.912645927@linutronix.de
The entire memory range under 1M is unconditionally reserved in
setup_arch(), so there is no need for crash_reserve_low_1M() anymore.
Remove this function.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210601075354.5149-4-rppt@kernel.org
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmC9UH8eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRDYH/3WgnRz5DfVhjmlD
Lg38mPmbZWhFibXghrYrpbVpTyhjGFRuNtXAt2p7/nYnM71wzI6Qkx6cRKZeB5HE
/SqeksPWUEgJaUuoXeQBrBaG7q/+9ph7Rgaf2wP7k+E00RI3E4pbMubuqFAUeikr
itKFD9aTUsgT5XbG2hH5Ddwh5hBD2C/1PVt3jpLnJkXRCn91uEh+R7SHXP/fsjAd
ZaGOVbAGm+jePCQDBXpVUn+8fJdxvQg7rxWVRRRhi5LXG+pnAezbkGl746zBwaSw
K6lmVSA+eAiVkKu6nR4HJv9Hax1juFbp9xpcCo4jzxO5NJF4jsmytjLEaYFdi4NX
G542808=
=BPDL
-----END PGP SIGNATURE-----
Merge tag 'v5.13-rc5' into x86/cleanups
Pick up dependent changes in order to base further cleanups ontop.
Signed-off-by: Borislav Petkov <bp@suse.de>
MSR_AMD64_SEV even though the spec clearly states so, and check CPUID
bits first.
- Send only one signal to a task when it is a SEGV_PKUERR si_code type.
- Do away with all the wankery of reserving X amount of memory in
the first megabyte to prevent BIOS corrupting it and simply and
unconditionally reserve the whole first megabyte.
- Make alternatives NOP optimization work at an arbitrary position
within the patched sequence because the compiler can put single-byte
NOPs for alignment anywhere in the sequence (32-bit retpoline), vs our
previous assumption that the NOPs are only appended.
- Force-disable ENQCMD[S] instructions support and remove update_pasid()
because of insufficient protection against FPU state modification in an
interrupt context, among other xstate horrors which are being addressed
at the moment. This one limits the fallout until proper enablement.
- Use cpu_feature_enabled() in the idxd driver so that it can be
build-time disabled through the defines in .../asm/disabled-features.h.
- Fix LVT thermal setup for SMI delivery mode by making sure the APIC
LVT value is read before APIC initialization so that softlockups during
boot do not happen at least on one machine.
- Mark all legacy interrupts as legacy vectors when the IO-APIC is
disabled and when all legacy interrupts are routed through the PIC.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmC8fdEACgkQEsHwGGHe
VUqO5A/+IbIo8myl8VPjw6HRnHgY8rsYRjxdtmVhbaMi5XOmTMfVA9zJ6QALxseo
Mar8bmWcezEs0/FmNvk1vEOtIgZvRVy5RqXbu3W2EgWICuzRWbj822q+KrkbY0tH
1GWjcZQO8VlgeuQsukyj5QHaBLffpn3Fh1XB8r0cktZvwciM+LRNMnK8d6QjqxNM
ctTX4wdI6kc076pOi7MhKxSe+/xo5Wnf27lClLMOcsO/SS42KqgeRM5psWqxihhL
j6Y3Oe+Nm+7GKF8y841PUSlwjgWmlZa6UkR6DBTP7DGnHDa5hMpzxYvHOquq/SbA
leV9OLqI0iWs56kSzbEcXo7do1kld62KjsA2KtUhJfVAtm+igQLh5G0jESBwrWca
TBWaE5kt6s8wP7LXeg26o4U8XD8vqEH88Tmsjlgqb/t/PKDV9PMGvNpF00dPZFo6
Jhj2yntJYjLQYoAQLuQm5pfnKhZy3KKvk7ViGcnp3iN9i4eU9HzawIiXnliNOrTI
ohQ9KoRhy1Cx0UfLkR+cdK4ks0u26DC2/Ewt0CE5AP/CQ1rX6Zbv2gFLjSpy7yQo
6A99HEpbaLuy3kDt5vn91viPNUlOveuIXIdHp6u+zgFfx88eLUoEvfR135aV/Gyh
p5PJm/BO99KByQzFCnilkp7nBeKtnKYSmUojA6JsZKjzJimSPYo=
=zRI1
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A bunch of x86/urgent stuff accumulated for the last two weeks so
lemme unload it to you.
It should be all totally risk-free, of course. :-)
- Fix out-of-spec hardware (1st gen Hygon) which does not implement
MSR_AMD64_SEV even though the spec clearly states so, and check
CPUID bits first.
- Send only one signal to a task when it is a SEGV_PKUERR si_code
type.
- Do away with all the wankery of reserving X amount of memory in the
first megabyte to prevent BIOS corrupting it and simply and
unconditionally reserve the whole first megabyte.
- Make alternatives NOP optimization work at an arbitrary position
within the patched sequence because the compiler can put
single-byte NOPs for alignment anywhere in the sequence (32-bit
retpoline), vs our previous assumption that the NOPs are only
appended.
- Force-disable ENQCMD[S] instructions support and remove
update_pasid() because of insufficient protection against FPU state
modification in an interrupt context, among other xstate horrors
which are being addressed at the moment. This one limits the
fallout until proper enablement.
- Use cpu_feature_enabled() in the idxd driver so that it can be
build-time disabled through the defines in disabled-features.h.
- Fix LVT thermal setup for SMI delivery mode by making sure the APIC
LVT value is read before APIC initialization so that softlockups
during boot do not happen at least on one machine.
- Mark all legacy interrupts as legacy vectors when the IO-APIC is
disabled and when all legacy interrupts are routed through the PIC"
* tag 'x86_urgent_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Check SME/SEV support in CPUID first
x86/fault: Don't send SIGSEGV twice on SEGV_PKUERR
x86/setup: Always reserve the first 1M of RAM
x86/alternative: Optimize single-byte NOPs at an arbitrary position
x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid()
dmaengine: idxd: Use cpu_feature_enabled()
x86/thermal: Fix LVT thermal setup for SMI delivery mode
x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing
In an upcoming change we would like to add a flag to
GFP_HIGHUSER_MOVABLE so that it would no longer be an OR
of GFP_HIGHUSER and __GFP_MOVABLE. This poses a problem for
alloc_zeroed_user_highpage_movable() which passes __GFP_MOVABLE
into an arch-specific __alloc_zeroed_user_highpage() hook which ORs
in GFP_HIGHUSER.
Since __alloc_zeroed_user_highpage() is only ever called from
alloc_zeroed_user_highpage_movable(), we can remove one level
of indirection here. Remove __alloc_zeroed_user_highpage(),
make alloc_zeroed_user_highpage_movable() the hook, and use
GFP_HIGHUSER_MOVABLE in the hook implementations so that they will
pick up the new flag that we are going to add.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Link: https://linux-review.googlesource.com/id/Ic6361c657b2cdcd896adbe0cf7cb5a7fbb1ed7bf
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20210602235230.3928842-2-pcc@google.com
Signed-off-by: Will Deacon <will@kernel.org>
While digesting the XSAVE-related horrors which got introduced with
the supervisor/user split, the recent addition of ENQCMD-related
functionality got on the radar and turned out to be similarly broken.
update_pasid(), which is only required when X86_FEATURE_ENQCMD is
available, is invoked from two places:
1) From switch_to() for the incoming task
2) Via a SMP function call from the IOMMU/SMV code
#1 is half-ways correct as it hacks around the brokenness of get_xsave_addr()
by enforcing the state to be 'present', but all the conditionals in that
code are completely pointless for that.
Also the invocation is just useless overhead because at that point
it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
and all of this can be handled at return to user space.
#2 is broken beyond repair. The comment in the code claims that it is safe
to invoke this in an IPI, but that's just wishful thinking.
FPU state of a running task is protected by fregs_lock() which is
nothing else than a local_bh_disable(). As BH-disabled regions run
usually with interrupts enabled the IPI can hit a code section which
modifies FPU state and there is absolutely no guarantee that any of the
assumptions which are made for the IPI case is true.
Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
invoked with a NULL pointer argument, so it can hit a completely
unrelated task and unconditionally force an update for nothing.
Worse, it can hit a kernel thread which operates on a user space
address space and set a random PASID for it.
The offending commit does not cleanly revert, but it's sufficient to
force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
code to make this dysfunctional all over the place. Anything more
complex would require more surgery and none of the related functions
outside of the x86 core code are blatantly wrong, so removing those
would be overkill.
As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
required to make this actually work, this cannot result in a regression
except for related out of tree train-wrecks, but they are broken already
today.
Fixes: 20f0afd1fb ("x86/mmu: Allocate/free a PASID")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
AMD and Hygon CPUs have a CPUID bit for RAPL. Drop the fam17h suffix as
it is stale already.
Make use of this instead of a model check to work more nicely in virtual
environments where RAPL typically isn't available.
[ bp: drop the ../cpu/powerflags.c hunk which is superfluous as the
"rapl" bit name appears already in flags. ]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210514135920.16093-1-andrew.cooper3@citrix.com
There are machines out there with added value crap^WBIOS which provide an
SMI handler for the local APIC thermal sensor interrupt. Out of reset,
the BSP on those machines has something like 0x200 in that APIC register
(timestamps left in because this whole issue is timing sensitive):
[ 0.033858] read lvtthmr: 0x330, val: 0x200
which means:
- bit 16 - the interrupt mask bit is clear and thus that interrupt is enabled
- bits [10:8] have 010b which means SMI delivery mode.
Now, later during boot, when the kernel programs the local APIC, it
soft-disables it temporarily through the spurious vector register:
setup_local_APIC:
...
/*
* If this comes from kexec/kcrash the APIC might be enabled in
* SPIV. Soft disable it before doing further initialization.
*/
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
which means (from the SDM):
"10.4.7.2 Local APIC State After It Has Been Software Disabled
...
* The mask bits for all the LVT entries are set. Attempts to reset these
bits will be ignored."
And this happens too:
[ 0.124111] APIC: Switch to symmetric I/O mode setup
[ 0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
[ 0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
This results in CPU 0 soft lockups depending on the placement in time
when the APIC soft-disable happens. Those soft lockups are not 100%
reproducible and the reason for that can only be speculated as no one
tells you what SMM does. Likely, it confuses the SMM code that the APIC
is disabled and the thermal interrupt doesn't doesn't fire at all,
leading to CPU 0 stuck in SMM forever...
Now, before
4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")
due to how the APIC_LVTTHMR was read before APIC initialization in
mcheck_intel_therm_init(), it would read the value with the mask bit 16
clear and then intel_init_thermal() would replicate it onto the APs and
all would be peachy - the thermal interrupt would remain enabled.
But that commit moved that reading to a later moment in
intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
late and with its interrupt mask bit set.
Thus, revert back to the old behavior of reading the thermal LVT
register before the APIC gets initialized.
Fixes: 4f432e8bb1 ("x86/mce: Get rid of mcheck_intel_therm_init()")
Reported-by: James Feeney <james@nurealm.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
* Another state update on exit to userspace fix
* Prevent the creation of mixed 32/64 VMs
* Fix regression with irqbypass not restarting the guest on failed connect
* Fix regression with debug register decoding resulting in overlapping access
* Commit exception state on exit to usrspace
* Fix the MMU notifier return values
* Add missing 'static' qualifiers in the new host stage-2 code
x86 fixes:
* fix guest missed wakeup with assigned devices
* fix WARN reported by syzkaller
* do not use BIT() in UAPI headers
* make the kvm_amd.avic parameter bool
PPC fixes:
* make halt polling heuristics consistent with other architectures
selftests:
* various fixes
* new performance selftest memslot_perf_test
* test UFFD minor faults in demand_paging_test
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCyF0MUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOHSgf/Q4Hm5e12Bj2xJy6A+iShnrbbT8PW
hcIIOA7zGWXfjVYcBV7anbj7CcpzfIz0otcRBABa5mkhj+fb3YmPEb0EzCPi4Hru
zxpcpB2w7W7WtUOIKe2EmaT+4Pk6/iLcfr8UMHMqx460akE9OmIg10QNWai3My/3
RIOeakSckBI9e/1TQZbxH66dsLwCT0lLco7i7AWHdFxkzUQyoA34HX5pczOCBsO5
3nXH+/txnRVhqlcyzWLVVGVzFqmpHtBqkIInDOXfUqIoxo/gOhOgF1QdMUEKomxn
5ZFXlL5IXNtr+7yiI67iHX7CWkGZE9oJ04TgPHn6LR6wRnVvc3JInzcB5Q==
=ollO
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes:
- Another state update on exit to userspace fix
- Prevent the creation of mixed 32/64 VMs
- Fix regression with irqbypass not restarting the guest on failed
connect
- Fix regression with debug register decoding resulting in
overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
x86 fixes:
- fix guest missed wakeup with assigned devices
- fix WARN reported by syzkaller
- do not use BIT() in UAPI headers
- make the kvm_amd.avic parameter bool
PPC fixes:
- make halt polling heuristics consistent with other architectures
selftests:
- various fixes
- new performance selftest memslot_perf_test
- test UFFD minor faults in demand_paging_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
selftests: kvm: fix overlapping addresses in memslot_perf_test
KVM: X86: Kill off ctxt->ud
KVM: X86: Fix warning caused by stale emulation context
KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
KVM: x86/mmu: Fix comment mentioning skip_4k
KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
KVM: x86: add start_assignment hook to kvm_x86_ops
KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
selftests: kvm: do only 1 memslot_perf_test run by default
KVM: X86: Use _BITUL() macro in UAPI headers
KVM: selftests: add shared hugetlbfs backing source type
KVM: selftests: allow using UFFD minor faults for demand paging
KVM: selftests: create alias mappings when using shared memory
KVM: selftests: add shmem backing source type
KVM: selftests: refactor vm_mem_backing_src_type flags
KVM: selftests: allow different backing source types
KVM: selftests: compute correct demand paging size
KVM: selftests: simplify setup_demand_paging error handling
KVM: selftests: Print a message if /dev/kvm is missing
...
PIC interrupts do not support affinity setting and they can end up on
any online CPU. Therefore, it's required to mark the associated vectors
as system-wide reserved. Otherwise, the corresponding irq descriptors
are copied to the secondary CPUs but the vectors are not marked as
assigned or reserved. This works correctly for the IO/APIC case.
When the IO/APIC is disabled via config, kernel command line or lack of
enumeration then all legacy interrupts are routed through the PIC, but
nothing marks them as system-wide reserved vectors.
As a consequence, a subsequent allocation on a secondary CPU can result in
allocating one of these vectors, which triggers the BUG() in
apic_update_vector() because the interrupt descriptor slot is not empty.
Imran tried to work around that by marking those interrupts as allocated
when a CPU comes online. But that's wrong in case that the IO/APIC is
available and one of the legacy interrupts, e.g. IRQ0, has been switched to
PIC mode because then marking them as allocated will fail as they are
already marked as system vectors.
Stay consistent and update the legacy vectors after attempting IO/APIC
initialization and mark them as system vectors in case that no IO/APIC is
available.
Fixes: 69cde0004a ("x86/vector: Use matrix allocator for vector assignment")
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
Add the (HWID, MCATYPE) tuples and names for new SMCA bank types.
Also, add their respective error descriptions to the MCE decoding module
edac_mce_amd. Also while at it, optimize the string names for some SMCA
banks.
[ bp: Drop repeated comments, explain why UMC_V2 is a separate entry. ]
Signed-off-by: Muralidhara M K <muralimk@amd.com>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20210526164601.66228-1-nchatrad@amd.com
Add a start_assignment hook to kvm_x86_ops, which is called when
kvm_arch_start_assignment is done.
The hook is required to update the wakeup vector of a sleeping vCPU
when a device is assigned to the guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210525134321.254128742@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Subsequent patches will move architectures over to the ARCH_ATOMIC API,
after preparing the asm-generic atomic implementations to function with
or without ARCH_ATOMIC.
As some architectures use the asm-generic implementations exclusively
(and don't have a local atomic.h), and to avoid the risk that
ARCH_ATOMIC isn't defined in some cases we expect, let's make the
ARCH_ATOMIC macro a Kconfig symbol instead, so that we can guarantee it
is consistently available where needed.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210525140232.53872-2-mark.rutland@arm.com
System call numbers are defined as int, so use int everywhere for system
call numbers. This is strictly a cleanup; it should not change anything
user visible; all ABI changes have been done in the preceeding patches.
[ tglx: Replaced the unsigned long cast ]
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210518191303.4135296-7-hpa@zytor.com
In some places, the native forms of descriptor table invalidation is
required. Rather than open-coding them, add explicitly native functions to
invalidate the GDT and IDT.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-6-hpa@zytor.com
There is no reason to specify any specific address to idt_invalidate(). It
looks mostly like an artifact of unifying code done differently by
accident. The most "sensible" address to set here is a NULL pointer -
virtual address zero, just as a visual marker.
This also makes it possible to mark the struct desc_ptr in idt_invalidate()
as static const.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210519212154.511983-5-hpa@zytor.com
Add defines for the number of external vectors and number of system
vectors instead of requiring the use of (FIRST_SYSTEM_VECTOR -
FIRST_EXTERNAL_VECTOR) and (NR_VECTORS - FIRST_SYSTEM_VECTOR)
respectively. Clean up the usage sites.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20210519212154.511983-3-hpa@zytor.com
UV_BAU_MESSAGE is defined but not used anywhere in the kernel. Presumably
this is a stale vector number that can be reclaimed.
MCE_VECTOR is not an actual vector: #MC is an exception, not an interrupt
vector, and as such is correctly described as X86_TRAP_MC. MCE_VECTOR is
not used anywhere is the kernel.
Note that NMI_VECTOR *is* used; specifically it is the vector number
programmed into the APIC LVT when an NMI interrupt is configured. At
the moment it is always numerically identical to X86_TRAP_NMI, that is
not necessarily going to be the case indefinitely.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20210519212154.511983-4-hpa@zytor.com
__NR_syscall_max is only used by x86 and UML. In contrast, __NR_syscalls is
widely used by all the architectures.
Convert __NR_syscall_max to __NR_syscalls and adjust the usage sites.
This prepares x86 to switch to the generic syscallhdr.sh script.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-6-masahiroy@kernel.org
Many architectures duplicate similar shell scripts.
Convert x86 and UML to use scripts/syscalltbl.sh. The generic script
generates seperate headers for x86/64 and x86/x32 syscalls, while the x86
specific script coalesced them into one. Adjust the code accordingly.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-3-masahiroy@kernel.org
The SYSCALL macros are mapped to symbols as follows:
__SYSCALL_COMMON(nr, sym) --> __x64_<sym>
__SYSCALL_X32(nr, sym) --> __x32_<sym>
Originally, the syscalls in the x32 special range (512-547) were all
compat.
This assumption is now broken after the following commits:
55db9c0e85 ("net: remove compat_sys_{get,set}sockopt")
5f764d624a ("fs: remove the compat readv/writev syscalls")
598b3cec83 ("fs: remove compat_sys_vmsplice")
c3973b401e ("mm: remove compat_process_vm_{readv,writev}")
Those commits redefined __x32_sys_* to __x64_sys_* because there is no stub
like __x32_sys_*.
Defining them as follows is more sensible and cleaner.
__SYSCALL_COMMON(nr, sym) --> __x64_<sym>
__SYSCALL_X32(nr, sym) --> __x64_<sym>
This works because both x86_64 and x32 use the same ABI (RDI, RSI, RDX,
R10, R8, R9)
The ugly #define __x32_sys_* will go away.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210517073815.97426-2-masahiroy@kernel.org
Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
use by all architectures with sigaltstack(2). Over time, the hardware state
size grew, but these constants did not evolve. Today, literal use of these
constants on several architectures may result in signal stack overflow, and
thus user data corruption.
A few years ago, the ARM team addressed this issue by establishing
getauxval(AT_MINSIGSTKSZ). This enables the kernel to supply a value
at runtime that is an appropriate replacement on current and future
hardware.
Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
added for ARM in
94b07c1f8c ("arm64: signal: Report signal frame size to userspace via auxv").
Also, include a documentation to describe x86-specific auxiliary vectors.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210518200320.17239-4-chang.seok.bae@intel.com
Signal frames do not have a fixed format and can vary in size when a number
of things change: supported XSAVE features, 32 vs. 64-bit apps, etc.
Add support for a runtime method for userspace to dynamically discover
how large a signal stack needs to be.
Introduce a new variable, max_frame_size, and helper functions for the
calculation to be used in a new user interface. Set max_frame_size to a
system-wide worst-case value, instead of storing multiple app-specific
values.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: H.J. Lu <hjl.tools@gmail.com>
Link: https://lkml.kernel.org/r/20210518200320.17239-3-chang.seok.bae@intel.com
A basic IDT setup for the boot CPU has to be done before invoking
cpu_init() because that might trigger #GP when accessing certain MSRs. This
setup cannot install the IST variants on 64-bit because the TSS setup which
is required for ISTs to work happens in cpu_init(). That leaves a
theoretical window where a NMI would invoke the ASM entry point which
relies on IST being enabled on the kernel stack which is undefined
behaviour.
This setup logic has never worked correctly, but on the other hand a NMI
hitting the boot CPU before it has fully set up the IDT would be fatal
anyway. So the small window between the wrong NMI gate and the IST based
NMI gate is not really adding a substantial amount of risk.
But the setup logic is nevertheless more convoluted than necessary. The
recent separation of the TSS setup into a separate function to ensure that
setup so it can setup TSS first, then initialize IDT with the IST variants
before invoking cpu_init() and get rid of the post cpu_init() IST setup.
Move the invocation of cpu_init_exception_handling() ahead of
idt_setup_traps() and merge the IST setup into the default setup table.
Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210507114000.569244755@linutronix.de
SEV-ES guests require properly setup task register with which the TSS
descriptor in the GDT can be located so that the IST-type #VC exception
handler which they need to function properly, can be executed.
This setup needs to happen before attempting to load microcode in
ucode_cpu_init() on secondary CPUs which can cause such #VC exceptions.
Simplify the machinery by running that exception setup from a new function
cpu_init_secondary() and explicitly call cpu_init_exception_handling() for
the boot CPU before cpu_init(). The latter prepares for fixing and
simplifying the exception/IST setup on the boot CPU.
There should be no functional changes resulting from this patch.
[ tglx: Reworked it so cpu_init_exception_handling() stays seperate ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <laijs@linux.alibaba.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/87k0o6gtvu.ffs@nanos.tec.linutronix.de
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
sP2LC4iv
=57tg
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
- Use the ALARM feature check in the alarmtimer core code insted of
the old method of checking for the set_alarm() callback. Drivers
can have that callback set but the feature bit cleared. If such
a RTC device is selected then alarms wont work.
- Use a proper define to let the preprocessor check whether Hyper-V VDSO
clocksource should be active. The code used a constant in an enum with
#ifdef, which evaluates to always false and disabled the clocksource
for VDSO.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmChLI8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUJMD/wOQ/R7jXe/EWti3+w11TATvkP+ZzDv
LcAfZ/ZP8wgrUTbjLqTTyeOFoI9q39emnq3FvCoRsF+rdHRbnZNAB3kWQmh/i1tL
j8BuGogzvVLkBmriQIzVxYgEroCZVySWkO27B7ToBq64IeI4IBVB4jQiJis614m7
5wTHKgN0MkAtWUmwDqkqycFDuWyZNPkR3Ht26zk46Lvk0dmIPh14zbVzezfFEtq4
9DBeGuLDLVtzaBNLWUvnpXL7wxuFB+E8euO5otbmgRNz7CXaE6e6zy6zspK2ahmp
FRq+nrG6yK6ucoFhGFABfKZCGorhh1ghhniPUXQKP9B29z146pN6TLFAVAutBk4z
RoRdyGb9npoO1pB0f2tl0U65TBBlMCnLnDB3hcQ/eyMG7AC8ABHalBIFUjzEPB4b
3eDa+ZxfkW8/oiSLTssQiJ6TJW1EQNaVja1TuHvtPi5RdasbS4LEkQnDaePQ3/nl
tDLekfsDF4KxetZehIlRDqyN9cqIHVphs3pTysyWR7+aOTduWWF58ZtgR7SvTCVu
7Zu+PhP06A1MtEugnwcAcpG5XYCsAXdZXinuQhPndXqazN4wMJkanXNk03z//JmQ
wG//lFAC+9EfA8i9RDr2DeE6JISD2g+jj2Di9bjjxelp5Mi0bNZ0zdIiww6EJjRg
v4F0vCp3By8SQg==
=TruV
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for timers:
- Use the ALARM feature check in the alarmtimer core code insted of
the old method of checking for the set_alarm() callback.
Drivers can have that callback set but the feature bit cleared. If
such a RTC device is selected then alarms wont work.
- Use a proper define to let the preprocessor check whether Hyper-V
VDSO clocksource should be active.
The code used a constant in an enum with #ifdef, which evaluates to
always false and disabled the clocksource for VDSO"
* tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86
alarmtimer: Check RTC features instead of ops
- Reorganize SEV code to streamline and simplify future development
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCg1XQACgkQEsHwGGHe
VUpRKA//dwzDD1QU16JucfhgFlv/9OTm48ukSwAb9lZjDEy4H1CtVL3xEHFd7L3G
LJp0LTW+OQf0/0aGlQp/cP6sBF6G9Bf4mydx70Id4SyCQt8eZDodB+ZOOWbeteWq
p92fJPbX8CzAglutbE+3v/MD8CCAllTiLZnJZPVj4Kux2/wF6EryDgF1+rb5q8jp
ObTT9817mHVwWVUYzbgceZtd43IocOlKZRmF1qivwScMGylQTe1wfMjunpD5pVt8
Zg4UDNknNfYduqpaG546E6e1zerGNaJK7SHnsuzHRUVU5icNqtgBk061CehP9Ksq
DvYXLUl4xF16j6xJAqIZPNrBkJGdQf4q1g5x2FiBm7rSQU5owzqh5rkVk4EBFFzn
UtzeXpqbStbsZHXycyxBNdq2HXxkFPf2NXZ+bkripPg+DifOGots1uwvAft+6iAE
GudK6qxAvr8phR1cRyy6BahGtgOStXbZYEz0ZdU6t7qFfZMz+DomD5Jimj0kAe6B
s6ras5xm8q3/Py87N/KNjKtSEpgsHv/7F+idde7ODtHhpRL5HCBqhkZOSRkMMZqI
ptX1oSTvBXwRKyi5x9YhkKHUFqfFSUTfJhiRFCWK+IEAv3Y7SipJtfkqxRbI6fEV
FfCeueKDDdViBtseaRceVLJ8Tlr6Qjy27fkPPTqJpthqPpCdoZ0=
=ENfF
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"The three SEV commits are not really urgent material. But we figured
since getting them in now will avoid a huge amount of conflicts
between future SEV changes touching tip, the kvm and probably other
trees, sending them to you now would be best.
The idea is that the tip, kvm etc branches for 5.14 will all base
ontop of -rc2 and thus everything will be peachy. What is more, those
changes are purely mechanical and defines movement so they should be
fine to go now (famous last words).
Summary:
- Enable -Wundef for the compressed kernel build stage
- Reorganize SEV code to streamline and simplify future development"
* tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/compressed: Enable -Wundef
x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
x86/sev-es: Rename sev-es.{ch} to sev.{ch}
Mohammed reports (https://bugzilla.kernel.org/show_bug.cgi?id=213029)
the commit e4ab4658f1 ("clocksource/drivers/hyper-v: Handle vDSO
differences inline") broke vDSO on x86. The problem appears to be that
VDSO_CLOCKMODE_HVCLOCK is an enum value in 'enum vdso_clock_mode' and
'#ifdef VDSO_CLOCKMODE_HVCLOCK' branch evaluates to false (it is not
a define).
Use a dedicated HAVE_VDSO_CLOCKMODE_HVCLOCK define instead.
Fixes: e4ab4658f1 ("clocksource/drivers/hyper-v: Handle vDSO differences inline")
Reported-by: Mohammed Gamal <mgamal@redhat.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210513073246.1715070-1-vkuznets@redhat.com
Stephen Rothwell reported that the objtool cross-build breaks on
non-x86 hosts:
> tools/arch/x86/include/asm/asm.h:185:24: error: invalid register name for 'current_stack_pointer'
> 185 | register unsigned long current_stack_pointer asm(_ASM_SP);
> | ^~~~~~~~~~~~~~~~~~~~~
The PowerPC host obviously doesn't know much about x86 register names.
Protect the kernel-specific bits of <asm/asm.h>, so that it can be
included by tooling and cross-built.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some AMD Ryzen generations has different calculation method on maximum
performance. 255 is not for all ASICs, some specific generations should use 166
as the maximum performance. Otherwise, it will report incorrect frequency value
like below:
~ → lscpu | grep MHz
CPU MHz: 3400.000
CPU max MHz: 7228.3198
CPU min MHz: 2200.0000
[ mingo: Tidied up whitespace use. ]
[ Alexander Monakov <amonakov@ispras.ru>: fix 225 -> 255 typo. ]
Fixes: 41ea667227 ("x86, sched: Calculate frequency invariance for AMD systems")
Fixes: 3c55e94c0a ("cpufreq: ACPI: Extend frequency tables to cover boost frequencies")
Reported-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Fixed-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jason Bagavatsingham <jason.bagavatsingham@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425073451.2557394-1-ray.huang@amd.com
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211791
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that objtool is able to rewrite jump_label instructions, have the
compiler emit a JMP, such that it can decide on the optimal encoding,
and set jump_entry::key bit1 to indicate that objtool should rewrite
the instruction to a matching NOP.
For x86_64-allyesconfig this gives:
jl\ NOP JMP
short: 22997 124
long: 30874 90
IOW, we save (22997+124) * 3 bytes of kernel text in hotpaths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194158.216763632@infradead.org
Now that we can patch short JMP/NOP, allow the compiler/assembler to
emit short JMP instructions.
There is no way to have the assembler emit short NOPs based on the
potential displacement, so leave those long for now.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.967034497@infradead.org
Both arch_static_branch() and arch_static_branch_jump() have the same
blurb to generate the __jump_table entry, share it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.663132781@infradead.org
In prepration for variable size jump_label support; remove all ASM
bits, which are currently unused.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210506194157.599716762@infradead.org
As pointed out by commit
de9b8f5dcb ("sched: Fix crash trying to dequeue/enqueue the idle thread")
init_idle() can and will be invoked more than once on the same idle
task. At boot time, it is invoked for the boot CPU thread by
sched_init(). Then smp_init() creates the threads for all the secondary
CPUs and invokes init_idle() on them.
As the hotplug machinery brings the secondaries to life, it will issue
calls to idle_thread_get(), which itself invokes init_idle() yet again.
In this case it's invoked twice more per secondary: at _cpu_up(), and at
bringup_cpu().
Given smp_init() already initializes the idle tasks for all *possible*
CPUs, no further initialization should be required. Now, removing
init_idle() from idle_thread_get() exposes some interesting expectations
with regards to the idle task's preempt_count: the secondary startup always
issues a preempt_disable(), requiring some reset of the preempt count to 0
between hot-unplug and hotplug, which is currently served by
idle_thread_get() -> idle_init().
Given the idle task is supposed to have preemption disabled once and never
see it re-enabled, it seems that what we actually want is to initialize its
preempt_count to PREEMPT_DISABLED and leave it there. Do that, and remove
init_idle() from idle_thread_get().
Secondary startups were patched via coccinelle:
@begone@
@@
-preempt_disable();
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512094636.2958515-1-valentin.schneider@arm.com
Drop the bitness ifdeffery in favor of using _ASM_SP,
which is the helper macro for the rSP register specification
for 32 and 64 bit depending on the build.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210512093310.5635-1-bp@alien8.de
Even if these APIs are disabled, and the arrays therefore do not
exist, having the prototypes allows us to use IS_ENABLED() rather than
using #ifdefs.
If something ends up trying to actually *use* these arrays a linker
error will ensue.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-4-hpa@zytor.com
Reverse the order of arguments to do_syscall_64() so that the first
argument is the pt_regs pointer. This is not only consistent with
*all* other entry points from assembly, but it actually makes the
compiled code slightly better.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510185316.3307264-3-hpa@zytor.com
* Fix virtualization of RDPID
* Virtualization of DR6_BUS_LOCK, which on bare metal is new in
the 5.13 merge window
* More nested virtualization migration fixes (nSVM and eVMCS)
* Fix for KVM guest hibernation
* Fix for warning in SEV-ES SRCU usage
* Block KVM from loading on AMD machines with 5-level page tables,
due to the APM not mentioning how host CR4.LA57 exactly impacts
the guest.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCZWwgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE9wgAk7Io8cuvnhC9ogVqzZWrPweWqFg8
fJcPMB584JRnMqYHBVYbkTPGe8SsCHKR2MKsNdc4cEP111cyr3suWsxOdmjJn58i
7ahy6PcKx7wWeWwEt7O599l6CeoX5XB9ExvA6eiXAv7iZeOJHFa+Ny2GlWgauy6Y
DELryEomx1r4IUkZaSR+2fYjzvOWTXQixwU/jwx8NcTJz0DrzknzLE7XOciPBfn0
t0Q2rCXdL2nF1uPksZbntx8Qoa6t6GDVIyrH/ZCPQYJtAX6cjxNAh3zwCe+hMnOd
fW8ntBH1nZRiNnberA4IICAzqnUokgPWdKBrZT2ntWHBK+aqxXHznrlPJA==
=e+gD
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Lots of bug fixes.
- Fix virtualization of RDPID
- Virtualization of DR6_BUS_LOCK, which on bare metal is new to this
release
- More nested virtualization migration fixes (nSVM and eVMCS)
- Fix for KVM guest hibernation
- Fix for warning in SEV-ES SRCU usage
- Block KVM from loading on AMD machines with 5-level page tables, due
to the APM not mentioning how host CR4.LA57 exactly impacts the
guest.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (48 commits)
KVM: SVM: Move GHCB unmapping to fix RCU warning
KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpers
kvm: Cap halt polling at kvm->max_halt_poll_ns
tools/kvm_stat: Fix documentation typo
KVM: x86: Prevent deadlock against tk_core.seq
KVM: x86: Cancel pvclock_gtod_work on module removal
KVM: x86: Prevent KVM SVM from loading on kernels with 5-level paging
KVM: X86: Expose bus lock debug exception to guest
KVM: X86: Add support for the emulation of DR6_BUS_LOCK bit
KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacks
KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failed
KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model
KVM: x86: Move uret MSR slot management to common x86
KVM: x86: Export the number of uret MSRs to vendor modules
KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional way
KVM: VMX: Use common x86's uret MSR list as the one true list
KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list
KVM: VMX: Configure list of user return MSRs at module init
KVM: x86: Add support for RDPID without RDTSCP
KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in host
...
There are several architectures that just duplicate the contents
of asm-generic/unaligned.h, so change those over to use the
file directly, to make future modifications easier.
The exceptions are:
- arm32 sets HAVE_EFFICIENT_UNALIGNED_ACCESS, but wants the
unaligned-struct version
- ppc64le disables HAVE_EFFICIENT_UNALIGNED_ACCESS but includes
the access-ok version
- most m68k also uses the access-ok version without setting
HAVE_EFFICIENT_UNALIGNED_ACCESS.
- sh4a has a custom inline asm version
- openrisc is the only one using the memmove version that
generally leads to worse code.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Use the new generalized _ASM_BYTES() macro from <asm/asm.h> instead of
the "home grown" _ASM_MK_NOP() in <asm/nops.h>.
Add <asm/asm.h> and update <asm/nops.h> in the tools directory...
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-4-hpa@zytor.com
Make it easy to create a sequence of bytes that can be used in either
assembly proper on in a C asm() statement.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-3-hpa@zytor.com
The __ASM_FORM macros are really useful, but in order to be able to
use them to define instructions via .byte directives breaks because of
the necessary commas. Change the macros to handle commas correctly.
[ mingo: Removed stray whitespaces & aligned the definitions vertically. ]
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210510090940.924953-2-hpa@zytor.com
The SYSCFG MSR continued being updated beyond the K8 family; drop the K8
name from it.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-4-brijesh.singh@amd.com
The guest and the hypervisor contain separate macros to get and set
the GHCB MSR protocol and NAE event fields. Consolidate the GHCB
protocol definitions and helper macros in one place.
Leave the supported protocol version define in separate files to keep
the guest and hypervisor flexibility to support different GHCB version
in the same release.
There is no functional change intended.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-3-brijesh.singh@amd.com
SEV-SNP builds upon the SEV-ES functionality while adding new hardware
protection. Version 2 of the GHCB specification adds new NAE events that
are SEV-SNP specific. Rename the sev-es.{ch} to sev.{ch} so that all
SEV* functionality can be consolidated in one place.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20210427111636.1207-2-brijesh.singh@amd.com
Disallow loading KVM SVM if 5-level paging is supported. In theory, NPT
for L1 should simply work, but there unknowns with respect to how the
guest's MAXPHYADDR will be handled by hardware.
Nested NPT is more problematic, as running an L1 VMM that is using
2-level page tables requires stacking single-entry PDP and PML4 tables in
KVM's NPT for L2, as there are no equivalent entries in L1's NPT to
shadow. Barring hardware magic, for 5-level paging, KVM would need stack
another layer to handle PML5.
Opportunistically rename the lm_root pointer, which is used for the
aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to
call out that it's specifically for PML4.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210505204221.1934471-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bus lock debug exception introduces a new bit DR6_BUS_LOCK (bit 11 of
DR6) to indicate that bus lock #DB exception is generated. The set/clear
of DR6_BUS_LOCK is similar to the DR6_RTM. The processor clears
DR6_BUS_LOCK when the exception is generated. For all other #DB, the
processor sets this bit to 1. Software #DB handler should set this bit
before returning to the interrupted task.
In VMM, to avoid breaking the CPUs without bus lock #DB exception
support, activate the DR6_BUS_LOCK conditionally in DR6_FIXED_1 bits.
When intercepting the #DB exception caused by bus locks, bit 11 of the
exit qualification is set to identify it. The VMM should emulate the
exception by clearing the bit 11 of the guest DR6.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior. While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.
Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX. On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out and export the number of configured user return MSRs so that
VMX can iterate over the set of MSRs without having to do its own tracking.
Keep the list itself internal to x86 so that vendor code still has to go
through the "official" APIs to add/modify entries.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop VMX's global list of user return MSRs now that VMX doesn't resort said
list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have
the same MSRs in the same order.
In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable preemption when probing a user return MSR via RDSMR/WRMSR. If
the MSR holds a different value per logical CPU, the WRMSR could corrupt
the host's value if KVM is preempted between the RDMSR and WRMSR, and
then rescheduled on a different CPU.
Opportunistically land the helper in common x86, SVM will use the helper
in a future commit.
Fixes: 4be5341026 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation")
Cc: stable@vger.kernel.org
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-6-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Crash shutdown handler only disables kvmclock and steal time, other PV
features remain active so we risk corrupting memory or getting some
side-effects in kdump kernel. Move crash handler to kvm.c and unify
with CPU offline.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currenly, we disable kvmclock from machine_shutdown() hook and this
only happens for boot CPU. We need to disable it for all CPUs to
guard against memory corruption e.g. on restore from hibernate.
Note, writing '0' to kvmclock MSR doesn't clear memory location, it
just prevents hypervisor from updating the location so for the short
while after write and while CPU is still alive, the clock remains usable
and correct so we don't need to switch to some other clocksource.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In VMX, the host NMI handler needs to be invoked after NMI VM-Exit.
Before commit 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect
call instead of INTn"), this was done by INTn ("int $2"). But INTn
microcode is relatively expensive, so the commit reworked NMI VM-Exit
handling to invoke the kernel handler by function call.
But this missed a detail. The NMI entry point for direct invocation is
fetched from the IDT table and called on the kernel stack. But on 64-bit
the NMI entry installed in the IDT expects to be invoked on the IST stack.
It relies on the "NMI executing" variable on the IST stack to work
correctly, which is at a fixed position in the IST stack. When the entry
point is unexpectedly called on the kernel stack, the RSP-addressed "NMI
executing" variable is obviously also on the kernel stack and is
"uninitialized" and can cause the NMI entry code to run in the wrong way.
Provide a non-ist entry point for VMX which shares the C-function with
the regular NMI entry and invoke the new asm entry point instead.
On 32-bit this just maps to the regular NMI entry point as 32-bit has no
ISTs and is not affected.
[ tglx: Made it independent for backporting, massaged changelog ]
Fixes: 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lai Jiangshan <laijs@linux.alibaba.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de
Drop write_tsc() and write_rdtscp_aux(); the former has no users, and the
latter has only a single user and is slightly misleading since the only
in-kernel consumer of MSR_TSC_AUX is RDPID, not RDTSCP.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210504225632.1532621-3-seanjc@google.com
We used to generate this constant with static jumps, which certainly
works, but generates some quite unreadable and horrid code, and extra
jumps.
It's actually much simpler to just use our alternative_asm()
infrastructure to generate a simple alternative constant, making the
generated code much more obvious (and straight-line rather than "jump
around to load the right constant").
Acked-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
* Define and use an invalid GPA (all ones) for init value of last
and current nested vmcb physical addresses.
* Reset the current vmcb12 gpa to the invalid value when leaving
the nested mode, similar to what is done on nested vmexit.
* Reset the last seen vmcb12 address when disabling the nested SVM,
as it relies on vmcb02 fields which are freed at that point.
Fixes: 4995a3685f ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210503125446.1353307-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Merge misc updates from Andrew Morton:
"A few misc subsystems and some of MM.
175 patches.
Subsystems affected by this patch series: ia64, kbuild, scripts, sh,
ocfs2, kfifo, vfs, kernel/watchdog, and mm (slab-generic, slub,
kmemleak, debug, pagecache, msync, gup, memremap, memcg, pagemap,
mremap, dma, sparsemem, vmalloc, documentation, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (175 commits)
mm/memory-failure: unnecessary amount of unmapping
mm/mmzone.h: fix existing kernel-doc comments and link them to core-api
mm: page_alloc: ignore init_on_free=1 for debug_pagealloc=1
net: page_pool: use alloc_pages_bulk in refill code path
net: page_pool: refactor dma_map into own function page_pool_dma_map
SUNRPC: refresh rq_pages using a bulk page allocator
SUNRPC: set rq_page_end differently
mm/page_alloc: inline __rmqueue_pcplist
mm/page_alloc: optimize code layout for __alloc_pages_bulk
mm/page_alloc: add an array-based interface to the bulk page allocator
mm/page_alloc: add a bulk page allocator
mm/page_alloc: rename alloced to allocated
mm/page_alloc: duplicate include linux/vmalloc.h
mm, page_alloc: avoid page_to_pfn() in move_freepages()
mm/Kconfig: remove default DISCONTIGMEM_MANUAL
mm: page_alloc: dump migrate-failed pages
mm/mempolicy: fix mpol_misplaced kernel-doc
mm/mempolicy: rewrite alloc_pages_vma documentation
mm/mempolicy: rewrite alloc_pages documentation
mm/mempolicy: rename alloc_pages_current to alloc_pages
...
- Enable KFENCE for 32-bit.
- Implement EBPF for 32-bit.
- Convert 32-bit to do interrupt entry/exit in C.
- Convert 64-bit BookE to do interrupt entry/exit in C.
- Changes to our signal handling code to use user_access_begin/end() more extensively.
- Add support for time namespaces (CONFIG_TIME_NS)
- A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le Goater, Chen Huang, Chris
Packham, Christophe Leroy, Christopher M. Riedl, Colin Ian King, Dan Carpenter, Daniel
Axtens, Daniel Henrique Barboza, David Gibson, Davidlohr Bueso, Denis Efremov,
dingsenjie, Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren Myneni, He Ying,
Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee Jones, Leonardo Bras, Li Huafei,
Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Nathan Chancellor, Nathan
Lynch, Nicholas Piggin, Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi
Bangoria, Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen Rothwell, Thadeu Lima
de Souza Cascardo, Thomas Gleixner, Tony Ambardar, Tyrel Datwyler, Vaibhav Jain,
Vincenzo Frascino, Xiongwei Song, Yang Li, Yu Kuai, Zhang Yunkai.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmCLV1kTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgLUyD/4jrTolG4sVec211hYO+0VuJzoqN4Cf
j2CA2Ju39butnSMiq4LJUPRB7QRZY1OofkoNFpZeDQspjfZXPz2ulpYAz+SxHWE2
ReHPmWH1rOABlUPXFboePF4OLwmAs9eR5mN2z9HpKXbT3k78HaToLqiONyB4fVCr
Q5TkJeRn/Y7ZJLdyPLTpczHHleQ8KoM6kT7ncXnTm6p97JOBJSrGaJ5N/8X5a4+e
6jtgB7Pvw8jNDShSr8BDLBgBZZcmoTiuG8KfgwRZ+m+mKB1yI2X8S/a54w/lDi9g
UcSv3jQcFLJuW+T/pYe4R330uWDYa0cwjJOtMmsJ98S4EYOevoe9fZuL97qNshme
xtBr4q1i03G1icYOJJ8dXtvabG2rUzj8t1SCDpwYfrynzTWVRikiQYTXUBhRSFoK
nsoklvKd2IZa485XYJ2ljSyClMy8S4yJJ9RuzZ94DTXDSJUesKuyRWGnso4mhkcl
wvl4wwMTJvnCMKVo6dsJyV24QWfd6dABxzm04uPA94CKhG33UwK8252jXVeaohSb
WSO7qWBONgDXQLJ0mXRcEYa9NHvFS4Jnp6APbxnHr1gS+K+PNkD4gPBf34FoyN0E
9s27kvEYk5vr8APUclETF6+FkbGUD5bFbusjt3hYloFpAoHQ/k5pFVDsOZNPA8sW
fDIRp05KunDojw==
=dfKL
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Enable KFENCE for 32-bit.
- Implement EBPF for 32-bit.
- Convert 32-bit to do interrupt entry/exit in C.
- Convert 64-bit BookE to do interrupt entry/exit in C.
- Changes to our signal handling code to use user_access_begin/end()
more extensively.
- Add support for time namespaces (CONFIG_TIME_NS)
- A series of fixes that allow us to reenable STRICT_KERNEL_RWX.
- Other smaller features, fixes & cleanups.
Thanks to Alexey Kardashevskiy, Andreas Schwab, Andrew Donnellan, Aneesh
Kumar K.V, Athira Rajeev, Bhaskar Chowdhury, Bixuan Cui, Cédric Le
Goater, Chen Huang, Chris Packham, Christophe Leroy, Christopher M.
Riedl, Colin Ian King, Dan Carpenter, Daniel Axtens, Daniel Henrique
Barboza, David Gibson, Davidlohr Bueso, Denis Efremov, dingsenjie,
Dmitry Safonov, Dominic DeMarco, Fabiano Rosas, Ganesh Goudar, Geert
Uytterhoeven, Geetika Moolchandani, Greg Kurz, Guenter Roeck, Haren
Myneni, He Ying, Jiapeng Chong, Jordan Niethe, Laurent Dufour, Lee
Jones, Leonardo Bras, Li Huafei, Madhavan Srinivasan, Mahesh Salgaonkar,
Masahiro Yamada, Nathan Chancellor, Nathan Lynch, Nicholas Piggin,
Oliver O'Halloran, Paul Menzel, Pu Lehui, Randy Dunlap, Ravi Bangoria,
Rosen Penev, Russell Currey, Santosh Sivaraj, Sebastian Andrzej Siewior,
Segher Boessenkool, Shivaprasad G Bhat, Srikar Dronamraju, Stephen
Rothwell, Thadeu Lima de Souza Cascardo, Thomas Gleixner, Tony Ambardar,
Tyrel Datwyler, Vaibhav Jain, Vincenzo Frascino, Xiongwei Song, Yang Li,
Yu Kuai, and Zhang Yunkai.
* tag 'powerpc-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (302 commits)
powerpc/signal32: Fix erroneous SIGSEGV on RT signal return
powerpc: Avoid clang uninitialized warning in __get_user_size_allowed
powerpc/papr_scm: Mark nvdimm as unarmed if needed during probe
powerpc/kvm: Fix build error when PPC_MEM_KEYS/PPC_PSERIES=n
powerpc/kasan: Fix shadow start address with modules
powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
powerpc/44x: fix spelling mistake in Kconfig "varients" -> "variants"
powerpc/iommu: Annotate nested lock for lockdep
powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
powerpc/iommu: Allocate it_map by vmalloc
selftests/powerpc: remove unneeded semicolon
powerpc/64s: remove unneeded semicolon
powerpc/eeh: remove unneeded semicolon
powerpc/selftests: Add selftest to test concurrent perf/ptrace events
powerpc/selftests/perf-hwbreak: Add testcases for 2nd DAWR
powerpc/selftests/perf-hwbreak: Coalesce event creation code
powerpc/selftests/ptrace-hwbreak: Add testcases for 2nd DAWR
powerpc/configs: Add IBMVNIC to some 64-bit configs
selftests/powerpc: Add uaccess flush test
...
If an architecture doesn't support a particular page table level as a huge
vmap page size then allow it to skip defining the support query function.
Link: https://lkml.kernel.org/r/20210317062402.533919-11-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows unsupported levels to be constant folded away, and so
p4d_free_pud_page can be removed because it's no longer linked to.
Link: https://lkml.kernel.org/r/20210317062402.533919-10-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.
This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.
This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).
Link: https://lkml.kernel.org/r/20210317062402.533919-7-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Implement concurrent TLB flushes, which overlaps the local TLB flush with the
remote TLB flush. In testing this improved sysbench performance measurably by
a couple of percentage points, especially if TLB-heavy security mitigations
are active.
- Further micro-optimizations to improve the performance of TLB flushes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCKbNcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hjYBAAsyNUa/gOu0g6/Cx8R86w9HtHHmm5vso/
6nJjWj2fd2qJ9JShlddxvXEMeXtPTYabVWQkiiriFMuofk6JeKnlHm1Jzl6keABX
OQFwjIFeNASPRcdXvuuYPOVWAJJdr2oL9QUr6OOK1ccQJTz/Cd0zA+VQ5YqcsCon
yaWbkxELwKXpgql+qt66eAZ6Q2Y1TKXyrTW7ZgxQi0yeeWqMaEOub0/oyS7Ax1Rg
qEJMwm1prb76NPzeqR/G3e4KTrDZfQ/B/KnSsz36GTJpl4eye6XqWDUgm1nAGNIc
5dbc4Vx7JtZsUOuC0AmzWb3hsDyzVcN/lQvijdZ2RsYR3gvuYGaBhKqExqV0XH6P
oqaWOKWCz+LqWbsgJmxCpqkt1LZl5+VUOcfJ97WkIS7DyIPtSHTzQXbBMZqKLeat
mn5UcKYB2Gi7wsUPv6VC2ChKbDqN0VT8G86XbYylGo4BE46KoZKPUNY/QWKLUPd6
0UKcVeNM2HFyf1C73p/tO/z7hzu3qLuMMnsphP6/c2pKLpdgawEXgbnVKNId1B/c
NrzyhTvVaMt+Um28bBRhHONIlzPJwWcnZbdY7NqMnu+LBKQ68cL/h4FOIV/RDLNb
GJLgfAr8fIw/zIpqYuFHiiMNo9wWqVtZko1MvXhGceXUL69QuzTra2XR/6aDxkPf
6gQVesetTvo=
=3Cyp
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 tlb updates from Ingo Molnar:
"The x86 MM changes in this cycle were:
- Implement concurrent TLB flushes, which overlaps the local TLB
flush with the remote TLB flush.
In testing this improved sysbench performance measurably by a
couple of percentage points, especially if TLB-heavy security
mitigations are active.
- Further micro-optimizations to improve the performance of TLB
flushes"
* tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Micro-optimize smp_call_function_many_cond()
smp: Inline on_each_cpu_cond() and on_each_cpu()
x86/mm/tlb: Remove unnecessary uses of the inline keyword
cpumask: Mark functions as pure
x86/mm/tlb: Do not make is_lazy dirty for no reason
x86/mm/tlb: Privatize cpu_tlbstate
x86/mm/tlb: Flush remote and local TLBs concurrently
x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
smp: Run functions concurrently in smp_call_function_many_cond()
- Refactoring powerpc and arm64 kexec DT handling to common code. This
enables IMA on arm64.
- Add kbuild support for applying DT overlays at build time. The first
user are the DT unittests.
- Fix kerneldoc formatting and W=1 warnings in drivers/of/
- Fix handling 64-bit flag on PCI resources
- Bump dtschema version required to v2021.2.1
- Enable undocumented compatible checks for dtbs_check. This allows
tracking of missing binding schemas.
- DT docs improvements. Regroup the DT docs and add the example schema
and DT kernel ABI docs to the doc build.
- Convert Broadcom Bluetooth and video-mux bindings to schema
- Add QCom sm8250 Venus video codec binding schema
- Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
Technologies Inc.
- Cleanup of DT schema type references on common properties and
standard unit properties
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmCIYdgQHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw/PKEACkOCWDnLSY9U7w1uGDHr6UgXIWOY9j8bYy
2pTvDrVa6KZphT6yGU/hxrOk8Mqh5AMd2vUhO2OCoyyl/priTv+Ktqo+bikvJZLa
MQm3JnrLpPy/GetdmVD8wq1l+FoeOSTnRIJqRxInsd8UFVpZImtP22ELox6KgGiv
keVHIrjsHU/HpafK3w8wHCLikCZk+1Gl6pL/QgFDv2FaaCTKW16Dt64dPqYm49Xk
j7YMMQWl+3NJ9ywZV0+PMbl9udi3EjGm5Ap5VfKzpj53Nh07QObg/QtH/1sj0HPo
apyW7jAyQFyLytbjxzFL/tljtOeW/5rZos1GWThZ326e+Y0mTKUTDZShvNplfjIf
e26FvVi7gndWlRSr30Ia5gdNFAx72IkpJUAuypBXgb+qNPchBJjAXLn9tcIcg/k+
2R6BIB7SkVLpgTnJ1Bq1+PRqkKM+ggACdJNJIUApj44xoiG01vtGDGRaFuIio+Ch
HT4aBbic4kLvagm8VzuiIF/sL7af5pntzArcyOfQTaZ92DyGI2C0j90rK3yPRIYM
u9qX/24t1SXiUji74QpoQFzt/+Egy5hYXMJOJJSywUjKf7DBhehqklTjiJRQHKm6
0DJ/n8q4lNru8F0Y4keKSuYTfHBstF7fS3UTH/rUmBAbfEwkvZe6B29KQbs+7aph
GTw+jeoR5Q==
=rF27
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree updates from Rob Herring:
- Refactor powerpc and arm64 kexec DT handling to common code. This
enables IMA on arm64.
- Add kbuild support for applying DT overlays at build time. The first
user are the DT unittests.
- Fix kerneldoc formatting and W=1 warnings in drivers/of/
- Fix handling 64-bit flag on PCI resources
- Bump dtschema version required to v2021.2.1
- Enable undocumented compatible checks for dtbs_check. This allows
tracking of missing binding schemas.
- DT docs improvements. Regroup the DT docs and add the example schema
and DT kernel ABI docs to the doc build.
- Convert Broadcom Bluetooth and video-mux bindings to schema
- Add QCom sm8250 Venus video codec binding schema
- Add vendor prefixes for AESOP, YIC System Co., Ltd, and Siliconfile
Technologies Inc.
- Cleanup of DT schema type references on common properties and
standard unit properties
* tag 'devicetree-for-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (64 commits)
powerpc: If kexec_build_elf_info() fails return immediately from elf64_load()
powerpc: Free fdt on error in elf64_load()
of: overlay: Fix kerneldoc warning in of_overlay_remove()
of: linux/of.h: fix kernel-doc warnings
of/pci: Add IORESOURCE_MEM_64 to resource flags for 64-bit memory addresses
dt-bindings: bcm4329-fmac: add optional brcm,ccode-map
docs: dt: update writing-schema.rst references
dt-bindings: media: venus: Add sm8250 dt schema
of: base: Fix spelling issue with function param 'prop'
docs: dt: Add DT API documentation
of: Add missing 'Return' section in kerneldoc comments
of: Fix kerneldoc output formatting
docs: dt: Group DT docs into relevant sub-sections
docs: dt: Make 'Devicetree' wording more consistent
docs: dt: writing-schema: Include the example schema in the doc build
docs: dt: writing-schema: Remove spurious indentation
dt-bindings: Fix reference in submitting-patches.rst to the DT ABI doc
dt-bindings: ddr: Add optional manufacturer and revision ID to LPDDR3
dt-bindings: media: video-interfaces: Drop the example
devicetree: bindings: clock: Minor typo fix in the file armada3700-tbg-clock.txt
...
- Improve Intel uncore PMU support:
- Parse uncore 'discovery tables' - a new hardware capability enumeration method
introduced on the latest Intel platforms. This table is in a well-defined PCI
namespace location and is read via MMIO. It is organized in an rbtree.
These uncore tables will allow the discovery of standard counter blocks, but
fancier counters still need to be enumerated explicitly.
- Add Alder Lake support
- Improve IIO stacks to PMON mapping support on Skylake servers
- Add Intel Alder Lake PMU support - which requires the introduction of 'hybrid' CPUs
and PMUs. Alder Lake is a mix of Golden Cove ('big') and Gracemont ('small' - Atom derived)
cores.
The CPU-side feature set is entirely symmetrical - but on the PMU side there's
core type dependent PMU functionality.
- Reduce data loss with CPU level hardware tracing on Intel PT / AUX profiling, by
fixing the AUX allocation watermark logic.
- Improve ring buffer allocation on NUMA systems
- Put 'struct perf_event' into their separate kmem_cache pool
- Add support for synchronous signals for select perf events. The immediate motivation
is to support low-overhead sampling-based race detection for user-space code. The
feature consists of the following main changes:
- Add thread-only event inheritance via perf_event_attr::inherit_thread, which limits
inheritance of events to CLONE_THREAD.
- Add the ability for events to not leak through exec(), via perf_event_attr::remove_on_exec.
- Allow the generation of SIGTRAP via perf_event_attr::sigtrap, extend siginfo with an u64
::si_perf, and add the breakpoint information to ::si_addr and ::si_perf if the event is
PERF_TYPE_BREAKPOINT.
The siginfo support is adequate for breakpoints right now - but the new field can be used
to introduce support for other types of metadata passed over siginfo as well.
- Misc fixes, cleanups and smaller updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJGpERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j9zBAAuVbG2snV6SBSdXLhQcM66N3NckOXvSY5
QjjhQcuwJQEK/NJB3266K5d8qSmdyRBsWf3GCsrmyBT67P1V28K44Pu7oCV0UDtf
mpVRjEP0oR7hNsANSSgo8Fa4ZD7H5waX7dK7925Tvw8By3mMoZoddiD/84WJHhxO
NDF+GRFaRj+/dpbhV8cdCoXTjYdkC36vYuZs3b9lu0tS9D/AJgsNy7TinLvO02Cs
5peP+2y29dgvCXiGBiuJtEA6JyGnX3nUJCvfOZZ/DWDc3fdduARlRrc5Aiq4n/wY
UdSkw1VTZBlZ1wMSdmHQVeC5RIH3uWUtRoNqy0Yc90lBm55AQ0EENwIfWDUDC5zy
USdBqWTNWKMBxlEilUIyqKPQK8LW/31TRzqy8BWKPNcZt5yP5YS1SjAJRDDjSwL/
I+OBw1vjLJamYh8oNiD5b+VLqNQba81jFASfv+HVWcULumnY6ImECCpkg289Fkpi
BVR065boifJDlyENXFbvTxyMBXQsZfA+EhtxG7ju2Ni+TokBbogyCb3L2injPt9g
7jjtTOqmfad4gX1WSc+215iYZMkgECcUd9E+BfOseEjBohqlo7yNKIfYnT8mE/Xq
nb7eHjyvLiE8tRtZ+7SjsujOMHv9LhWFAbSaxU/kEVzpkp0zyd6mnnslDKaaHLhz
goUMOL/D0lg=
=NhQ7
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event updates from Ingo Molnar:
- Improve Intel uncore PMU support:
- Parse uncore 'discovery tables' - a new hardware capability
enumeration method introduced on the latest Intel platforms. This
table is in a well-defined PCI namespace location and is read via
MMIO. It is organized in an rbtree.
These uncore tables will allow the discovery of standard counter
blocks, but fancier counters still need to be enumerated
explicitly.
- Add Alder Lake support
- Improve IIO stacks to PMON mapping support on Skylake servers
- Add Intel Alder Lake PMU support - which requires the introduction of
'hybrid' CPUs and PMUs. Alder Lake is a mix of Golden Cove ('big')
and Gracemont ('small' - Atom derived) cores.
The CPU-side feature set is entirely symmetrical - but on the PMU
side there's core type dependent PMU functionality.
- Reduce data loss with CPU level hardware tracing on Intel PT / AUX
profiling, by fixing the AUX allocation watermark logic.
- Improve ring buffer allocation on NUMA systems
- Put 'struct perf_event' into their separate kmem_cache pool
- Add support for synchronous signals for select perf events. The
immediate motivation is to support low-overhead sampling-based race
detection for user-space code. The feature consists of the following
main changes:
- Add thread-only event inheritance via
perf_event_attr::inherit_thread, which limits inheritance of
events to CLONE_THREAD.
- Add the ability for events to not leak through exec(), via
perf_event_attr::remove_on_exec.
- Allow the generation of SIGTRAP via perf_event_attr::sigtrap,
extend siginfo with an u64 ::si_perf, and add the breakpoint
information to ::si_addr and ::si_perf if the event is
PERF_TYPE_BREAKPOINT.
The siginfo support is adequate for breakpoints right now - but the
new field can be used to introduce support for other types of
metadata passed over siginfo as well.
- Misc fixes, cleanups and smaller updates.
* tag 'perf-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
signal, perf: Add missing TRAP_PERF case in siginfo_layout()
signal, perf: Fix siginfo_t by avoiding u64 on 32-bit architectures
perf/x86: Allow for 8<num_fixed_counters<16
perf/x86/rapl: Add support for Intel Alder Lake
perf/x86/cstate: Add Alder Lake CPU support
perf/x86/msr: Add Alder Lake CPU support
perf/x86/intel/uncore: Add Alder Lake support
perf: Extend PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE
perf/x86/intel: Add Alder Lake Hybrid support
perf/x86: Support filter_match callback
perf/x86/intel: Add attr_update for Hybrid PMUs
perf/x86: Add structures for the attributes of Hybrid PMUs
perf/x86: Register hybrid PMUs
perf/x86: Factor out x86_pmu_show_pmu_cap
perf/x86: Remove temporary pmu assignment in event_init
perf/x86/intel: Factor out intel_pmu_check_extra_regs
perf/x86/intel: Factor out intel_pmu_check_event_constraints
perf/x86/intel: Factor out intel_pmu_check_num_counters
perf/x86: Hybrid PMU support for extra_regs
perf/x86: Hybrid PMU support for event constraints
...
- rtmutex cleanup & spring cleaning pass that removes ~400 lines of code
- Futex simplifications & cleanups
- Add debugging to the CSD code, to help track down a tenacious race (or hw problem)
- Add lockdep_assert_not_held(), to allow code to require a lock to not be held,
and propagate this into the ath10k driver
- Misc LKMM documentation updates
- Misc KCSAN updates: cleanups & documentation updates
- Misc fixes and cleanups
- Fix locktorture bugs with ww_mutexes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmCJDn0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hPrRAAryS4zPnuDsfkVk0smxo7a0lK5ljbH2Xo
28QUZXOl6upnEV8dzbjwG7eAjt5ZJVI5tKIeG0PV0NUJH2nsyHwESdtULGGYuPf/
4YUzNwZJa+nI/jeBnVsXCimLVxxnNCRdR7yOVOHm4ukEwa+YTNt1pvlYRmUd4YyH
Q5cCrpb3THvLka3AAamEbqnHnAdGxHKuuHYVRkODpMQ+zrQvtN8antYsuk8kJsqM
m+GZg/dVCuLEPah5k+lOACtcq/w7HCmTlxS8t4XLvD52jywFZLcCPvi1rk0+JR+k
Vd9TngC09GJ4jXuDpr42YKkU9/X6qy2Es39iA/ozCvc1Alrhspx/59XmaVSuWQGo
XYuEPx38Yuo/6w16haSgp0k4WSay15A4uhCTQ75VF4vli8Bqgg9PaxLyQH1uG8e2
xk8U90R7bDzLlhKYIx1Vu5Z0t7A1JtB5CJtgpcfg/zQLlzygo75fHzdAiU5fDBDm
3QQXSU2Oqzt7c5ZypioHWazARk7tL6th38KGN1gZDTm5zwifpaCtHi7sml6hhZ/4
ATH6zEPzIbXJL2UqumSli6H4ye5ORNjOu32r7YPqLI4IDbzpssfoSwfKYlQG4Tvn
4H1Ukirzni0gz5+wbleItzf2aeo1rocs4YQTnaT02j8NmUHUz4AzOHGOQFr5Tvh0
wk/P4MIoSb0=
=cOOk
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- rtmutex cleanup & spring cleaning pass that removes ~400 lines of
code
- Futex simplifications & cleanups
- Add debugging to the CSD code, to help track down a tenacious race
(or hw problem)
- Add lockdep_assert_not_held(), to allow code to require a lock to not
be held, and propagate this into the ath10k driver
- Misc LKMM documentation updates
- Misc KCSAN updates: cleanups & documentation updates
- Misc fixes and cleanups
- Fix locktorture bugs with ww_mutexes
* tag 'locking-core-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
kcsan: Fix printk format string
static_call: Relax static_call_update() function argument type
static_call: Fix unused variable warn w/o MODULE
locking/rtmutex: Clean up signal handling in __rt_mutex_slowlock()
locking/rtmutex: Restrict the trylock WARN_ON() to debug
locking/rtmutex: Fix misleading comment in rt_mutex_postunlock()
locking/rtmutex: Consolidate the fast/slowpath invocation
locking/rtmutex: Make text section and inlining consistent
locking/rtmutex: Move debug functions as inlines into common header
locking/rtmutex: Decrapify __rt_mutex_init()
locking/rtmutex: Remove pointless CONFIG_RT_MUTEXES=n stubs
locking/rtmutex: Inline chainwalk depth check
locking/rtmutex: Move rt_mutex_debug_task_free() to rtmutex.c
locking/rtmutex: Remove empty and unused debug stubs
locking/rtmutex: Consolidate rt_mutex_init()
locking/rtmutex: Remove output from deadlock detector
locking/rtmutex: Remove rtmutex deadlock tester leftovers
locking/rtmutex: Remove rt_mutex_timed_lock()
MAINTAINERS: Add myself as futex reviewer
locking/mutex: Remove repeated declaration
...
gets rid of the LAZY_GS stuff and a lot of code.
- Add an insn_decode() API which all users of the instruction decoder
should preferrably use. Its goal is to keep the details of the
instruction decoder away from its users and simplify and streamline how
one decodes insns in the kernel. Convert its users to it.
- kprobes improvements and fixes
- Set the maximum DIE per package variable on Hygon
- Rip out the dynamic NOP selection and simplify all the machinery around
selecting NOPs. Use the simplified NOPs in objtool now too.
- Add Xeon Sapphire Rapids to list of CPUs that support PPIN
- Simplify the retpolines by folding the entire thing into an
alternative now that objtool can handle alternatives with stack
ops. Then, have objtool rewrite the call to the retpoline with the
alternative which then will get patched at boot time.
- Document Intel uarch per models in intel-family.h
- Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
exception on Intel.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCHyJQACgkQEsHwGGHe
VUpjiRAAwPZdwwp08ypZuMHR4EhLNru6gYhbAoALGgtYnQjLtn5onQhIeieK+R4L
cmZpxHT9OFp5dXHk4kwygaQBsD4pPOiIpm60kye1dN3cSbOORRdkwEoQMpKMZ+5Y
kvVsmn7lrwRbp600KdE4G6L5+N6gEgr0r6fMFWWGK3mgVAyCzPexVHgydcp131ch
iYMo6/pPDcNkcV/hboVKgx7GISdQ7L356L1MAIW/Sxtw6uD/X4qGYW+kV2OQg9+t
nQDaAo7a8Jqlop5W5TQUdMLKQZ1xK8SFOSX/nTS15DZIOBQOGgXR7Xjywn1chBH/
PHLwM5s4XF6NT5VlIA8tXNZjWIZTiBdldr1kJAmdDYacrtZVs2LWSOC0ilXsd08Z
EWtvcpHfHEqcuYJlcdALuXY8xDWqf6Q2F7BeadEBAxwnnBg+pAEoLXI/1UwWcmsj
wpaZTCorhJpYo2pxXckVdHz2z0LldDCNOXOjjaWU8tyaOBKEK6MgAaYU7e0yyENv
mVc9n5+WuvXuivC6EdZ94Pcr/KQsd09ezpJYcVfMDGv58YZrb6XIEELAJIBTu2/B
Ua8QApgRgetx+1FKb8X6eGjPl0p40qjD381TADb4rgETPb1AgKaQflmrSTIik+7p
O+Eo/4x/GdIi9jFk3K+j4mIznRbUX0cheTJgXoiI4zXML9Jv94w=
=bm4S
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 updates from Borislav Petkov:
- Turn the stack canary into a normal __percpu variable on 32-bit which
gets rid of the LAZY_GS stuff and a lot of code.
- Add an insn_decode() API which all users of the instruction decoder
should preferrably use. Its goal is to keep the details of the
instruction decoder away from its users and simplify and streamline
how one decodes insns in the kernel. Convert its users to it.
- kprobes improvements and fixes
- Set the maximum DIE per package variable on Hygon
- Rip out the dynamic NOP selection and simplify all the machinery
around selecting NOPs. Use the simplified NOPs in objtool now too.
- Add Xeon Sapphire Rapids to list of CPUs that support PPIN
- Simplify the retpolines by folding the entire thing into an
alternative now that objtool can handle alternatives with stack ops.
Then, have objtool rewrite the call to the retpoline with the
alternative which then will get patched at boot time.
- Document Intel uarch per models in intel-family.h
- Make Sub-NUMA Clustering topology the default and Cluster-on-Die the
exception on Intel.
* tag 'x86_core_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
x86, sched: Treat Intel SNC topology as default, COD as exception
x86/cpu: Comment Skylake server stepping too
x86/cpu: Resort and comment Intel models
objtool/x86: Rewrite retpoline thunk calls
objtool: Skip magical retpoline .altinstr_replacement
objtool: Cache instruction relocs
objtool: Keep track of retpoline call sites
objtool: Add elf_create_undef_symbol()
objtool: Extract elf_symbol_add()
objtool: Extract elf_strtab_concat()
objtool: Create reloc sections implicitly
objtool: Add elf_create_reloc() helper
objtool: Rework the elf_rebuild_reloc_section() logic
objtool: Fix static_call list generation
objtool: Handle per arch retpoline naming
objtool: Correctly handle retpoline thunk calls
x86/retpoline: Simplify retpolines
x86/alternatives: Optimize optimize_nops()
x86: Add insn_decode_kernel()
x86/kprobes: Move 'inline' to the beginning of the kprobe_is_ss() declaration
...
Newer CPUs provide a second mechanism to detect operations with lock
prefix which go accross a cache line boundary. Such operations have to
take bus lock which causes a system wide performance degradation when
these operations happen frequently.
The new mechanism is not using the #AC exception. It triggers #DB and is
restricted to operations in user space. Kernel side split lock access can
only be detected by the #AC based variant. Contrary to the #AC based
mechanism the #DB based variant triggers _after_ the instruction was
executed. The mechanism is CPUID enumerated and contrary to the #AC
version which is based on the magic TEST_CTRL_MSR and model/family based
enumeration on the way to become architectural.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGkr8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodUKD/9tUXhInR7+1ykEHpMvdmSp48vqY3nc
sKmT22pPl+OchnJ62mw3T8gKpBYVleJmcCaY2qVx7hfaVcWApLGJvX4tmfXmv422
XDSJ6b8Os6wfgx5FR//I17z8ZtXnnuKkPrTMoRsQUw2qLq31y6fdQv+GW/cc1Kpw
mengjmPE+HnpaKbtuQfPdc4a+UvLjvzBMAlDZPTBPKYrP4FFqYVnUVwyTg5aLVDY
gHz4V8+b502RS/zPfTAtE3J848od+NmcUPdFlcG9DVA+hR0Rl0thvruCTFiD2vVh
i9DJ7INof5FoJDEzh0dGsD7x+MB6OY8GZyHdUMeGgIRPtWkqrG52feQQIn2YYlaL
fB3DlpNv7NIJ/0JMlALvh8S0tEoOcYdHqH+M/3K/zbzecg/FAo+lVo8WciGLPqWs
ykUG5/f/OnlTvgB8po1ebJu0h0jHnoK9heWWXk9zWIRVDPXHFOWKW3kSbTTb3icR
9hfjP/SNejpmt9Ju1OTwsgnV7NALIdVX+G5jyIEsjFl31Co1RZNYhHLFvi11FWlQ
/ssvFK9O5ZkliocGCAN9+yuOnM26VqWSCE4fis6/2aSgD2Y4Gpvb//cP96SrcNAH
u8eXNvGLlniJP3F3JImWIfIPQTrpvQhcU4eZ6NtviXqj/utQXX6c9PZ1PLYpcvUh
9AWF8rwhT8X4oA==
=lmi8
-----END PGP SIGNATURE-----
Merge tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 bus lock detection updates from Thomas Gleixner:
"Support for enhanced split lock detection:
Newer CPUs provide a second mechanism to detect operations with lock
prefix which go accross a cache line boundary. Such operations have to
take bus lock which causes a system wide performance degradation when
these operations happen frequently.
The new mechanism is not using the #AC exception. It triggers #DB and
is restricted to operations in user space. Kernel side split lock
access can only be detected by the #AC based variant.
Contrary to the #AC based mechanism the #DB based variant triggers
_after_ the instruction was executed. The mechanism is CPUID
enumerated and contrary to the #AC version which is based on the magic
TEST_CTRL_MSR and model/family based enumeration on the way to become
architectural"
* tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/admin-guide: Change doc for split_lock_detect parameter
x86/traps: Handle #DB for bus lock
x86/cpufeatures: Enumerate #DB for bus lock detection
Provide support for randomized stack offsets per syscall to make
stack-based attacks harder which rely on the deterministic stack layout.
The feature is based on the original idea of PaX's RANDSTACK feature, but
uses a significantly different implementation.
The offset does not affect the pt_regs location on the task stack as this
was agreed on to be of dubious value. The offset is applied before the
actual syscall is invoked.
The offset is stored per cpu and the randomization happens at the end of
the syscall which is less predictable than on syscall entry.
The mechanism to apply the offset is via alloca(), i.e. abusing the
dispised VLAs. This comes with the drawback that stack-clash-protection
has to be disabled for the affected compilation units and there is also
a negative interaction with stack-protector.
Those downsides are traded with the advantage that this approach does not
require any intrusive changes to the low level assembly entry code, does
not affect the unwinder and the correct stack alignment is handled
automatically by the compiler.
The feature is guarded with a static branch which avoids the overhead when
disabled.
Currently this is supported for X86 and ARM64.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGjz8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoWsvD/4tGnPAurd6lbzxWzRjW7jOOVyzkODM
UXtIxxICaj7o6MNcloaGe1QtJ8+QOCw3yPQfLG/SoWHse5+oUKQRL9dmWVeJyRSt
JZ1pirkKqWrB+OmPbJKUiO3/TsZ2Z/vO41JVgVTL5/HWhOECSDzZsJkuvF/H+qYD
ReDzd7FUNd76pwVOsXq/cxXclRa81/wMNZRVwmyAwFYE2XoPtQyTERTLrfj6aQKF
P0txr9fEjYlPPwYOk1kjBAoJfDltNm48BBL7CGZtRlsqpNpdsJ1MkeGffhodb6F0
pJYQMlQJHXABZb5GF+v93+iASDpRFn0EvPmLkCxQUfZYLOkRsnuEF2S/fsYX/WPo
uin/wQKwLVdeQq9d9BwlZUKEgsQuV7Q0GVN+JnEQerwD6cWTxv4a1RIUH+K/4Wo5
nTeJVRKcs6m7UkGQRm8JbqnUP0vCV+PSiWWB8J9CmjYeCPbkGjt6mBIsmPaDZ9VL
4i+UX5DJayoREF/rspOBcJftUmExize49p9860UI9N6fd7DsDt7Dq9Ai+ADtZa4C
9BPbF4NWzJq8IWLqBi+PpKBAT3JMX9qQi7s9sbrRxpxtew9Keu5qggKZJYumX71V
qgUMk+xB86HZOrtF6F3oY0zxYv3haPvDydsDgqojtqNGk4PdAdgDYJQwMlb8QSly
SwIWPHIfvP4R9w==
=GMlJ
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull entry code update from Thomas Gleixner:
"Provide support for randomized stack offsets per syscall to make
stack-based attacks harder which rely on the deterministic stack
layout.
The feature is based on the original idea of PaX's RANDSTACK feature,
but uses a significantly different implementation.
The offset does not affect the pt_regs location on the task stack as
this was agreed on to be of dubious value. The offset is applied
before the actual syscall is invoked.
The offset is stored per cpu and the randomization happens at the end
of the syscall which is less predictable than on syscall entry.
The mechanism to apply the offset is via alloca(), i.e. abusing the
dispised VLAs. This comes with the drawback that
stack-clash-protection has to be disabled for the affected compilation
units and there is also a negative interaction with stack-protector.
Those downsides are traded with the advantage that this approach does
not require any intrusive changes to the low level assembly entry
code, does not affect the unwinder and the correct stack alignment is
handled automatically by the compiler.
The feature is guarded with a static branch which avoids the overhead
when disabled.
Currently this is supported for X86 and ARM64"
* tag 'x86-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64: entry: Enable random_kstack_offset support
lkdtm: Add REPORT_STACK for checking stack offsets
x86/entry: Enable random_kstack_offset support
stack: Optionally randomize kernel stack offset each syscall
init_on_alloc: Optimize static branches
jump_label: Provide CONFIG-driven build state defaults
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
+HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
/uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
=bO1i
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Borislav Petkov:
"Trivial cleanups and fixes all over the place"
* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove me from IDE/ATAPI section
x86/pat: Do not compile stubbed functions when X86_PAT is off
x86/asm: Ensure asm/proto.h can be included stand-alone
x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
x86/msr: Make locally used functions static
x86/cacheinfo: Remove unneeded dead-store initialization
x86/process/64: Move cpu_current_top_of_stack out of TSS
tools/turbostat: Unmark non-kernel-doc comment
x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
x86/fpu/math-emu: Fix function cast warning
x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
x86: Fix various typos in comments, take #2
x86: Remove unusual Unicode characters from comments
x86/kaslr: Return boolean values from a function returning bool
x86: Fix various typos in comments
x86/setup: Remove unused RESERVE_BRK_ARRAY()
stacktrace: Move documentation for arch_stack_walk_reliable() to header
x86: Remove duplicate TSC DEADLINE MSR definitions
Christopherson, Kai Huang and Jarkko Sakkinen. Along with the usual
fixes, cleanups and improvements.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGlgYACgkQEsHwGGHe
VUqbYA/+IgX7uBkATndzTBL6l/D3QQaMRUkOk5nO9sOzQaYJ/Qwarfakax61CZrl
dZFdF07T/kSpMXQ6HIjzEaRx6j12xMYksrm8xBBSfXjtkIYu4auVloX2ldKhHwaK
OyiKS+R0O/Q7XvozEiPsQCf7XwraZFO+iMJ0jMxbPO7ZvxDXDBv0Fx3d9yzPx9Qg
BbJuIEKMoFPR3P39CWw0cOXr12Z9mmFReBKoSV4dZbZMRmv7FrA/Qlc+uS+RNZFK
/5sCn7x27qVx8Ha/Lh42kQf+yqv1l3437aqmG2vAbHQPmnbDmBeApZ6jhaoX3jhD
9ylkcpWFFf26oSbYAdmztZENLXRWLH6OIPxtmbf2HMsROiNR/cV0s4d2aduN/dHz
s1VnaDFayoub9CPWtiv0RJJnwmB6d+wF2JbQGh+kPZMX3VaxVPwTVLWQdsAVaB8Y
y7A2vZeWWHvP1a7ATbTFRDlTKKV3qDpMTD1B+hFELLNjMvyDU5c/1GhrIh0o1Jo3
jGrauylSInMxDkpDTDhQqU+/CSnV03zdzq1DSzxgig2Q0Es6pKxQHbL0honTf0GJ
l+8nefsQqRguZ1rVeuuSYvGPF++eqfyOiTZgN4fWdtZWJKMabsPNUbc4U3sP0/Sn
oe3Ixo2F41E9++MODF1G80DKLD/mVLYxdzC91suOmgfB2gbRhSg=
=KFYo
-----END PGP SIGNATURE-----
Merge tag 'x86_sgx_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Borislav Petkov:
"Add the guest side of SGX support in KVM guests. Work by Sean
Christopherson, Kai Huang and Jarkko Sakkinen.
Along with the usual fixes, cleanups and improvements"
* tag 'x86_sgx_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/sgx: Mark sgx_vepc_vm_ops static
x86/sgx: Do not update sgx_nr_free_pages in sgx_setup_epc_section()
x86/sgx: Move provisioning device creation out of SGX driver
x86/sgx: Add helpers to expose ECREATE and EINIT to KVM
x86/sgx: Add helper to update SGX_LEPUBKEYHASHn MSRs
x86/sgx: Add encls_faulted() helper
x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)
x86/sgx: Move ENCLS leaf definitions to sgx.h
x86/sgx: Expose SGX architectural definitions to the kernel
x86/sgx: Initialize virtual EPC driver even when SGX driver is disabled
x86/cpu/intel: Allow SGX virtualization without Launch Control support
x86/sgx: Introduce virtual EPC for use by KVM guests
x86/sgx: Add SGX_CHILD_PRESENT hardware error code
x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
x86/cpufeatures: Add SGX1 and SGX2 sub-features
x86/cpufeatures: Make SGX_LC feature bit depend on SGX bit
x86/sgx: Remove unnecessary kmap() from sgx_ioc_enclave_init()
selftests/sgx: Use getauxval() to simplify test code
selftests/sgx: Improve error detection and messages
x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()
...
eliminate custom code patching. For that, the alternatives infra is
extended to accomodate paravirt's needs and, as a result, a lot of
paravirt patching code goes away, leading to a sizeable cleanup and
simplification. Work by Juergen Gross.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGiXQACgkQEsHwGGHe
VUocbw/+OkFzphK6zlNA8O3RJ24u2csXUWWUtpGlZ2220Nn/Bgyso2+fyg/NEeQg
EmEttaY3JG/riCDfHk5Xm2saeVtsbPXN4f0sJm/Io/djF7Cm03WS0eS0aA2Rnuca
MhmvvkrzYqZXAYVaxKkIH6sNlPgyXX7vDNPbTd/0ZCOb3ZKIyXwL+SaLatMCtE5o
ou7e8Bj8xPSwcaCyK6sqjrT6jdpPjoTrxxrwENW8AlRu5lCU1pIY03GGhARPVoEm
fWkZsIPn7DxhpyIqzJtEMX8EK1xN96E+NGkNuSAtJGP9HRb+3j5f4s3IUAfXiLXq
r7NecFw8zHhPKl9J0pPCiW7JvMrCMU5xGwyeUmmhKyK2BxwvvAC173ohgMlCfB2Q
FPIsQWemat17tSue8LIA8SmlSDQz6R+tTdUFT+vqmNV34PxOIEeSdV7HG8rs87Ec
dYB9ENUgXqI+h2t7atE68CpTLpWXzNDcq2olEsaEUXenky2hvsi+VxNkWpmlKQ3I
NOMU/AyH8oUzn5O0o3oxdPhDLmK5ItEFxjYjwrgLfKFQ+Y8vIMMq3LrKQGwOj+ZU
n9qC7JjOwDKZGjd3YqNNRhnXp+w0IJvUHbyr3vIAcp8ohQwEKgpUvpZzf/BKUvHh
nJgJSJ53GFJBbVOJMfgVq+JcFr+WO8MDKHaw6zWeCkivFZdSs4g=
=h+km
-----END PGP SIGNATURE-----
Merge tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 alternatives/paravirt updates from Borislav Petkov:
"First big cleanup to the paravirt infra to use alternatives and thus
eliminate custom code patching.
For that, the alternatives infrastructure is extended to accomodate
paravirt's needs and, as a result, a lot of paravirt patching code
goes away, leading to a sizeable cleanup and simplification.
Work by Juergen Gross"
* tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Have only one paravirt patch function
x86/paravirt: Switch functions with custom code to ALTERNATIVE
x86/paravirt: Add new PVOP_ALT* macros to support pvops in ALTERNATIVEs
x86/paravirt: Switch iret pvops to ALTERNATIVE
x86/paravirt: Simplify paravirt macros
x86/paravirt: Remove no longer needed 32-bit pvops cruft
x86/paravirt: Add new features for paravirt patching
x86/alternative: Use ALTERNATIVE_TERNARY() in _static_cpu_has()
x86/alternative: Support ALTERNATIVE_TERNARY
x86/alternative: Support not-feature
x86/paravirt: Switch time pvops functions to use static_call()
static_call: Add function to query current function
static_call: Move struct static_call_key definition to static_call_types.h
x86/alternative: Merge include files
x86/alternative: Drop unused feature parameter from ALTINSTR_REPLACEMENT()
Drop the sev_enabled flag and switch its one user over to sev_active().
sev_enabled was made redundant with the introduction of sev_status in
commit b57de6cd16 ("x86/sev-es: Add SEV-ES Feature Detection").
sev_enabled and sev_active() are guaranteed to be equivalent, as each is
true iff 'sev_status & MSR_AMD64_SEV_ENABLED' is true, and are only ever
written in tandem (ignoring compressed boot's version of sev_status).
Removing sev_enabled avoids confusion over whether it refers to the guest
or the host, and will also allow KVM to usurp "sev_enabled" for its own
purposes.
No functional change intended.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422021125.3417167-7-seanjc@google.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a capability for userspace to mirror SEV encryption context from
one vm to another. On our side, this is intended to support a
Migration Helper vCPU, but it can also be used generically to support
other in-guest workloads scheduled by the host. The intention is for
the primary guest and the mirror to have nearly identical memslots.
The primary benefits of this are that:
1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
can't accidentally clobber each other.
2) The VMs can have different memory-views, which is necessary for post-copy
migration (the migration vCPUs on the target need to read and write to
pages, when the primary guest would VMEXIT).
This does not change the threat model for AMD SEV. Any memory involved
is still owned by the primary guest and its initial state is still
attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
to circumvent SEV, they could achieve the same effect by simply attaching
a vCPU to the primary VM.
This patch deliberately leaves userspace in charge of the memslots for the
mirror, as it already has the power to mess with them in the primary guest.
This patch does not support SEV-ES (much less SNP), as it does not
handle handing off attested VMSAs to the mirror.
For additional context, we need a Migration Helper because SEV PSP
migration is far too slow for our live migration on its own. Using
an in-guest migrator lets us speed this up significantly.
Signed-off-by: Nathan Tempelman <natet@google.com>
Message-Id: <20210408223214.2582277-1-natet@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is not a consistent pattern for checking Hyper-V hypercall status.
Existing code uses a number of variants. The variants work, but a consistent
pattern would improve the readability of the code, and be more conformant
to what the Hyper-V TLFS says about hypercall status.
Implemented new helper functions hv_result(), hv_result_success(), and
hv_repcomp(). Changed the places where hv_do_hypercall() and related variants
are used to use the helper functions.
Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1618620183-9967-2-git-send-email-joseph.salisbury@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
This patch makes no functional changes. It simply moves hv_do_rep_hypercall()
out of arch/x86/include/asm/mshyperv.h and into asm-generic/mshyperv.h
hv_do_rep_hypercall() is architecture independent, so it makes sense that it
should be in the architecture independent mshyperv.h, not in the x86-specific
mshyperv.h.
This is done in preperation for a follow up patch which creates a consistent
pattern for checking Hyper-V hypercall status.
Signed-off-by: Joseph Salisbury <joseph.salisbury@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1618620183-9967-1-git-send-email-joseph.salisbury@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The variable st is being assigned a value that is never read and
it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Denis Efremov <efremov@linux.com>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20210415130020.1959951-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add an ECREATE handler that will be used to intercept ECREATE for the
purpose of enforcing and enclave's MISCSELECT, ATTRIBUTES and XFRM, i.e.
to allow userspace to restrict SGX features via CPUID. ECREATE will be
intercepted when any of the aforementioned masks diverges from hardware
in order to enforce the desired CPUID model, i.e. inject #GP if the
guest attempts to set a bit that hasn't been enumerated as allowed-1 in
CPUID.
Note, access to the PROVISIONKEY is not yet supported.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <c3a97684f1b71b4f4626a1fc3879472a95651725.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for handling VM-Exits that originate from a guest SGX
enclave. In SGX, an "enclave" is a new CPL3-only execution environment,
wherein the CPU and memory state is protected by hardware to make the
state inaccesible to code running outside of the enclave. When exiting
an enclave due to an asynchronous event (from the perspective of the
enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
is automatically saved and scrubbed (the CPU loads synthetic state), and
then reloaded when re-entering the enclave. E.g. after an instruction
based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
of the enclave instruction that trigered VM-Exit, but will instead point
to a RIP in the enclave's untrusted runtime (the guest userspace code
that coordinates entry/exit to/from the enclave).
To help a VMM recognize and handle exits from enclaves, SGX adds bits to
existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR. Define the
new architectural bits, and add a boolean to struct vcpu_vmx to cache
VMX_EXIT_REASON_FROM_ENCLAVE. Clear the bit in exit_reason so that
checks against exit_reason do not need to account for SGX, e.g.
"if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
KVM is a largely a passive observer of the new bits, e.g. KVM needs to
account for the bits when propagating information to a nested VMM, but
otherwise doesn't need to act differently for the majority of VM-Exits
from enclaves.
The one scenario that is directly impacted is emulation, which is for
all intents and purposes impossible[1] since KVM does not have access to
the RIP or instruction stream that triggered the VM-Exit. The inability
to emulate is a non-issue for KVM, as most instructions that might
trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
check. For the few instruction that conditionally #UD, KVM either never
sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
if the feature is not exposed to the guest in order to inject a #UD,
e.g. RDRAND_EXITING.
But, because it is still possible for a guest to trigger emulation,
e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
from an enclave. This is architecturally accurate for instruction
VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
to killing the VM. In practice, only broken or particularly stupid
guests should ever encounter this behavior.
Add a WARN in skip_emulated_instruction to detect any attempt to
modify the guest's RIP during an SGX enclave VM-Exit as all such flows
should either be unreachable or must handle exits from enclaves before
getting to skip_emulated_instruction.
[1] Impossible for all practical purposes. Not truly impossible
since KVM could implement some form of para-virtualization scheme.
[2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
CPL3, so we also don't need to worry about that interaction.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Page faults that are signaled by the SGX Enclave Page Cache Map (EPCM),
as opposed to the traditional IA32/EPT page tables, set an SGX bit in
the error code to indicate that the #PF was induced by SGX. KVM will
need to emulate this behavior as part of its trap-and-execute scheme for
virtualizing SGX Launch Control, e.g. to inject SGX-induced #PFs if
EINIT faults in the host, and to support live migration.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <e170c5175cb9f35f53218a7512c9e3db972b97a2.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_slot_largepage_remove_write_access() is decared but not used,
just remove it.
Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
Message-Id: <20210406063504.17552-1-zhukeqian1@huawei.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To analyze some performance issues with lock contention and scheduling,
it is nice to know when directed yield are successful or failing.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some platforms, e.g. Alder Lake, have hybrid architecture. Although most
PMU capabilities are the same, there are still some unique PMU
capabilities for different hybrid PMUs. Perf should register a dedicated
pmu for each hybrid PMU.
Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and
capabilities for each hybrid PMU.
The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the
architecture features which are available on all hybrid PMUs. The
architecture features are stored in the global x86_pmu.intel_cap.
For Alder Lake, the model-specific features are perf metrics and
PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap
should be 0 for these two features. Perf should not use the global
intel_cap to check the features on a hybrid system.
Add a dedicated intel_cap in the x86_hybrid_pmu to store the
model-specific capabilities. Use the dedicated intel_cap to replace
the global intel_cap for thse two features. The dedicated intel_cap
will be set in the following "Add Alder Lake Hybrid support" patch.
Add is_hybrid() to distinguish a hybrid system. ADL may have an
alternative configuration. With that configuration, the
X86_FEATURE_HYBRID_CPU is not set. Perf cannot rely on the feature bit.
Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system.
It will be assigned in the following "Add Alder Lake Hybrid support"
patch as well.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
On processors with Intel Hybrid Technology (i.e., one having more than
one type of CPU in the same package), all CPUs support the same
instruction set and enumerate the same features on CPUID. Thus, all
software can run on any CPU without restrictions. However, there may be
model-specific differences among types of CPUs. For instance, each type
of CPU may support a different number of performance counters. Also,
machine check error banks may be wired differently. Even though most
software will not care about these differences, kernel subsystems
dealing with these differences must know.
Add and expose a new helper function get_this_hybrid_cpu_type() to query
the type of the current hybrid CPU. The function will be used later in
the perf subsystem.
The Intel Software Developer's Manual defines the CPU type as 8-bit
identifier.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-3-git-send-email-kan.liang@linux.intel.com
Add feature enumeration to identify a processor with Intel Hybrid
Technology: one in which CPUs of more than one type are the same package.
On a hybrid processor, all CPUs support the same homogeneous (i.e.,
symmetric) instruction set. All CPUs enumerate the same features in CPUID.
Thus, software (user space and kernel) can run and migrate to any CPU in
the system as well as utilize any of the enumerated features without any
change or special provisions. The main difference among CPUs in a hybrid
processor are power and performance properties.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1618237865-33448-2-git-send-email-kan.liang@linux.intel.com
Protect the contents of the TDP MMU roots list with RCU in preparation
for a future patch which will allow the iterator macro to be used under
the MMU lock in read mode.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Yank out the hva-based MMU notifier APIs now that all architectures that
use the notifiers have moved to the gfn-based APIs.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store the supported bits into KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the prototypes for the MMU notifier callbacks out of arch code and
into common code. There is no benefit to having each arch replicate the
prototypes since any deviation from the invocation in common code will
explode.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the same reason as commit e876f0b69d ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.
As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/539c4204b1baa77c55f758904a1ea239abbc7a5c.1617209142.git.christophe.leroy@csgroup.eu
Fix:
../arch/x86/include/asm/proto.h:14:30: warning: ‘struct task_struct’ declared \
inside parameter list will not be visible outside of this definition or declaration
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2);
^~~~~~~~~~~
.../arch/x86/include/asm/proto.h:40:34: warning: ‘struct task_struct’ declared \
inside parameter list will not be visible outside of this definition or declaration
long do_arch_prctl_common(struct task_struct *task, int option,
^~~~~~~~~~~
if linux/sched.h hasn't be included previously. This fixes a build error
when this header is used outside of the kernel tree.
[ bp: Massage commit message. ]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/b76b4be3-cf66-f6b2-9a6c-3e7ef54f9845@web.de