Add a helper, kvm_get_filtered_xcr0(), to dedup code that needs to account
for XCR0 features that require explicit opt-in on a per-process basis. In
addition to documenting when KVM should/shouldn't consult
xstate_get_guest_group_perm(), the helper will also allow sanitizing the
filtered XCR0 to avoid enumerating architecturally illegal XCR0 values,
e.g. XTILE_CFG without XTILE_DATA.
No functional changes intended.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
[sean: rename helper, move to x86.h, massage changelog]
Reviewed-by: Aaron Lewis <aaronlewis@google.com>
Tested-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20230405004520.421768-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that KVM disallows changing feature MSRs, i.e. PERF_CAPABILITIES,
after running a vCPU, WARN and bug the VM if the PMU is refreshed after
the vCPU has run.
Note, KVM has disallowed CPUID updates after running a vCPU since commit
feb627e8d6 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN"), i.e.
PERF_CAPABILITIES was the only remaining way to trigger a PMU refresh
after KVM_RUN.
Cc: Like Xu <like.xu.linux@gmail.com>
Link: https://lore.kernel.org/r/20230311004618.920745-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disallow writes to feature MSRs after KVM_RUN to prevent userspace from
changing the vCPU model after running the vCPU. Similar to guest CPUID,
KVM uses feature MSRs to configure intercepts, determine what operations
are/aren't allowed, etc. Changing the capabilities while the vCPU is
active will at best yield unpredictable guest behavior, and at worst
could be dangerous to KVM.
Allow writing the current value, e.g. so that userspace can blindly set
all MSRs when emulating RESET, and unconditionally allow writes to
MSR_IA32_UCODE_REV so that userspace can emulate patch loads.
Special case the VMX MSRs to keep the generic list small, i.e. so that
KVM can do a linear walk of the generic list without incurring meaningful
overhead.
Cc: Like Xu <like.xu.linux@gmail.com>
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add VMX MSRs to the runtime list of feature MSRs by iterating over the
range of emulated MSRs instead of manually defining each MSR in the "all"
list. Using the range definition reduces the cost of emulating a new VMX
MSR, e.g. prevents forgetting to add an MSR to the list.
Extracting the VMX MSRs from the "all" list, which is a compile-time
constant, also shrinks the list to the point where the compiler can
heavily optimize code that iterates over the list.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename kvm_init_msr_list() to kvm_init_msr_lists() to clarify that it
initializes multiple lists: MSRs to save, emulated MSRs, and feature MSRs.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20230311004618.920745-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Virtualize FLUSH_L1D so that the guest can use the performant L1D flush
if one of the many mitigations might require a flush in the guest, e.g.
Linux provides an option to flush the L1D when switching mms.
Passthrough MSR_IA32_FLUSH_CMD for write when it's supported in hardware
and exposed to the guest, i.e. always let the guest write it directly if
FLUSH_L1D is fully supported.
Forward writes to hardware in host context on the off chance that KVM
ends up emulating a WRMSR, or in the really unlikely scenario where
userspace wants to force a flush. Restrict these forwarded WRMSRs to
the known command out of an abundance of caution. Passing through the
MSR means the guest can throw any and all values at hardware, but doing
so in host context is arguably a bit more dangerous.
Link: https://lkml.kernel.org/r/CALMp9eTt3xzAEoQ038bJQ9LN0ZOXrSWsN7xnNUD%2B0SS%3DWwF7Pg%40mail.gmail.com
Link: https://lore.kernel.org/all/20230201132905.549148-2-eesposit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322011440.2195485-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Dedup the handling of MSR_IA32_PRED_CMD across VMX and SVM by moving the
logic to kvm_set_msr_common(). Now that the MSR interception toggling is
handled as part of setting guest CPUID, the VMX and SVM paths are
identical.
Opportunistically massage the code to make it a wee bit denser.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20230322011440.2195485-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
smatch reports
arch/x86/kvm/x86.c:199:20: warning: symbol
'mitigate_smt_rsb' was not declared. Should it be static?
This variable is only used in one file so it should be static.
Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20230404010141.1913667-1-trix@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The 'longmode' field is a bit annoying as it blows an entire __u32 to
represent a boolean value. Since other architectures are looking to add
support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it
up.
Redefine the field (and the remaining padding) as a set of flags.
Preserve the existing ABI by using bit 0 to indicate if the guest was in
long mode and requiring that the remaining 31 bits must be zero.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
When introduced, IRQFD resampling worked on POWER8 with XICS. However
KVM on POWER9 has never implemented it - the compatibility mode code
("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native
XIVE mode does not handle INTx in KVM at all.
This moved the capability support advertising to platforms and stops
advertising it on XIVE, i.e. POWER9 and later.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When injecting an exception into a vCPU in Real Mode, suppress the error
code by clearing the flag that tracks whether the error code is valid, not
by clearing the error code itself. The "typo" was introduced by recent
fix for SVM's funky Paged Real Mode.
Opportunistically hoist the logic above the tracepoint so that the trace
is coherent with respect to what is actually injected (this was also the
behavior prior to the buggy commit).
Fixes: b97f074583 ("KVM: x86: determine if an exception has an error code only when injecting it.")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322143300.2209476-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear vcpu->mmio_needed when injecting an exception from the emulator to
squash a (legitimate) warning about vcpu->mmio_needed being true at the
start of KVM_RUN without a callback being registered to complete the
userspace MMIO exit. Suppressing the MMIO write exit is inarguably wrong
from an architectural perspective, but it is the least awful hack-a-fix
due to shortcomings in KVM's uAPI, not to mention that KVM already
suppresses MMIO writes in this scenario.
Outside of REP string instructions, KVM doesn't provide a way to resume
an instruction at the exact point where it was "interrupted" if said
instruction partially completed before encountering an MMIO access. For
MMIO reads, KVM immediately exits to userspace upon detecting MMIO as
userspace provides the to-be-read value in a buffer, and so KVM can safely
(more or less) restart the instruction from the beginning. When the
emulator re-encounters the MMIO read, KVM will service the MMIO by getting
the value from the buffer instead of exiting to userspace, i.e. KVM won't
put the vCPU into an infinite loop.
On an emulated MMIO write, KVM finishes the instruction before exiting to
userspace, as exiting immediately would ultimately hang the vCPU due to
the aforementioned shortcoming of KVM not being able to resume emulation
in the middle of an instruction.
For the vast majority of _emulated_ instructions, deferring the userspace
exit doesn't cause problems as very few x86 instructions (again ignoring
string operations) generate multiple writes. But for instructions that
generate multiple writes, e.g. PUSHA (multiple pushes onto the stack),
deferring the exit effectively results in only the final write triggering
an exit to userspace. KVM does support multiple MMIO "fragments", but
only for page splits; if an instruction performs multiple distinct MMIO
writes, the number of fragments gets reset when the next MMIO write comes
along and any previous MMIO writes are dropped.
Circling back to the warning, if a deferred MMIO write coincides with an
exception, e.g. in this case a #SS due to PUSHA underflowing the stack
after queueing a write to an MMIO page on a previous push, KVM injects
the exceptions and leaves the deferred MMIO pending without registering a
callback, thus triggering the splat.
Sweep the problem under the proverbial rug as dropping MMIO writes is not
unique to the exception scenario (see above), i.e. instructions like PUSHA
are fundamentally broken with respect to MMIO, and have been since KVM's
inception.
Reported-by: zhangjianguo <zhangjianguo18@huawei.com>
Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com
Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230322141220.2206241-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To be able to trace invocations of smp_send_reschedule(), rename the
arch-specific definitions of it to arch_smp_send_reschedule() and wrap it
into an smp_send_reschedule() that contains a tracepoint.
Changes to include the declaration of the tracepoint were driven by the
following coccinelle script:
@func_use@
@@
smp_send_reschedule(...);
@include@
@@
#include <trace/events/ipi.h>
@no_include depends on func_use && !include@
@@
#include <...>
+
+ #include <trace/events/ipi.h>
[csky bits]
[riscv bits]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
If !guest_cpuid_has(vcpu, X86_FEATURE_PCID), CR4.PCIDE would have been in
vcpu->arch.cr4_guest_rsvd_bits and failed earlier kvm_is_valid_cr4() check.
Remove this meaningless check.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Fixes: 4683d758f4 ("KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID check")
Link: https://lore.kernel.org/r/20230308072936.1293101-1-robert.hu@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support for SVM's Virtual NMIs implementation, which adds proper
tracking of virtual NMI blocking, and an intr_ctrl flag that software can
set to mark a virtual NMI as pending. Pending virtual NMIs are serviced
by hardware if/when virtual NMIs become unblocked, i.e. act more or less
like real NMIs.
Introduce two new kvm_x86_ops callbacks so to support SVM's vNMI, as KVM
needs to treat a pending vNMI as partially injected. Specifically, if
two NMIs (for L1) arrive concurrently in KVM's software model, KVM's ABI
is to inject one and pend the other. Without vNMI, KVM manually tracks
the pending NMI and uses NMI windows to detect when the NMI should be
injected.
With vNMI, the pending NMI is simply stuffed into the VMCB and handed
off to hardware. This means that KVM needs to be able to set a vNMI
pending on-demand, and also query if a vNMI is pending, e.g. to honor the
"at most one NMI pending" rule and to preserve all NMIs across save and
restore.
Warn if KVM attempts to open an NMI window when vNMI is fully enabled,
as the above logic should prevent KVM from ever getting to
kvm_check_and_inject_events() with two NMIs pending _in software_, and
the "at most one NMI pending" logic should prevent having an NMI pending
in hardware and an NMI pending in software if NMIs are also blocked, i.e.
if KVM can't immediately inject the second NMI.
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230227084016.3368-11-santosh.shukla@amd.com
[sean: rewrite shortlog and changelog, massage code comments]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the asynchronous NMI queue to handle pending NMIs coming in from
userspace during KVM_SET_VCPU_EVENTS so that all of KVM's logic for
handling multiple NMIs goes through process_nmi(). This will simplify
supporting SVM's upcoming "virtual NMI" functionality, which will need
changes KVM manages pending NMIs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Save all pending NMIs in KVM_GET_VCPU_EVENTS, and queue KVM_REQ_NMI if one
or more NMIs are pending after KVM_SET_VCPU_EVENTS in order to re-evaluate
pending NMIs with respect to NMI blocking.
KVM allows multiple NMIs to be pending in order to faithfully emulate bare
metal handling of simultaneous NMIs (on bare metal, truly simultaneous
NMIs are impossible, i.e. one will always arrive first and be consumed).
Support for simultaneous NMIs botched the save/restore though. KVM only
saves one pending NMI, but allows userspace to restore 255 pending NMIs
as kvm_vcpu_events.nmi.pending is a u8, and KVM's internal state is stored
in an unsigned int.
Fixes: 7460fb4a34 ("KVM: Fix simultaneous NMIs")
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-8-santosh.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tweak the code and comment that deals with concurrent NMIs to explicitly
call out that x86 allows exactly one pending NMI, but that KVM needs to
temporarily allow two pending NMIs in order to workaround the fact that
the target vCPU cannot immediately recognize an incoming NMI, unlike bare
metal.
No functional change intended.
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-7-santosh.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't raise KVM_REQ_EVENT if no NMIs are pending at the end of
process_nmi(). Finishing process_nmi() without a pending NMI will become
much more likely when KVM gains support for AMD's vNMI, which allows
pending vNMIs in hardware, i.e. doesn't require explicit injection.
Signed-off-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20230227084016.3368-6-santosh.shukla@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add helpers to check if a specific CR0/CR4 bit is set to avoid a plethora
of implicit casts from the "unsigned long" return of kvm_read_cr*_bits(),
and to make each caller's intent more obvious.
Defer converting helpers that do truly ugly casts from "unsigned long" to
"int", e.g. is_pse(), to a future commit so that their conversion is more
isolated.
Opportunistically drop the superfluous pcid_enabled from kvm_set_cr3();
the local variable is used only once, immediately after its declaration.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20230322045824.22970-2-binbin.wu@linux.intel.com
[sean: move "obvious" conversions to this commit, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
If paging is disabled, there are no permission bits to emulate.
Micro-optimize this case to avoid unnecessary work.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-4-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
There is no need to unload the MMU roots with TDP enabled when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.
One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.
The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):
legacy TDP shadow
kvm-x86/next@d8708b 8.43s 9.45s 70.3s
+patch 5.39s 5.63s 70.2s
For legacy MMU this is ~36% faster, for TDP MMU even ~40% faster. Also
TDP and legacy MMU now both have a similar runtime which vanishes the
need to disable TDP MMU for grsecurity.
Shadow MMU sees no measurable difference and is still slow, as expected.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-3-minipli@grsecurity.net
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
The @root_hpa for kvm_mmu_invalidate_addr() is called with @mmu->root.hpa
or INVALID_PAGE where @mmu->root.hpa is to invalidate gva for the current
root (the same meaning as KVM_MMU_ROOT_CURRENT) and INVALID_PAGE is to
invalidate gva for all roots (the same meaning as KVM_MMU_ROOTS_ALL).
Change the argument type of kvm_mmu_invalidate_addr() and use
KVM_MMU_ROOT_XXX instead so that we can reuse the function for
kvm_mmu_invpcid_gva() and nested_ept_invalidate_addr() for invalidating
gva for different set of roots.
No fuctionalities changed.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-9-jiangshanlai@gmail.com
[sean: massage comment slightly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
FNAME(invlpg)() and kvm_mmu_invalidate_gva() take a gva_t, i.e. unsigned
long, as the type of the address to invalidate. On 32-bit kernels, the
upper 32 bits of the GPA will get dropped when an L2 GPA address is
invalidated in the shadowed nested TDP MMU.
Convert it to u64 to fix the problem.
Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230216154115.710033-2-jiangshanlai@gmail.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
All kvm_arch_vm_ioctl() implementations now only deal with "int"
types as return values, so we can change the return type of these
functions to use "int" instead of "long".
Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20230208140105.655814-7-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_GET_NR_MMU_PAGES ioctl is quite questionable on 64-bit hosts
since it fails to return the full 64 bits of the value that can be
set with the corresponding KVM_SET_NR_MMU_PAGES call. Its "long" return
value is truncated into an "int" in the kvm_arch_vm_ioctl() function.
Since this ioctl also never has been used by userspace applications
(QEMU, Google's internal VMM, kvmtool and CrosVM have been checked),
it's likely the best if we remove this badly designed ioctl before
anybody really tries to use it.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230208140105.655814-4-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use a new EMULTYPE flag, EMULTYPE_WRITE_PF_TO_SP, to track page faults
on self-changing writes to shadowed page tables instead of propagating
that information to the emulator via a semi-persistent vCPU flag. Using
a flag in "struct kvm_vcpu_arch" is confusing, especially as implemented,
as it's not at all obvious that clearing the flag only when emulation
actually occurs is correct.
E.g. if KVM sets the flag and then retries the fault without ever getting
to the emulator, the flag will be left set for future calls into the
emulator. But because the flag is consumed if and only if both
EMULTYPE_PF and EMULTYPE_ALLOW_RETRY_PF are set, and because
EMULTYPE_ALLOW_RETRY_PF is deliberately not set for direct MMUs, emulated
MMIO, or while L2 is active, KVM avoids false positives on a stale flag
since FNAME(page_fault) is guaranteed to be run and refresh the flag
before it's ultimately consumed by the tail end of reexecute_instruction().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230202182817.407394-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place.
- Add support for taking stage-2 access faults in parallel. This was an
accidental omission in the original parallel faults implementation,
but should provide a marginal improvement to machines w/o FEAT_HAFDBS
(such as hardware from the fruit company).
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception handling
and masking unsupported features for nested guests.
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM.
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at reducing
the trap overhead of running nested.
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems.
- Avoid VM-wide stop-the-world operations when a vCPU accesses its own
redistributor.
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected exceptions
in the host.
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
This also drags in arm64's 'for-next/sme2' branch, because both it and
the PSCI relay changes touch the EL2 initialization code.
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Two patches sorting out confusion between virtual and physical
addresses, which currently are the same on s390.
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world,
some of them affecting architecurally legal but unlikely to
happen in practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give SVM
similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at this
point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the PMU and
MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't support
EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just
let the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how
to do initialization.
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to emit
the correct hypercall instruction instead of relying on KVM to patch
in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmP2YA0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPg/Qf+J6nT+TkIa+8Ei+fN1oMTDp4YuIOx
mXvJ9mRK9sQ+tAUVwvDz3qN/fK5mjsYbRHIDlVc5p2Q3bCrVGDDqXPFfCcLx1u+O
9U9xjkO4JxD2LS9pc70FYOyzVNeJ8VMGOBbC2b0lkdYZ4KnUc6e/WWFKJs96bK+H
duo+RIVyaMthnvbTwSv1K3qQb61n6lSJXplywS8KWFK6NZAmBiEFDAWGRYQE9lLs
VcVcG0iDJNL/BQJ5InKCcvXVGskcCm9erDszPo7w4Bypa4S9AMS42DHUaRZrBJwV
/WqdH7ckIz7+OSV0W1j+bKTHAFVTCjXYOM7wQykgjawjICzMSnnG9Gpskw==
=goe1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Provide a virtual cache topology to the guest to avoid
inconsistencies with migration on heterogenous systems. Non secure
software has no practical need to traverse the caches by set/way in
the first place
- Add support for taking stage-2 access faults in parallel. This was
an accidental omission in the original parallel faults
implementation, but should provide a marginal improvement to
machines w/o FEAT_HAFDBS (such as hardware from the fruit company)
- A preamble to adding support for nested virtualization to KVM,
including vEL2 register state, rudimentary nested exception
handling and masking unsupported features for nested guests
- Fixes to the PSCI relay that avoid an unexpected host SVE trap when
resuming a CPU when running pKVM
- VGIC maintenance interrupt support for the AIC
- Improvements to the arch timer emulation, primarily aimed at
reducing the trap overhead of running nested
- Add CONFIG_USERFAULTFD to the KVM selftests config fragment in the
interest of CI systems
- Avoid VM-wide stop-the-world operations when a vCPU accesses its
own redistributor
- Serialize when toggling CPACR_EL1.SMEN to avoid unexpected
exceptions in the host
- Aesthetic and comment/kerneldoc fixes
- Drop the vestiges of the old Columbia mailing list and add [Oliver]
as co-maintainer
RISC-V:
- Fix wrong usage of PGDIR_SIZE instead of PUD_SIZE
- Correctly place the guest in S-mode after redirecting a trap to the
guest
- Redirect illegal instruction traps to guest
- SBI PMU support for guest
s390:
- Sort out confusion between virtual and physical addresses, which
currently are the same on s390
- A new ioctl that performs cmpxchg on guest memory
- A few fixes
x86:
- Change tdp_mmu to a read-only parameter
- Separate TDP and shadow MMU page fault paths
- Enable Hyper-V invariant TSC control
- Fix a variety of APICv and AVIC bugs, some of them real-world, some
of them affecting architecurally legal but unlikely to happen in
practice
- Mark APIC timer as expired if its in one-shot mode and the count
underflows while the vCPU task was being migrated
- Advertise support for Intel's new fast REP string features
- Fix a double-shootdown issue in the emergency reboot code
- Ensure GIF=1 and disable SVM during an emergency reboot, i.e. give
SVM similar treatment to VMX
- Update Xen's TSC info CPUID sub-leaves as appropriate
- Add support for Hyper-V's extended hypercalls, where "support" at
this point is just forwarding the hypercalls to userspace
- Clean up the kvm->lock vs. kvm->srcu sequences when updating the
PMU and MSR filters
- One-off fixes and cleanups
- Fix and cleanup the range-based TLB flushing code, used when KVM is
running on Hyper-V
- Add support for filtering PMU events using a mask. If userspace
wants to restrict heavily what events the guest can use, it can now
do so without needing an absurd number of filter entries
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel Sapphire Rapids
- Fix a mostly benign overflow bug in SEV's
send|receive_update_data()
- Move several SVM-specific flags into vcpu_svm
x86 Intel:
- Handle NMI VM-Exits before leaving the noinstr region
- A few trivial cleanups in the VM-Enter flows
- Stop enabling VMFUNC for L1 purely to document that KVM doesn't
support EPTP switching (or any other VM function) for L1
- Fix a crash when using eVMCS's enlighted MSR bitmaps
Generic:
- Clean up the hardware enable and initialization flow, which was
scattered around multiple arch-specific hooks. Instead, just let
the arch code call into generic code. Both x86 and ARM should
benefit from not having to fight common KVM code's notion of how to
do initialization
- Account allocations in generic kvm_arch_alloc_vm()
- Fix a memory leak if coalesced MMIO unregistration fails
selftests:
- On x86, cache the CPU vendor (AMD vs. Intel) and use the info to
emit the correct hypercall instruction instead of relying on KVM to
patch in VMMCALL
- Use TAP interface for kvm_binary_stats_test and tsc_msrs_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (325 commits)
KVM: SVM: hyper-v: placate modpost section mismatch error
KVM: x86/mmu: Make tdp_mmu_allowed static
KVM: arm64: nv: Use reg_to_encoding() to get sysreg ID
KVM: arm64: nv: Only toggle cache for virtual EL2 when SCTLR_EL2 changes
KVM: arm64: nv: Filter out unsupported features from ID regs
KVM: arm64: nv: Emulate EL12 register accesses from the virtual EL2
KVM: arm64: nv: Allow a sysreg to be hidden from userspace only
KVM: arm64: nv: Emulate PSTATE.M for a guest hypervisor
KVM: arm64: nv: Add accessors for SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
KVM: arm64: nv: Handle SMCs taken from virtual EL2
KVM: arm64: nv: Handle trapped ERET from virtual EL2
KVM: arm64: nv: Inject HVC exceptions to the virtual EL2
KVM: arm64: nv: Support virtual EL2 exceptions
KVM: arm64: nv: Handle HCR_EL2.NV system register traps
KVM: arm64: nv: Add nested virt VCPU primitives for vEL2 VCPU state
KVM: arm64: nv: Add EL2 system registers to vcpu context
KVM: arm64: nv: Allow userspace to set PSR_MODE_EL2x
KVM: arm64: nv: Reset VCPU to EL2 registers if VCPU nested virt is set
KVM: arm64: nv: Introduce nested virtualization VCPU feature
KVM: arm64: Use the S2 MMU context to iterate over S2 table
...
where possible, when supporting a debug registers swap feature for
SEV-ES guests
- Add support for AMD's version of eIBRS called Automatic IBRS which is
a set-and-forget control of indirect branch restriction speculation
resources on privilege change
- Add support for a new x86 instruction - LKGS - Load kernel GS which is
part of the FRED infrastructure
- Reset SPEC_CTRL upon init to accomodate use cases like kexec which
rediscover
- Other smaller fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmP1RDIACgkQEsHwGGHe
VUohBw//ZB9ZRqsrKdm6D9YaP2x4Zb+kqKqo6rjYeWaYqyPyCwDujPwh+pb3Oq1t
aj62muDv1t/wEJc8mKNkfXkjEEtBVAOcpb5YIpKreoEvNKyevol83Ih0u5iJcTRE
E5qf8HDS8b/JZrcazJJLl6WQmQNH5RiKSu5bbCpRhoeOcyo5pRYR5MztK9vNmAQk
GMdwHsUSU+jN8uiE4HnpaOb/luhgFindRwZVTpdjJegQWLABS8cl3CKeTv4+PW45
isvv37XnQP248wsptIEVRHeG6g3g/HtvwRx7DikUw06QwUyUK7H9hJssOoSP8TL9
u4psRwfWnJ1OxU6klL+s0Ii+pjQ97wXmK/oqK7QkdUwhWqR/mQAW2e9kWHAngyDn
A6mKbzSM6HFAeSXQpB9cMb6uvYRD44SngDFe3WXtEK8jiiQ70ikUm4E28I5KJOPg
s+RyioHk0NFRHYSOOBqNG1NKz6ED7L3GbgbbzxkgMh21AAyI3X351t+PtGoLV5ew
eqOsM7lbg9Scg1LvPk1JcoALS8USWqgar397rz9qGUs+OkPWBtEBCmTdMz/Eb+2t
g/WHdLS5/ajSs5gNhT99W3DeqZMPDEkgBRSeyBBmY3CUD3gBL2wXEktRXv504zBR
RC4oyUPX3c9E2ib6GATLE3kBLbcz9hTWbMxF+X3lLJvTVd/Qc2o=
=v/ZC
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Cache the AMD debug registers in per-CPU variables to avoid MSR
writes where possible, when supporting a debug registers swap feature
for SEV-ES guests
- Add support for AMD's version of eIBRS called Automatic IBRS which is
a set-and-forget control of indirect branch restriction speculation
resources on privilege change
- Add support for a new x86 instruction - LKGS - Load kernel GS which
is part of the FRED infrastructure
- Reset SPEC_CTRL upon init to accomodate use cases like kexec which
rediscover
- Other smaller fixes and cleanups
* tag 'x86_cpu_for_v6.3_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd: Cache debug register values in percpu variables
KVM: x86: Propagate the AMD Automatic IBRS feature to the guest
x86/cpu: Support AMD Automatic IBRS
x86/cpu, kvm: Add the SMM_CTL MSR not present feature
x86/cpu, kvm: Add the Null Selector Clears Base feature
x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf
x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature
KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
x86/cpu, kvm: Add support for CPUID_80000021_EAX
x86/gsseg: Add the new <asm/gsseg.h> header to <asm/asm-prototypes.h>
x86/gsseg: Use the LKGS instruction if available for load_gs_index()
x86/gsseg: Move load_gs_index() to its own new header file
x86/gsseg: Make asm_load_gs_index() take an u16
x86/opcode: Add the LKGS instruction to x86-opcode-map
x86/cpufeature: Add the CPU feature bit for LKGS
x86/bugs: Reset speculation control settings on init
x86/cpu: Remove redundant extern x86_read_arch_cap_msr()
When calling the KVM_GET_DEBUGREGS ioctl, on some configurations, there
might be some unitialized portions of the kvm_debugregs structure that
could be copied to userspace. Prevent this as is done in the other kvm
ioctls, by setting the whole structure to 0 before copying anything into
it.
Bonus is that this reduces the lines of code as the explicit flag
setting and reserved space zeroing out can be removed.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable <stable@kernel.org>
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org>
Tested-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsH7kSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/dQQAJSVCYA7F7LcfJf+c1ULG0XCd8rHnXdR
EGTygTnWrzwTCRaunOBPE4AJRxKdkwTKy+yfnnQVRdYYfRe1SZKpUQ2XNEDGn0+v
zVfOmSFFcCWXmJeY8y5n1GlDH4ENO3G7nD1ncDQ0I9PazmOsmxChoVZ9afFJ5bpo
73hjcYVfUDYxGkeRLWSSSFtWIGguE8BkpRH3wZ8MGZi+ueoFUJPPBKHeDtxCV2/T
KcJLne8tQVTiWCdMO3EFwxgIvsjQDoT0gZYLYNHJ6KqD9Szc3jA9v2ryTm5IYlpb
akYUqePaD0SGrrfDBrwz3bLu3fehDu7eduXESRlzzb8S4xP7/qXeo9KeVN+b4MBb
nmBBFncvMWbC8Po5wB5OVAfAa7ACmGiXeBV8pfgGI6FTq1fpc4VNm2PevKkDvlqN
O2eZ1KuNkwBnbIPj3JVPPnsJcUjYXFjZyzfpMV1T+ExmL/IYceatX4S7zfgL5nUg
3qFi5mX2Cufk2EBvBu+Dkpt/H4lze+ysZRciMC+v7Q4LWAYZ8HW1a44pnBVUJMPM
bWiJ1/O8RIWM1tWIrlO38+ZZalbu3spIVMBXKzqEGXvpUwJ4UgZM1tFiWvISTVFe
2X6N3d7aT/DQ1PzZU6BsyVZWAFaodHBauMcr9FUkWqqGu3HOhqC4rSJZ9eRR7V5O
WSp1gTVY1JXy
=AVpx
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.3:
- Fix a mostly benign overflow bug in SEV's send|receive_update_data()
- Move the SVM-specific "host flags" into vcpu_svm (extracted from the
vNMI enabling series)
- A handful for fixes and cleanups
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmPsFZ4SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5eKEP/0qeZsOQot53wkf+wNiGh1X6qDacBPFP
A8GAPC70fEisxAt776DeKEBwikHpARPglCt1Il9dFvkG+0jgYpvPu8UGF1LpouKX
cD/7itr2k8GZlXZBg2Rgu3TRyFBJEGHT6tAu7PBhZyL6yWQDUxao8FPFrRGfmJ7O
Z6eFMo1cERNHICQm+W/2TBd1xguiF+m4CXKlA70R4wzM37aPF9o5HvmIwAvPzyhU
w4WzcIQbjVPs1VpBTzwPqRmyZ8omSlDYo7VqmsDiRtJbucqgbhFI2wR+nyImFCa9
D2pI5TV3CFTt0fvd8SZpH19nR3S6cMLCXONOsijmvR2BmS3PhJMP4dMm5m4R06nF
RBtnTj9fkbeL1ghFEkMxHBZVTG3bBlO4ySOxIqNHCvPjqQ37mJ+xP4C8kcIC9p5F
+xL3AvZ7zenPv3A29SY9YH+QvZLBwyDJzAsveLeYkLFoJxoDT4glOY/Wpi1rkZ17
/zHDZWoF49l1Eu3Bql0hFetkCreUNFGpa4moUmEC0evYOvV2WCb+39TDXZ8CPCGD
+cDiRnD8MFQpBw47F03EnFheFHxiJoL0Clv5vvM3C+xOq2J9WVG9mqQWCk+4ta2B
Um4D++0a9lwvJhOImaR7uyiV3K7oVm+rU8+46x+nTNGaIP2bnE+vronY+b6KGeUx
7+xzTKlYygGe
=ev5v
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.3' of https://github.com/kvm-x86/linux into HEAD
KVM x86 PMU changes for 6.3:
- Add support for created masked events for the PMU filter to allow
userspace to heavily restrict what events the guest can use without
needing to create an absurd number of events
- Clean up KVM's handling of "PMU MSRs to save", especially when vPMU
support is disabled
- Add PEBS support for Intel SPR
By default, KVM/SVM will intercept attempts by the guest to transition
out of C0. However, the KVM_CAP_X86_DISABLE_EXITS capability can be used
by a VMM to change this behavior. To mitigate the cross-thread return
address predictions bug (X86_BUG_SMT_RSB), a VMM must not be allowed to
override the default behavior to intercept C0 transitions.
Use a module parameter to control the mitigation on processors that are
vulnerable to X86_BUG_SMT_RSB. If the processor is vulnerable to the
X86_BUG_SMT_RSB bug and the module parameter is set to mitigate the bug,
KVM will not allow the disabling of the HLT, MWAIT and CSTATE exits.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <4019348b5e07148eb4d593380a5f6713b93c9a16.1675956146.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As of commit bccf2150fe ("KVM: Per-vcpu inodes"), __msr_io() doesn't
return a negative value. Remove unnecessary checks.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-7-mhal@rbox.co
[sean: call out commit which left behind the unnecessary check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace `1` with the actual mutex_is_locked() check.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-5-mhal@rbox.co
[sean: delete the comment that explained the hardocded '1']
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace srcu_dereference()+rcu_assign_pointer() sequence with
a single rcu_replace_pointer().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-4-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reduce time spent holding kvm->lock: unlock mutex before calling
synchronize_srcu(). There is no need to hold kvm->lock until all vCPUs
have been kicked, KVM only needs to guarantee that all vCPUs will switch
to the new filter before exiting to userspace.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-3-mhal@rbox.co
[sean: expand changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Instead of re-defining the "host flags" bits, just expose dedicated
helpers for each of the two remaining flags that are consumed by the
emulator. The emulator never consumes both "is guest" and "is SMM" in
close proximity, so there is no motivation to avoid additional indirect
branches.
Also while at it, garbage collect the recently removed host flags.
No functional change is intended.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com
[sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Provide "error" semantics (read zeros, drop writes) for userspace accesses
to MSRs that are ultimately unsupported for whatever reason, but for which
KVM told userspace to save and restore the MSR, i.e. for MSRs that KVM
included in KVM_GET_MSR_INDEX_LIST.
Previously, KVM special cased a few PMU MSRs that were problematic at one
point or another. Extend the treatment to all PMU MSRs, e.g. to avoid
spurious unsupported accesses.
Note, the logic can also be used for non-PMU MSRs, but as of today only
PMU MSRs can end up being unsupported after KVM told userspace to save and
restore them.
Link: https://lore.kernel.org/r/20230124234905.3774678-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Limit the set of MSRs for fixed PMU counters based on the number of fixed
counters actually supported by the host so that userspace doesn't waste
time saving and restoring dummy values.
Signed-off-by: Like Xu <likexu@tencent.com>
[sean: split for !enable_pmu logic, drop min(), write changelog]
Link: https://lore.kernel.org/r/20230124234905.3774678-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Omit all PMU MSRs from the "MSRs to save" list if the PMU is disabled so
that userspace doesn't waste time saving and restoring dummy values. KVM
provides "error" semantics (read zeros, drop writes) for such known-but-
unsupported MSRs, i.e. has fudged around this issue for quite some time.
Keep the "error" semantics as-is for now, the logic will be cleaned up in
a separate patch.
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Weijiang Yang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move all potential to-be-saved PMU MSRs into a separate array so that a
future patch can easily omit all PMU MSRs from the list when the PMU is
disabled.
No functional change intended.
Link: https://lore.kernel.org/r/20230124234905.3774678-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add helpers to print unimplemented MSR accesses and condition all such
prints on report_ignored_msrs, i.e. honor userspace's request to not
print unimplemented MSRs. Even though vcpu_unimpl() is ratelimited,
printing can still be problematic, e.g. if a print gets stalled when host
userspace is writing MSRs during live migration, an effective stall can
result in very noticeable disruption in the guest.
E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU
counters while the PMU was disabled in KVM.
- 99.75% 0.00% [.] __ioctl
- __ioctl
- 99.74% entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_ioctl
- do_vfs_ioctl
- 92.48% kvm_vcpu_ioctl
- kvm_arch_vcpu_ioctl
- 85.12% kvm_set_msr_ignored_check
svm_set_msr
kvm_set_msr_common
printk
vprintk_func
vprintk_default
vprintk_emit
console_unlock
call_console_drivers
univ8250_console_write
serial8250_console_write
uart_console_write
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based
on the vendor PMU capabilities so that consuming num_counters_gp naturally
does the right thing. This fixes a mostly theoretical bug where KVM could
over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g.
if the number of counters reported by perf is greater than KVM's
hardcoded internal limit. Incorporating input from the AMD PMU also
avoids over-reporting MSRs to save when running on AMD.
Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add the AMD Automatic IBRS feature bit to those being propagated to the guest,
and enable the guest EFER bit.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-9-kim.phillips@amd.com
Avoid type casts that are needed for IS_ERR() and use
IS_ERR_VALUE() instead.
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Link: https://lore.kernel.org/r/202211161718436948912@zte.com.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
When building a list of filter events, it can sometimes be a challenge
to fit all the events needed to adequately restrict the guest into the
limited space available in the pmu event filter. This stems from the
fact that the pmu event filter requires each event (i.e. event select +
unit mask) be listed, when the intention might be to restrict the
event select all together, regardless of it's unit mask. Instead of
increasing the number of filter events in the pmu event filter, add a
new encoding that is able to do a more generalized match on the unit mask.
Introduce masked events as another encoding the pmu event filter
understands. Masked events has the fields: mask, match, and exclude.
When filtering based on these events, the mask is applied to the guest's
unit mask to see if it matches the match value (i.e. umask & mask ==
match). The exclude bit can then be used to exclude events from that
match. E.g. for a given event select, if it's easier to say which unit
mask values shouldn't be filtered, a masked event can be set up to match
all possible unit mask values, then another masked event can be set up to
match the unit mask values that shouldn't be filtered.
Userspace can query to see if this feature exists by looking for the
capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS.
This feature is enabled by setting the flags field in the pmu event
filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS.
Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY().
It is an error to have a bit set outside the valid bits for a masked
event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in
such cases, including the high bits of the event select (35:32) if
called on Intel.
With these updates the filter matching code has been updated to match on
a common event. Masked events were flexible enough to handle both event
types, so they were used as the common event. This changes how guest
events get filtered because regardless of the type of event used in the
uAPI, they will be converted to masked events. Because of this there
could be a slight performance hit because instead of matching the filter
event with a lookup on event select + unit mask, it does a lookup on event
select then walks the unit masks to find the match. This shouldn't be a
big problem because I would expect the set of common event selects to be
small, and if they aren't the set can likely be reduced by using masked
events to generalize the unit mask. Using one type of event when
filtering guest events allows for a common code path to be used.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The scaling information in subleaf 1 should match the values set by KVM in
the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info)
which is shared with the guest, but is not directly available to the VMM.
The offset values are not set since a TSC offset is already applied.
The TSC frequency should also be set in sub-leaf 2.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging.
Both fields count the number of memslots that have dirty-logging enabled,
with the only difference being that cpu_dirty_logging_count is only
incremented when using PML. So while nr_memslots_dirty_logging is not a
direct replacement for cpu_dirty_logging_count, it can be combined with
enable_pml to get the same information.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The first half or so patches fix semi-urgent, real-world relevant APICv
and AVIC bugs.
The second half fixes a variety of AVIC and optimized APIC map bugs
where KVM doesn't play nice with various edge cases that are
architecturally legal(ish), but are unlikely to occur in most real world
scenarios
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Track the per-vendor required APICv inhibits with a variable instead of
calling into vendor code every time KVM wants to query the set of
required inhibits. The required inhibits are a property of the vendor's
virtualization architecture, i.e. are 100% static.
Using a variable allows the compiler to inline the check, e.g. generate
a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking
inhibits for performance reasons.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Free the APIC access page memslot if any vCPU enables x2APIC and SVM's
AVIC is enabled to prevent accesses to the virtual APIC on vCPUs with
x2APIC enabled. On AMD, if its "hybrid" mode is enabled (AVIC is enabled
when x2APIC is enabled even without x2AVIC support), keeping the APIC
access page memslot results in the guest being able to access the virtual
APIC page as x2APIC is fully emulated by KVM. I.e. hardware isn't aware
that the guest is operating in x2APIC mode.
Exempt nested SVM's update of APICv state from the new logic as x2APIC
can't be toggled on VM-Exit. In practice, invoking the x2APIC logic
should be harmless precisely because it should be a glorified nop, but
play it safe to avoid latent bugs, e.g. with dropping the vCPU's SRCU
lock.
Intel doesn't suffer from the same issue as APICv has fully independent
VMCS controls for xAPIC vs. x2APIC virtualization. Technically, KVM
should provide bus error semantics and not memory semantics for the APIC
page when x2APIC is enabled, but KVM already provides memory semantics in
other scenarios, e.g. if APICv/AVIC is enabled and the APIC is hardware
disabled (via APIC_BASE MSR).
Note, checking apic_access_memslot_enabled without taking locks relies
it being set during vCPU creation (before kvm_vcpu_reset()). vCPUs can
race to set the inhibit and delete the memslot, i.e. can get false
positives, but can't get false negatives as apic_access_memslot_enabled
can't be toggled "on" once any vCPU reaches KVM_RUN.
Opportunistically drop the "can" while updating avic_activate_vmcb()'s
comment, i.e. to state that KVM _does_ support the hybrid mode. Move
the "Note:" down a line to conform to preferred kernel/KVM multi-line
comment style.
Opportunistically update the apicv_update_lock comment, as it isn't
actually used to protect apic_access_memslot_enabled (which is protected
by slots_lock).
Fixes: 0e311d33bf ("KVM: SVM: Introduce hybrid-AVIC mode")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable CPU hotplug when enabling/disabling hardware to prevent the
corner case where if the following sequence occurs:
1. A hotplugged CPU marks itself online in cpu_online_mask
2. The hotplugged CPU enables interrupt before invoking KVM's ONLINE
callback
3 hardware_{en,dis}able_all() is invoked on another CPU
the hotplugged CPU will be included in on_each_cpu() and thus get sent
through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called.
start_secondary { ...
set_cpu_online(smp_processor_id(), true); <- 1
...
local_irq_enable(); <- 2
...
cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
}
KVM currently fudges around this race by keeping track of which CPUs have
done hardware enabling (see commit 1b6c016818 "KVM: Keep track of which
cpus have virtualization enabled"), but that's an inefficient, convoluted,
and hacky solution.
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-43-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The CPU STARTING section doesn't allow callbacks to fail. Move KVM's
hotplug callback to ONLINE section so that it can abort onlining a CPU in
certain cases to avoid potentially breaking VMs running on existing CPUs.
For example, when KVM fails to enable hardware virtualization on the
hotplugged CPU.
Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures
when offlining a CPU, all user tasks and non-pinned kernel tasks have left
the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's
CPU offline callback to disable hardware virtualization at that point.
Likewise, KVM's online callback can enable hardware virtualization before
any vCPU task gets a chance to run on hotplugged CPUs.
Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are
disabled, as the ONLINE section runs with IRQs disabled. The WARN wasn't
intended to be a requirement, e.g. disabling preemption is sufficient,
the IRQ thing was purely an aggressive sanity check since the helper was
only ever invoked via SMP function call.
Rename KVM's CPU hotplug callbacks accordingly.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
[sean: drop WARN that IRQs are disabled]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-42-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do compatibility checks when enabling hardware to effectively add
compatibility checks when onlining a CPU. Abort enabling, i.e. the
online process, if the (hotplugged) CPU is incompatible with the known
good setup.
At init time, KVM does compatibility checks to ensure that all online
CPUs support hardware virtualization and a common set of features. But
KVM uses hotplugged CPUs without such compatibility checks. On Intel
CPUs, this leads to #GP if the hotplugged CPU doesn't support VMX, or
VM-Entry failure if the hotplugged CPU doesn't support all features
enabled by KVM.
Note, this is little more than a NOP on SVM, as SVM already checks for
full SVM support during hardware enabling.
Opportunistically add a pr_err() if setup_vmcs_config() fails, and
tweak all error messages to output which CPU failed.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-41-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the .check_processor_compatibility() callback from kvm_x86_init_ops
to kvm_x86_ops to allow a future patch to do compatibility checks during
CPU hotplug.
Do kvm_ops_update() before compat checks so that static_call() can be
used during compat checks.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-40-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do basic VMX/SVM support checks directly in vendor code instead of
implementing them via kvm_x86_ops hooks. Beyond the superficial benefit
of providing common messages, which isn't even clearly a net positive
since vendor code can provide more precise/detailed messages, there's
zero advantage to bouncing through common x86 code.
Consolidating the checks will also simplify performing the checks
across all CPUs (in a future patch).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-37-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_check_processor_compat() and its support code now that all
architecture implementations are nops.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the CPU compatibility checks to pure x86 code, i.e. drop x86's use
of the common kvm_x86_check_cpu_compat() arch hook. x86 is the only
architecture that "needs" to do per-CPU compatibility checks, moving
the logic to x86 will allow dropping the common code, and will also
give x86 more control over when/how the compatibility checks are
performed, e.g. TDX will need to enable hardware (do VMXON) in order to
perform compatibility checks.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20221130230934.1014142-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations
are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acquire a new mutex, vendor_module_lock, in kvm_x86_vendor_init() while
doing hardware setup to ensure that concurrent calls are fully serialized.
KVM rejects attempts to load vendor modules if a different module has
already been loaded, but doesn't handle the case where multiple vendor
modules are loaded at the same time, and module_init() doesn't run under
the global module_mutex.
Note, in practice, this is likely a benign bug as no platform exists that
supports both SVM and VMX, i.e. barring a weird VM setup, one of the
vendor modules is guaranteed to fail a support check before modifying
common KVM state.
Alternatively, KVM could perform an atomic CMPXCHG on .hardware_enable,
but that comes with its own ugliness as it would require setting
.hardware_enable before success is guaranteed, e.g. attempting to load
the "wrong" could result in spurious failure to load the "right" module.
Introduce a new mutex as using kvm_lock is extremely deadlock prone due
to kvm_lock being taken under cpus_write_lock(), and in the future, under
under cpus_read_lock(). Any operation that takes cpus_read_lock() while
holding kvm_lock would potentially deadlock, e.g. kvm_timer_init() takes
cpus_read_lock() to register a callback. In theory, KVM could avoid
such problematic paths, i.e. do less setup under kvm_lock, but avoiding
all calls to cpus_read_lock() is subtly difficult and thus fragile. E.g.
updating static calls also acquires cpus_read_lock().
Inverting the lock ordering, i.e. always taking kvm_lock outside
cpus_read_lock(), is not a viable option as kvm_lock is taken in various
callbacks that may be invoked under cpus_read_lock(), e.g. x86's
kvmclock_cpufreq_notifier().
The lockdep splat below is dependent on future patches to take
cpus_read_lock() in hardware_enable_all(), but as above, deadlock is
already is already possible.
======================================================
WARNING: possible circular locking dependency detected
6.0.0-smp--7ec93244f194-init2 #27 Tainted: G O
------------------------------------------------------
stable/251833 is trying to acquire lock:
ffffffffc097ea28 (kvm_lock){+.+.}-{3:3}, at: hardware_enable_all+0x1f/0xc0 [kvm]
but task is already holding lock:
ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (cpu_hotplug_lock){++++}-{0:0}:
cpus_read_lock+0x2a/0xa0
__cpuhp_setup_state+0x2b/0x60
__kvm_x86_vendor_init+0x16a/0x1870 [kvm]
kvm_x86_vendor_init+0x23/0x40 [kvm]
0xffffffffc0a4d02b
do_one_initcall+0x110/0x200
do_init_module+0x4f/0x250
load_module+0x1730/0x18f0
__se_sys_finit_module+0xca/0x100
__x64_sys_finit_module+0x1d/0x20
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (kvm_lock){+.+.}-{3:3}:
__lock_acquire+0x16f4/0x30d0
lock_acquire+0xb2/0x190
__mutex_lock+0x98/0x6f0
mutex_lock_nested+0x1b/0x20
hardware_enable_all+0x1f/0xc0 [kvm]
kvm_dev_ioctl+0x45e/0x930 [kvm]
__se_sys_ioctl+0x77/0xc0
__x64_sys_ioctl+0x1d/0x20
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(cpu_hotplug_lock);
lock(kvm_lock);
lock(cpu_hotplug_lock);
lock(kvm_lock);
*** DEADLOCK ***
1 lock held by stable/251833:
#0: ffffffffa2456828 (cpu_hotplug_lock){++++}-{0:0}, at: hardware_enable_all+0xf/0xc0 [kvm]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the guts of kvm_arch_init() to a new helper, kvm_x86_vendor_init(),
so that VMX can do _all_ arch and vendor initialization before calling
kvm_init(). Calling kvm_init() must be the _very_ last step during init,
as kvm_init() exposes /dev/kvm to userspace, i.e. allows creating VMs.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that
all implementations are nops.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390
Acked-by: Anup Patel <anup@brainfault.org>
Message-Id: <20221130230934.1014142-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that kvm_arch_hardware_setup() is called immediately after
kvm_arch_init(), fold the guts of kvm_arch_hardware_(un)setup() into
kvm_arch_{init,exit}() as a step towards dropping one of the hooks.
To avoid having to unwind various setup, e.g registration of several
notifiers, slot in the vendor hardware setup before the registration of
said notifiers and callbacks. Introducing a functional change while
moving code is less than ideal, but the alternative is adding a pile of
unwinding code, which is much more error prone, e.g. several attempts to
move the setup code verbatim all introduced bugs.
Add a comment to document that kvm_ops_update() is effectively the point
of no return, e.g. it sets the kvm_x86_ops.hardware_enable canary and so
needs to be unwound.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_arch_init()'s call to kvm_timer_init() down a few lines below
the XCR0 configuration code. A future patch will move hardware setup
into kvm_arch_init() and slot in vendor hardware setup before the call
to kvm_timer_init() so that timer initialization (among other stuff)
doesn't need to be unwound if vendor setup fails. XCR0 setup on the
other hand needs to happen before vendor hardware setup.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221130230934.1014142-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86:
* Change tdp_mmu to a read-only parameter
* Separate TDP and shadow MMU page fault paths
* Enable Hyper-V invariant TSC control
selftests:
* Use TAP interface for kvm_binary_stats_test and tsc_msrs_test
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Normally, genuine Hyper-V doesn't expose architectural invariant TSC
(CPUID.80000007H:EDX[8]) to its guests by default. A special PV MSR
(HV_X64_MSR_TSC_INVARIANT_CONTROL, 0x40000118) and corresponding CPUID
feature bit (CPUID.0x40000003.EAX[15]) were introduced. When bit 0 of the
PV MSR is set, invariant TSC bit starts to show up in CPUID. When the
feature is exposed to Hyper-V guests, reenlightenment becomes unneeded.
Add the feature to KVM. Keep CPUID output intact when the feature
wasn't exposed to L1 and implement the required logic for hiding
invariant TSC when the feature was exposed and invariant TSC control
MSR wasn't written to. Copy genuine Hyper-V behavior and forbid to
disable the feature once it was enabled.
For the reference, for linux guests, support for the feature was added
in commit dce7cd6275 ("x86/hyperv: Allow guests to enable InvariantTSC").
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86:
* several fixes to nested VMX execution controls
* fixes and clarification to the documentation for Xen emulation
* do not unnecessarily release a pmu event with zero period
* MMU fixes
* fix Coverity warning in kvm_hv_flush_tlb()
selftests:
* fixes for the ucall mechanism in selftests
* other fixes mostly related to compilation with clang
Normally, genuine Hyper-V doesn't expose architectural invariant TSC
(CPUID.80000007H:EDX[8]) to its guests by default. A special PV MSR
(HV_X64_MSR_TSC_INVARIANT_CONTROL, 0x40000118) and corresponding CPUID
feature bit (CPUID.0x40000003.EAX[15]) were introduced. When bit 0 of the
PV MSR is set, invariant TSC bit starts to show up in CPUID. When the
feature is exposed to Hyper-V guests, reenlightenment becomes unneeded.
Add the feature to KVM. Keep CPUID output intact when the feature
wasn't exposed to L1 and implement the required logic for hiding
invariant TSC when the feature was exposed and invariant TSC control
MSR wasn't written to. Copy genuine Hyper-V behavior and forbid to
disable the feature once it was enabled.
For the reference, for linux guests, support for the feature was added
in commit dce7cd6275 ("x86/hyperv: Allow guests to enable InvariantTSC").
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221013095849.705943-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a sanity check in kvm_handle_memory_failure() to assert that a valid
x86_exception structure is provided if the memory "failure" wants to
propagate a fault into the guest. If a memory failure happens during a
direct guest physical memory access, e.g. for nested VMX, KVM hardcodes
the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer
(because the exception struct would just be filled with garbage).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220153427.514032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
* Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
* Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option,
which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a9:
"Fix a number of issues with MTE, such as races on the tags being
initialised vs the PG_mte_tagged flag as well as the lack of support
for VM_SHARED when KVM is involved. Patches from Catalin Marinas and
Peter Collingbourne").
* Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
* Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
* Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
* Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
* Second batch of the lazy destroy patches
* First batch of KVM changes for kernel virtual != physical address support
* Removal of a unused function
x86:
* Allow compiling out SMM support
* Cleanup and documentation of SMM state save area format
* Preserve interrupt shadow in SMM state save area
* Respond to generic signals during slow page faults
* Fixes and optimizations for the non-executable huge page errata fix.
* Reprogram all performance counters on PMU filter change
* Cleanups to Hyper-V emulation and tests
* Process Hyper-V TLB flushes from a nested guest (i.e. from a L2 guest
running on top of a L1 Hyper-V hypervisor)
* Advertise several new Intel features
* x86 Xen-for-KVM:
** Allow the Xen runstate information to cross a page boundary
** Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
** Add support for 32-bit guests in SCHEDOP_poll
* Notable x86 fixes and cleanups:
** One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
** Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
** Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
** Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
** Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
** Advertise (on AMD) that the SMM_CTL MSR is not supported
** Remove unnecessary exports
Generic:
* Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
* Add support for pinning vCPUs in dirty_log_perf_test.
* Rename the so called "perf_util" framework to "memstress".
* Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress tests.
* Add a common ucall implementation; code dedup and pre-work for running
SEV (and beyond) guests in selftests.
* Provide a common constructor and arch hook, which will eventually be
used by x86 to automatically select the right hypercall (AMD vs. Intel).
* A bunch of added/enabled/fixed selftests for ARM64, covering memslots,
breakpoints, stage-2 faults and access tracking.
* x86-specific selftest changes:
** Clean up x86's page table management.
** Clean up and enhance the "smaller maxphyaddr" test, and add a related
test to cover generic emulation failure.
** Clean up the nEPT support checks.
** Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
** Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
Documentation:
* Remove deleted ioctls from documentation
* Clean up the docs for the x86 MSR filter.
* Various fixes
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmOaFrcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPemQgAq49excg2Cc+EsHnZw3vu/QWdA0Rt
KhL3OgKxuHNjCbD2O9n2t5di7eJOTQ7F7T0eDm3xPTr4FS8LQ2327/mQePU/H2CF
mWOpq9RBWLzFsSTeVA2Mz9TUTkYSnDHYuRsBvHyw/n9cL76BWVzjImldFtjYjjex
yAwl8c5itKH6bc7KO+5ydswbvBzODkeYKUSBNdbn6m0JGQST7XppNwIAJvpiHsii
Qgpk0e4Xx9q4PXG/r5DedI6BlufBsLhv0aE9SHPzyKH3JbbUFhJYI8ZD5OhBQuYW
MwxK2KlM5Jm5ud2NZDDlsMmmvd1lnYCFDyqNozaKEWC1Y5rq1AbMa51fXA==
=QAYX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM64:
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on (see merge
commit 382b5b87a9: "Fix a number of issues with MTE, such as
races on the tags being initialised vs the PG_mte_tagged flag as
well as the lack of support for VM_SHARED when KVM is involved.
Patches from Catalin Marinas and Peter Collingbourne").
- Merge the pKVM shadow vcpu state tracking that allows the
hypervisor to have its own view of a vcpu, keeping that state
private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB
pages only) as a prefix of the oncoming support for 4kB and 16kB
pages.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
s390:
- Second batch of the lazy destroy patches
- First batch of KVM changes for kernel virtual != physical address
support
- Removal of a unused function
x86:
- Allow compiling out SMM support
- Cleanup and documentation of SMM state save area format
- Preserve interrupt shadow in SMM state save area
- Respond to generic signals during slow page faults
- Fixes and optimizations for the non-executable huge page errata
fix.
- Reprogram all performance counters on PMU filter change
- Cleanups to Hyper-V emulation and tests
- Process Hyper-V TLB flushes from a nested guest (i.e. from a L2
guest running on top of a L1 Hyper-V hypervisor)
- Advertise several new Intel features
- x86 Xen-for-KVM:
- Allow the Xen runstate information to cross a page boundary
- Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
- Add support for 32-bit guests in SCHEDOP_poll
- Notable x86 fixes and cleanups:
- One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
- Reinstate IBPB on emulated VM-Exit that was incorrectly dropped
a few years back when eliminating unnecessary barriers when
switching between vmcs01 and vmcs02.
- Clean up vmread_error_trampoline() to make it more obvious that
params must be passed on the stack, even for x86-64.
- Let userspace set all supported bits in MSR_IA32_FEAT_CTL
irrespective of the current guest CPUID.
- Fudge around a race with TSC refinement that results in KVM
incorrectly thinking a guest needs TSC scaling when running on a
CPU with a constant TSC, but no hardware-enumerated TSC
frequency.
- Advertise (on AMD) that the SMM_CTL MSR is not supported
- Remove unnecessary exports
Generic:
- Support for responding to signals during page faults; introduces
new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks
Selftests:
- Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
- Fix build errors that occur in certain setups (unsure exactly what
is unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
- Introduce actual atomics for clear/set_bit() in selftests
- Add support for pinning vCPUs in dirty_log_perf_test.
- Rename the so called "perf_util" framework to "memstress".
- Add a lightweight psuedo RNG for guest use, and use it to randomize
the access pattern and write vs. read percentage in the memstress
tests.
- Add a common ucall implementation; code dedup and pre-work for
running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually
be used by x86 to automatically select the right hypercall (AMD vs.
Intel).
- A bunch of added/enabled/fixed selftests for ARM64, covering
memslots, breakpoints, stage-2 faults and access tracking.
- x86-specific selftest changes:
- Clean up x86's page table management.
- Clean up and enhance the "smaller maxphyaddr" test, and add a
related test to cover generic emulation failure.
- Clean up the nEPT support checks.
- Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values.
- Fix an ordering issue in the AMX test introduced by recent
conversions to use kvm_cpu_has(), and harden the code to guard
against similar bugs in the future. Anything that tiggers
caching of KVM's supported CPUID, kvm_cpu_has() in this case,
effectively hides opt-in XSAVE features if the caching occurs
before the test opts in via prctl().
Documentation:
- Remove deleted ioctls from documentation
- Clean up the docs for the x86 MSR filter.
- Various fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits)
KVM: x86: Add proper ReST tables for userspace MSR exits/flags
KVM: selftests: Allocate ucall pool from MEM_REGION_DATA
KVM: arm64: selftests: Align VA space allocator with TTBR0
KVM: arm64: Fix benign bug with incorrect use of VA_BITS
KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow
KVM: x86: Advertise that the SMM_CTL MSR is not supported
KVM: x86: remove unnecessary exports
KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic"
tools: KVM: selftests: Convert clear/set_bit() to actual atomics
tools: Drop "atomic_" prefix from atomic test_and_set_bit()
tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers
KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests
perf tools: Use dedicated non-atomic clear/set bit helpers
tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers
KVM: arm64: selftests: Enable single-step without a "full" ucall()
KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself
KVM: Remove stale comment about KVM_REQ_UNHALT
KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR
KVM: Reference to kvm_userspace_memory_region in doc and comments
KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl
...
x86 Xen-for-KVM:
* Allow the Xen runstate information to cross a page boundary
* Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
* add support for 32-bit guests in SCHEDOP_poll
x86 fixes:
* One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
* Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
* Clean up the MSR filter docs.
* Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
* Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
* Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
* Advertise (on AMD) that the SMM_CTL MSR is not supported
* Remove unnecessary exports
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
Documentation:
* Remove deleted ioctls from documentation
* Various fixes
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on.
- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
stage-2 faults and access tracking. You name it, we got it, we
probably broke it.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
As a side effect, this tag also drags:
- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
series
- A shared branch with the arm64 tree that repaints all the system
registers to match the ARM ARM's naming, and resulting in
interesting conflicts
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmOODb0PHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDztsQAInRnsgLl57/SpqhZzExNCllN6AT/bdeB3uz
rnw3ScJOV174uNKp8lnPWoTvu2YUGiVtBp6tFHhDI8le7zHX438ZT8KE5mcs8p5i
KfFKnb8SHV2DDpqkcy24c0Xl/6vsg1qkKrdfJb49yl5ZakRITDpynW/7tn6dXsxX
wASeGFdCYeW4g2xMQzsCbtx6LgeQ8uomBmzRfPrOtZHYYxAn6+4Mj4595EC1sWxM
AQnbp8tW3Vw46saEZAQvUEOGOW9q0Nls7G21YqQ52IA+ZVDK1LmAF2b1XY3edjkk
pX8EsXOURfqdasBxfSfF3SgnUazoz9GHpSzp1cTVTktrPp40rrT7Ldtml0ktq69d
1malPj47KVMDsIq0kNJGnMxciXFgAHw+VaCQX+k4zhIatNwviMbSop2fEoxj22jc
4YGgGOxaGrnvmAJhreCIbr4CkZk5CJ8Zvmtfg+QM6npIp8BY8896nvORx/d4i6tT
H4caadd8AAR56ANUyd3+KqF3x0WrkaU0PLHJLy1tKwOXJUUTjcpvIfahBAAeUlSR
qEFrtb+EEMPgAwLfNOICcNkPZR/yyuYvM+FiUQNVy5cNiwFkpztpIctfOFaHySGF
K07O2/a1F6xKL0OKRUg7hGKknF9ecmux4vHhiUMuIk9VOgNTWobHozBDorLKXMzC
aWa6oGVC
=iIPT
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-6.2' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.2
- Enable the per-vcpu dirty-ring tracking mechanism, together with an
option to keep the good old dirty log around for pages that are
dirtied by something other than a vcpu.
- Switch to the relaxed parallel fault handling, using RCU to delay
page table reclaim and giving better performance under load.
- Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping
option, which multi-process VMMs such as crosvm rely on.
- Merge the pKVM shadow vcpu state tracking that allows the hypervisor
to have its own view of a vcpu, keeping that state private.
- Add support for the PMUv3p5 architecture revision, bringing support
for 64bit counters on systems that support it, and fix the
no-quite-compliant CHAIN-ed counter support for the machines that
actually exist out there.
- Fix a handful of minor issues around 52bit VA/PA support (64kB pages
only) as a prefix of the oncoming support for 4kB and 16kB pages.
- Add/Enable/Fix a bunch of selftests covering memslots, breakpoints,
stage-2 faults and access tracking. You name it, we got it, we
probably broke it.
- Pick a small set of documentation and spelling fixes, because no
good merge window would be complete without those.
As a side effect, this tag also drags:
- The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring
series
- A shared branch with the arm64 tree that repaints all the system
registers to match the ARM ARM's naming, and resulting in
interesting conflicts
* kvm-arm64/dirty-ring:
: .
: Add support for the "per-vcpu dirty-ring tracking with a bitmap
: and sprinkles on top", courtesy of Gavin Shan.
:
: This branch drags the kvmarm-fixes-6.1-3 tag which was already
: merged in 6.1-rc4 so that the branch is in a working state.
: .
KVM: Push dirty information unconditionally to backup bitmap
KVM: selftests: Automate choosing dirty ring size in dirty_log_test
KVM: selftests: Clear dirty ring states between two modes in dirty_log_test
KVM: selftests: Use host page size to map ring buffer in dirty_log_test
KVM: arm64: Enable ring-based dirty memory tracking
KVM: Support dirty ring in conjunction with bitmap
KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h
KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL
Signed-off-by: Marc Zyngier <maz@kernel.org>
Pull Xen-for-KVM changes from David Woodhouse:
* add support for 32-bit guests in SCHEDOP_poll
* the rest of the gfn-to-pfn cache API cleanup
"I still haven't reinstated the last of those patches to make gpc->len
immutable."
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several symbols are not used by vendor modules but still exported.
Removing them ensures that new coupling between kvm.ko and kvm-*.ko
is noticed and reviewed.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is
constant, in which case the actual TSC frequency will never change and thus
capturing TSC during initialization is unnecessary, KVM can simply use
tsc_khz. This value is snapshotted from
kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL)
On CPUs with constant TSC, but not a hardware-specified TSC frequency,
snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency
can lead to VM to think its TSC frequency is not what it actually is if
refining the TSC completes after KVM snapshots tsc_khz. The actual
frequency never changes, only the kernel's calculation of what that
frequency is changes.
Ideally, KVM would not be able to race with TSC refinement, or would have
a hook into tsc_refine_calibration_work() to get an alert when refinement
is complete. Avoiding the race altogether isn't practical as refinement
takes a relative eternity; it's deliberately put on a work queue outside of
the normal boot sequence to avoid unnecessarily delaying boot.
Adding a hook is doable, but somewhat gross due to KVM's ability to be
built as a module. And if the TSC is constant, which is likely the case
for every VMX/SVM-capable CPU produced in the last decade, the race can be
hit if and only if userspace is able to create a VM before TSC refinement
completes; refinement is slow, but not that slow.
For now, punt on a proper fix, as not taking a snapshot can help some uses
cases and not taking a snapshot is arguably correct irrespective of the
race with refinement.
Signed-off-by: Anton Romanov <romanton@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Treat any exception during instruction decode for EMULTYPE_SKIP as a
"full" emulation failure, i.e. signal failure instead of queuing the
exception. When decoding purely to skip an instruction, KVM and/or the
CPU has already done some amount of emulation that cannot be unwound,
e.g. on an EPT misconfig VM-Exit KVM has already processeed the emulated
MMIO. KVM already does this if a #UD is encountered, but not for other
exceptions, e.g. if a #PF is encountered during fetch.
In SVM's soft-injection use case, queueing the exception is particularly
problematic as queueing exceptions while injecting events can put KVM
into an infinite loop due to bailing from VM-Enter to service the newly
pending exception. E.g. multiple warnings to detect such behavior fire:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
Modules linked in: kvm_amd ccp kvm irqbypass
CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
Call Trace:
kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
__x64_sys_ioctl+0x85/0xc0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
Modules linked in: kvm_amd ccp kvm irqbypass
CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G W 6.0.0-rc1+ #220
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
Call Trace:
kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
__x64_sys_ioctl+0x85/0xc0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
---[ end trace 0000000000000000 ]---
Fixes: 6ea6e84309 ("KVM: x86: inject exceptions produced by x86_decode_insn")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
Drop the @gpa param from the exported check()+refresh() helpers and limit
changing the cache's GPA to the activate path. All external users just
feed in gpc->gpa, i.e. this is a fancy nop.
Allowing users to change the GPA at check()+refresh() is dangerous as
those helpers explicitly allow concurrent calls, e.g. KVM could get into
a livelock scenario. It's also unclear as to what the expected behavior
should be if multiple tasks attempt to refresh with different GPAs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
[sean: leave kvm_gpc_unmap() as-is]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Move the assignment of immutable properties @kvm, @vcpu, and @usage to
the initializer. Make _activate() and _deactivate() use stored values.
Note, @len is also effectively immutable for most cases, but not in the
case of the Xen runstate cache, which may be split across two pages and
the length of the first segment will depend on its address.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
[sean: handle @len in a separate patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[dwmw2: acknowledge that @len can actually change for some use cases]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by
turning it into a vmexit), there is no need to leave vcpu_enter_guest().
Any vcpu->requests will be caught later before the actual vmentry,
and in fact vcpu_enter_guest() was not initializing the "r" variable.
Depending on the compiler's whims, this could cause the
x86_64/triple_fault_event_test test to fail.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: 92e7d5c83a ("KVM: x86: allow L1 to not intercept triple fault")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by
turning it into a vmexit), there is no need to leave vcpu_enter_guest().
Any vcpu->requests will be caught later before the actual vmentry,
and in fact vcpu_enter_guest() was not initializing the "r" variable.
Depending on the compiler's whims, this could cause the
x86_64/triple_fault_event_test test to fail.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Fixes: 92e7d5c83a ("KVM: x86: allow L1 to not intercept triple fault")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Formalize "gpc" as the acronym and use it in function names.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Closer inspection of the Xen code shows that we aren't supposed to be
using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
If we randomly set the top bit of ->state_entry_time for a guest that
hasn't asked for it and doesn't expect it, that could make the runtimes
fail to add up and confuse the guest. Without the flag it's perfectly
safe for a vCPU to read its own vcpu_runstate_info; just not for one
vCPU to read *another's*.
I briefly pondered adding a word for the whole set of VMASST_TYPE_*
flags but the only one we care about for HVM guests is this, so it
seemed a bit pointless.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fixes for Xen emulation. While nobody should be enabling it in
the kernel (the only public users of the feature are the selftests),
the bug effectively allows userspace to read arbitrary memory.
* Correctness fixes for nested hypervisors that do not intercept INIT
or SHUTDOWN on AMD; the subsequent CPU reset can cause a use-after-free
when it disables virtualization extensions. While downgrading the panic
to a WARN is quite easy, the full fix is a bit more laborious; there
are also tests. This is the bulk of the pull request.
* Fix race condition due to incorrect mmu_lock use around
make_mmu_pages_available().
Generic:
* Obey changes to the kvm.halt_poll_ns module parameter in VMs
not using KVM_CAP_HALT_POLL, restoring behavior from before
the introduction of the capability
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmODI84UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPVJwgAombWOBf549JiHGPtwejuQO20nTSj
Om9pzWQ9dR182P+ju/FdqSPXt/Lc8i+z5zSXDrV3HQ6/a3zIItA+bOAUiMFvHNAQ
w/7pEb1MzVOsEg2SXGOjZvW3WouB4Z4R0PosInYjrFrRGRAaw5iaTOZHGezE44t2
WBWk1PpdMap7J/8sjNT1ble72ig9JdSW4qeJUQ1GWxHCigI5sESCQVqF446KM0jF
gTYPGX5TqpbWiIejF0yNew9yNKMi/yO4Pz8I5j3vtopeHx24DCIqUAGaEg6ykErX
vnzYbVP7NaFrqtje49PsK6i1cu2u7uFPArj0dxo3DviQVZVHV1q6tNmI4A==
=Qgei
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"x86:
- Fixes for Xen emulation. While nobody should be enabling it in the
kernel (the only public users of the feature are the selftests),
the bug effectively allows userspace to read arbitrary memory.
- Correctness fixes for nested hypervisors that do not intercept INIT
or SHUTDOWN on AMD; the subsequent CPU reset can cause a
use-after-free when it disables virtualization extensions. While
downgrading the panic to a WARN is quite easy, the full fix is a
bit more laborious; there are also tests. This is the bulk of the
pull request.
- Fix race condition due to incorrect mmu_lock use around
make_mmu_pages_available().
Generic:
- Obey changes to the kvm.halt_poll_ns module parameter in VMs not
using KVM_CAP_HALT_POLL, restoring behavior from before the
introduction of the capability"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Update gfn_to_pfn_cache khva when it moves within the same page
KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
KVM: x86/xen: Validate port number in SCHEDOP_poll
KVM: x86/mmu: Fix race condition in direct_page_fault
KVM: x86: remove exit_int_info warning in svm_handle_exit
KVM: selftests: add svm part to triple_fault_test
KVM: x86: allow L1 to not intercept triple fault
kvm: selftests: add svm nested shutdown test
KVM: selftests: move idt_entry to header
KVM: x86: forcibly leave nested mode on vCPU reset
KVM: x86: add kvm_leave_nested
KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use
KVM: x86: nSVM: leave nested mode on vCPU free
KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL
KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-polling
KVM: Cap vcpu->halt_poll_ns before halting rather than after
To allow flushing individual GVAs instead of always flushing the whole
VPID a per-vCPU structure to pass the requests is needed. Use standard
'kfifo' to queue two types of entries: individual GVA (GFN + up to 4095
following GFNs in the lower 12 bits) and 'flush all'.
The size of the fifo is arbitrarily set to '16'.
Note, kvm_hv_flush_tlb() only queues 'flush all' entries for now and
kvm_hv_vcpu_flush_tlb() doesn't actually read the fifo just resets the
queue before returning -EOPNOTSUPP (which triggers full TLB flush) so
the functional change is very small but the infrastructure is prepared
to handle individual GVA flush requests.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to implementing fine-grained Hyper-V TLB flush and
L2 TLB flush, resurrect dedicated KVM_REQ_HV_TLB_FLUSH request bit. As
KVM_REQ_TLB_FLUSH_GUEST is a stronger operation, clear KVM_REQ_HV_TLB_FLUSH
request in kvm_vcpu_flush_tlb_guest().
The flush itself is temporary handled by kvm_vcpu_flush_tlb_guest().
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear KVM_REQ_TLB_FLUSH_CURRENT in kvm_vcpu_flush_tlb_all() instead of in
its sole caller that processes KVM_REQ_TLB_FLUSH. Regardless of why/when
kvm_vcpu_flush_tlb_all() is called, flushing "all" TLB entries also
flushes "current" TLB entries.
Ideally, there will never be another caller of kvm_vcpu_flush_tlb_all(),
and moving the handling "requires" extra work to document the ordering
requirement, but future Hyper-V paravirt TLB flushing support will add
similar logic for flush "guest" (Hyper-V can flush a subset of "guest"
entries). And in the Hyper-V case, KVM needs to do more than just clear
the request, the queue of GPAs to flush also needs to purged, and doing
all only in the request path is undesirable as kvm_vcpu_flush_tlb_guest()
does have multiple callers (though it's unlikely KVM's paravirt TLB flush
will coincide with Hyper-V's paravirt TLB flush).
Move the logic even though it adds extra "work" so that KVM will be
consistent with how flush requests are processed when the Hyper-V support
lands.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make terminology between Hyper-V-on-KVM and KVM-on-Hyper-V consistent,
rename 'enable_direct_tlbflush' to 'enable_l2_tlb_flush'. The change
eliminates the use of confusing 'direct' and adds the missing underscore.
No functional change.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221101145426.251680-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This fixes three issues in nested SVM:
1) in the shutdown_interception() vmexit handler we call kvm_vcpu_reset().
However, if running nested and L1 doesn't intercept shutdown, the function
resets vcpu->arch.hflags without properly leaving the nested state.
This leaves the vCPU in inconsistent state and later triggers a kernel
panic in SVM code. The same bug can likely be triggered by sending INIT
via local apic to a vCPU which runs a nested guest.
On VMX we are lucky that the issue can't happen because VMX always
intercepts triple faults, thus triple fault in L2 will always be
redirected to L1. Plus, handle_triple_fault() doesn't reset the vCPU.
INIT IPI can't happen on VMX either because INIT events are masked while
in VMX mode.
Secondarily, KVM doesn't honour SHUTDOWN intercept bit of L1 on SVM.
A normal hypervisor should always intercept SHUTDOWN, a unit test on
the other hand might want to not do so.
Finally, the guest can trigger a kernel non rate limited printk on SVM
from the guest, which is fixed as well.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is SVM correctness fix - although a sane L1 would intercept
SHUTDOWN event, it doesn't have to, so we have to honour this.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While not obivous, kvm_vcpu_reset() leaves the nested mode by clearing
'vcpu->arch.hflags' but it does so without all the required housekeeping.
On SVM, it is possible to have a vCPU reset while in guest mode because
unlike VMX, on SVM, INIT's are not latched in SVM non root mode and in
addition to that L1 doesn't have to intercept triple fault, which should
also trigger L1's reset if happens in L2 while L1 didn't intercept it.
If one of the above conditions happen, KVM will continue to use vmcb02
while not having in the guest mode.
Later the IA32_EFER will be cleared which will lead to freeing of the
nested guest state which will (correctly) free the vmcb02, but since
KVM still uses it (incorrectly) this will lead to a use after free
and kernel crash.
This issue is assigned CVE-2022-3344
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
add kvm_leave_nested which wraps a call to nested_ops->leave_nested
into a function.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DE_CFG contains the LFENCE serializing bit, restore it on resume too.
This is relevant to older families due to the way how they do S3.
Unify and correct naming while at it.
Fixes: e4d0e84e49 ("x86/cpu/AMD: Make LFENCE a serializing instruction")
Reported-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Reported-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VCPU isn't expected to be runnable when the dirty ring becomes soft
full, until the dirty pages are harvested and the dirty ring is reset
from userspace. So there is a check in each guest's entrace to see if
the dirty ring is soft full or not. The VCPU is stopped from running if
its dirty ring has been soft full. The similar check will be needed when
the feature is going to be supported on ARM64. As Marc Zyngier suggested,
a new event will avoid pointless overhead to check the size of the dirty
ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance.
Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring
becomes soft full in kvm_dirty_ring_push(). The event is only cleared in
the check, done in the newly added helper kvm_dirty_ring_check_request().
Since the VCPU is not runnable when the dirty ring becomes soft full, the
KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from
running until the dirty pages are harvested and the dirty ring is reset by
userspace.
kvm_dirty_ring_soft_full() becomes a private function with the newly added
helper kvm_dirty_ring_check_request(). The alignment for the various event
definitions in kvm_host.h is changed to tab character by the way. In order
to avoid using 'container_of()', the argument @ring is replaced by @vcpu
in kvm_dirty_ring_push().
Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the
struct kvm_msr_filter_range. This simplifies checks that validate
these flags, and makes it easier to introduce new flags in the future.
No functional change intended.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220921151525.904162-5-aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>