Refuse to load KVM if NX support is not available and EPT is not enabled.
Shadow paging has assumed NX support since commit 9167ab7993 ("KVM:
vmx, svm: always run with EFER.NXE=1 when shadow paging is active"), so
for all intents and purposes this has been a de facto requirement for
over a year.
Do not require NX support if EPT is enabled purely because Intel CPUs let
firmware disable NX support via MSR_IA32_MISC_ENABLES. If not for that,
VMX (and KVM as a whole) could require NX support with minimal risk to
breaking userspace.
Fixes: 9167ab7993 ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210615164535.2146172-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the explicit check on EPTP switching being enabled. The EPTP
switching check is handled in the generic VMFUNC function check, while
the underlying VMFUNC enablement check is done by hardware and redone
by generic VMFUNC emulation.
The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
consistency check, keep it but add a WARN.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN and inject #UD when emulating VMFUNC for L2 if the function is
out-of-bounds or if VMFUNC is not enabled in vmcs12. Neither condition
should occur in practice, as the CPU is supposed to prioritize the #UD
over VM-Exit for out-of-bounds input and KVM is supposed to enable
VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
those dependencies is obvious.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the @reset_roots param from kvm_init_mmu(), the one user,
kvm_mmu_reset_context() has already unloaded the MMU and thus freed and
invalidated all roots. This also happens to be why the reset_roots=true
paths doesn't leak roots; they're already invalid.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
VMFUNC[EPTP_SWITCH] instead of nuking all MMUs. EPTP_SWITCH is the EPT
equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
the only hiccup being that A/D enabling is buried in the EPTP. But, that
is easily handled by bouncing through kvm_init_shadow_ept_mmu().
Explicitly request a guest TLB flush if VPID is disabled. Per Intel's
SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
values, where EP4TA is the value of bits 51:12 of EPTP)".
Note, this technically is a very bizarre bug fix of sorts if L2 is using
PAE paging, as avoiding the full MMU reload also avoids incorrectly
reloading the PDPTEs, which the SDM explicitly states are not touched:
If PAE paging is in use, an EPTP-switching VMFUNC does not load the
four page-directory-pointer-table entries (PDPTEs) from the
guest-physical address in CR3. The logical processor continues to use
the four guest-physical addresses already present in the PDPTEs. The
guest-physical address in CR3 is not translated through the new EPT
paging structures (until some operation that would load the PDPTEs).
In addition to optimizing L2's MMU shenanigans, avoiding the full reload
also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
root_mmu and guest_mmu.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots. From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not requied to invalidate such
entries. Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync
is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the
all-encompassing nested_vmx_transition_tlb_flush().
Opportunistically add a comment explaning why nested EPT never needs to
sync the MMU on VM-Enter.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bogus logic that incorrectly clobbers the accessed/dirty enabling
status of the nested MMU on an EPTP switch. When nested EPT is enabled,
walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2.
This is likely a benign bug, as mmu->ept_ad is never consumed (since the
MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will
never propagate into future shadow pages since the nested MMU isn't used
to map anything, just to walk L2's page tables.
Note, KVM also does a full MMU reload, i.e. the guest_mmu will be
recreated using the new EPTP, and thus any change in A/D enabling will be
properly recognized in the relevant MMU.
Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use BIT_ULL() instead of an open-coded shift to check whether or not a
function is enabled in L1's VMFUNC bitmap. This is a benign bug as KVM
supports only bit 0, and will fail VM-Enter if any other bits are set,
i.e. bits 63:32 are guaranteed to be zero.
Note, "function" is bounded by hardware as VMFUNC will #UD before taking
a VM-Exit if the function is greater than 63.
Before:
if ((vmcs12->vm_function_control & (1 << function)) == 0)
0x000000000001a916 <+118>: mov $0x1,%eax
0x000000000001a91b <+123>: shl %cl,%eax
0x000000000001a91d <+125>: cltq
0x000000000001a91f <+127>: and 0x128(%rbx),%rax
After:
if (!(vmcs12->vm_function_control & BIT_ULL(function & 63)))
0x000000000001a955 <+117>: mov 0x128(%rbx),%rdx
0x000000000001a95c <+124>: bt %rax,%rdx
Fixes: 27c42a1bb8 ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.
Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.
Note, there is no change the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.
Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which requested is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().
Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMCS12 is used to keep the authoritative state during nested state
migration. In case 'need_vmcs12_to_shadow_sync' flag is set, we're
in between L2->L1 vmexit and L1 guest run when actual sync to
enlightened (or shadow) VMCS happens. Nested state, however, has
no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()->
set_current_vmptr() always sets it. Enlightened vmptrld path, however,
doesn't have the quirk so some VMCS12 changes may not get properly
reflected to eVMCS and L1 will see an incorrect state.
Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
set the change is effectively a nop: in the former case all changes
will get reflected during the first L2->L1 vmexit and in the later
case VMCS12 and eVMCS are already in sync (thanks to
copy_enlightened_to_vmcs12() in vmx_get_nested_state()).
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When nested state migration happens during L1's execution, it
is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
At least genuine Hyper-V seems to not be very happy when 'clean fields'
data changes underneath it.
'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.
While at it, update a comment stating why exactly we need to reset
'hv_clean_fields' data from L0.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
validly states why shadow vmcs sync can be omitted but this doesn't
apply to enlightened VMCS as it 'shadows' all VMCS12 fields.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
hypervisor is not obliged to keep it up-to-date while it is mangling L2's
state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
VMCS12 is used as a source of ultimate truth, we must make sure we pick all
the changes to eVMCS and thus 'clean fields' data must be ignored.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
enlightened VMCS is in use. TLFS has the following brief description:
"The L1 hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state". Normally,
this change can be ignored as unmapping active eVMCS can be postponed
until the next VMLAUNCH instruction but in case nested state is migrated
with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
may result in its synchronization with VMCS12 and this is incorrect:
L1 hypervisor is free to reuse inactive eVMCS memory for something else.
Inactive eVMCS after VMCLEAR can just be unmapped.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
can not be called directly from vmx_set_nested_state() as KVM may not have
all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
restored yet). Enlightened VMCS is mapped later while getting nested state
pages. In the meantime, vmx->nested.hv_evmcs_vmptr remains 'EVMPTR_INVALID'
and it's indistinguishable from 'evmcs is not in use' case. This leads to
certain issues, in particular, if KVM_GET_NESTED_STATE is called right
after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the resulting
state will be unset (and such state will later fail to load).
Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after
restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12()
is no longer needed.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return any result,
make them return 'void'.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In theory, L1 can try to disable enlightened VMENTRY in VP assist page and
try to issue VMLAUNCH/VMRESUME. While nested_vmx_handle_enlightened_vmptrld()
properly handles this as 'EVMPTRLD_DISABLED', previously mapped eVMCS
remains mapped and thus all evmptr_is_valid() checks will still pass and
nested_vmx_run() will proceed when it shouldn't.
Release eVMCS immediately when we detect that enlightened vmentry was
disabled by L1.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02()
and both checks look like:
'vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)'
so for eVMCS case the flag changes nothing. Drop the assignment to avoid
the confusion.
No functional change intended.
Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of checking 'vmx->nested.hv_evmcs' use '-1' in
'vmx->nested.hv_evmcs_vmptr' to indicate 'evmcs is not in use' state. This
matches how we check 'vmx->nested.current_vmptr'. Introduce EVMPTR_INVALID
and evmptr_is_valid() and use it instead of raw '-1' check as a preparation
to adding other 'special' values.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
if new KVM_*_SREGS2 ioctls are used, the PDPTRs are
a part of the migration state and are correctly
restored by those ioctls.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to the rest of guest page accesses after a migration,
this access should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the pdptrs_changed() check when loading L2's CR3. The set of
available registers is always reset when switching VMCSes (see commit
e5d03de593, "KVM: nVMX: Reset register cache (available and dirty
masks) on VMCS switch"), thus the "are PDPTRs available" check will
always fail. And even if it didn't fail, reading guest memory to check
the PDPTRs is just as expensive as reading guest memory to load 'em.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently the remote TLB flush logic is specific to VMX.
Move it to a common place so that SVM can use it as well.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the 'nested_run' statistic counts all guest-entry attempts,
including those that fail during vmentry checks on Intel and during
consistency checks on AMD. Convert this statistic to count only those
guest-entries that make it past these state checks and make it to guest
code. This will tell us the number of guest-entries that actually executed
or tried to execute guest code.
Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that .post_leave_smm() is gone, drop "pre_" from the remaining
helpers. The helpers aren't invoked purely before SMI/RSM processing,
e.g. both helpers are invoked after state is snapshotted (from regs or
SMRAM), and the RSM helper is invoked after some amount of register state
has been stuffed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609185619.992058-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable,
there's no need to call kvm_apicv_init() from vendor specific code.
No functional change intended.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common
'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't allow posted interrupts to modify a stale posted interrupt
descriptor (including the initial value of 0).
Empirical tests on real hardware reveal that a posted interrupt
descriptor referencing an unbacked address has PCI bus error semantics
(reads as all 1's; writes are ignored). However, kvm can't distinguish
unbacked addresses from device-backed (MMIO) addresses, so it should
really ask userspace for an MMIO completion. That's overly
complicated, so just punt with KVM_INTERNAL_ERROR.
Don't return the error until the posted interrupt descriptor is
actually accessed. We don't want to break the existing kvm-unit-tests
that assume they can launch an L2 VM with a posted interrupt
descriptor that references MMIO space in L1.
Fixes: 6beb7bd52e ("kvm: nVMX: Refactor nested_get_vmcs12_pages()")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-8-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the kernel has no mapping for the vmcs02 virtual APIC page,
userspace MMIO completion is necessary to process nested posted
interrupts. This is not a configuration that KVM supports. Rather than
silently ignoring the problem, try to exit to userspace with
KVM_INTERNAL_ERROR.
Note that the event that triggers this error is consumed as a
side-effect of a call to kvm_check_nested_events. On some paths
(notably through kvm_vcpu_check_block), the error is dropped. In any
case, this is an incremental improvement over always ignoring the
error.
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-7-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functional change intended.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210604172611.281819-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate the TSC offset and multiplier on nested transitions and expose
the TSC scaling feature to L1.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-11-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.
Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The write_l1_tsc_offset() callback has a misleading name. It does not
set L1's TSC offset, it rather updates the current TSC offset which
might be different if a nested guest is executing. Additionally, both
the vmx and svm implementations use the same logic for calculating the
current TSC before writing it to hardware.
Rename the function and move the common logic to the caller. The vmx/svm
specific code now merely sets the given offset to the corresponding
hardware structure.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-9-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In order to implement as much of the nested TSC scaling logic as
possible in common code, we need these vendor callbacks for retrieving
the TSC offset and the TSC multiplier that L1 has set for L2.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-7-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store L1's scaling ratio in the kvm_vcpu_arch struct like we already do
for L1's TSC offset. This allows for easy save/restore when we enter and
then exit the nested guest.
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-3-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
of warnings by explicitly adding break statements instead of just letting
the code fall through to the next case.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20210528200756.GA39320@embeddedor>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_get_linear_rip() handles x86/long mode cases well and has
better readability, __kvm_set_rflags() also use the paired
function kvm_is_linear_rip() to check the vcpu->arch.singlestep_rip
set in kvm_arch_vcpu_ioctl_set_guest_debug(), so change the
"CS.BASE + RIP" code in kvm_arch_vcpu_ioctl_set_guest_debug() and
handle_exception_nmi() to this one.
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Message-Id: <20210526063828.1173-1-yuan.yao@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For VMX, when a vcpu enters HLT emulation, pi_post_block will:
1) Add vcpu to per-cpu list of blocked vcpus.
2) Program the posted-interrupt descriptor "notification vector"
to POSTED_INTR_WAKEUP_VECTOR
With interrupt remapping, an interrupt will set the PIR bit for the
vector programmed for the device on the CPU, test-and-set the
ON bit on the posted interrupt descriptor, and if the ON bit is clear
generate an interrupt for the notification vector.
This way, the target CPU wakes upon a device interrupt and wakes up
the target vcpu.
Problem is that pi_post_block only programs the notification vector
if kvm_arch_has_assigned_device() is true. Its possible for the
following to happen:
1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
notification vector is not programmed
2) device is assigned to VM
3) device interrupts vcpu V, sets ON bit
(notification vector not programmed, so pcpu P remains in idle)
4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
kvm_vcpu_kick is skipped
5) vcpu 0 busy spins on vcpu V's response for several seconds, until
RCU watchdog NMIs all vCPUs.
To fix this, use the start_assignment kvm_x86_ops callback to kick
vcpus out of the halt loop, so the notification vector is
properly reprogrammed to the wakeup vector.
Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Message-Id: <20210526172014.GA29007@fuller.cnet>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CONFIG_X86_LOCAL_APIC is always on when CONFIG_KVM (on x86) since
commit e42eef4ba3 ("KVM: add X86_LOCAL_APIC dependency").
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210518144339.1987982-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmCfmUoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDei8QAMOWMA9wFTydsMTyRwDDZzD9i3Vg4bYlTdj1
1C1FiHHGL37t44coo1eHtnydWBuhxhhwDHWQE8owFbDHyOnPzEX+NwhmJ4gVlUW5
51aSxfPgXzKiv17WyncqZO9SfA5/RFyA/C2gRq9/fMr/7CpQJjqrvdQXaWh4kPVa
9jFMVd1sCDUPd5c9Jyxd42CmVZjg6mCorOKaEwlI7NZkulRBlFW21A5y+M57sGTF
RLIuQcggFJaG17kZN4p6v55Yoclt8O4xVbDv8SZV3vO1gjpaF1LtXdsmAKvbDZrZ
lEtdumPHyD1maFhwXQFMOyvOgEaRhlhiNaTgKUOyX2LgeW1utCiYO/KwysflZvIC
oLsfx3x+G0nSxa+MWGL9m52Hrt4yyscfbKfBg6nqJB+AqD3teH20xfsEUHTEuYkW
kEgeWcJcWkadL5+ngs6S4PwFr88NyVBdUAagNd5VXE/KFhxCcr4B9oOXk5WdOaMi
ZvLG5IQfIH6k3w+h2wR2WSoxYwltriZ3PwrPIeJ2Se33bK15xtQy1k/IIqvZP/oK
0xxRVoY+nwuru0QZGwyI7zCFFvZzEKOXJ3qzJ2NeQxoTBky/e0bvUwnU8gXLXGPM
lx2Gzw6t+xlTfcF9oIaQq7WlOsrC7Zr4uiTurZGLZKWklso9tLdzW35zmdN6D3qx
sP2LC4iv
=57tg
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.13, take #1
- Fix regression with irqbypass not restarting the guest on failed connect
- Fix regression with debug register decoding resulting in overlapping access
- Commit exception state on exit to usrspace
- Fix the MMU notifier return values
- Add missing 'static' qualifiers in the new host stage-2 code
Bus lock debug exception is an ability to notify the kernel by an #DB
trap after the instruction acquires a bus lock and is executed when
CPL>0. This allows the kernel to enforce user application throttling or
mitigations.
Existence of bus lock debug exception is enumerated via
CPUID.(EAX=7,ECX=0).ECX[24]. Software can enable these exceptions by
setting bit 2 of the MSR_IA32_DEBUGCTL. Expose the CPUID to guest and
emulate the MSR handling when guest enables it.
Support for this feature was originally developed by Xiaoyao Li and
Chenyi Qiang, but code has since changed enough that this patch has
nothing in common with theirs, except for this commit message.
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to
the guest CPU model instead of the host CPU behavior. While not strictly
necessary to avoid guest breakage, emulating cross-vendor "architecture"
will provide consistent behavior for the guest, e.g. WRMSR fault behavior
won't change if the vCPU is migrated to a host with divergent behavior.
Note, the "new" kvm_is_supported_user_return_msr() checks do not add new
functionality on either SVM or VMX. On SVM, the equivalent was
"tsc_aux_uret_slot < 0", and on VMX the check was buried in the
vmx_find_uret_msr() call at the find_uret_msr label.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM and VMX both probe MSRs before "defining" user return slots
for them, consolidate the code for probe+define into common x86 and
eliminate the odd behavior of having the vendor code define the slot for
a given MSR.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tag TSX_CTRL as not needing to be loaded when RTM isn't supported in the
host. Crushing the write mask to '0' has the same effect, but requires
more mental gymnastics to understand.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop VMX's global list of user return MSRs now that VMX doesn't resort said
list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have
the same MSRs in the same order.
In addition to eliminating the redundant list, this will also allow moving
more of the list management into common x86.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly flag a uret MSR as needing to be loaded into hardware instead of
resorting the list of "active" MSRs and tracking how many MSRs in total
need to be loaded. The only benefit to sorting the list is that the loop
to load MSRs during vmx_prepare_switch_to_guest() doesn't need to iterate
over all supported uret MRS, only those that are active. But that is a
pointless optimization, as the most common case, running a 64-bit guest,
will load the vast majority of MSRs. Not to mention that a single WRMSR is
far more expensive than iterating over the list.
Providing a stable list order obviates the need to track a given MSR's
"slot" in the per-CPU list of user return MSRs; all lists simply use the
same ordering. Future patches will take advantage of the stable order to
further simplify the related code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Configure the list of user return MSRs that are actually supported at
module init instead of reprobing the list of possible MSRs every time a
vCPU is created. Curating the list on a per-vCPU basis is pointless; KVM
is completely hosed if the set of supported MSRs changes after module init,
or if the set of MSRs differs per physical PCU.
The per-vCPU lists also increase complexity (see __vmx_find_uret_msr()) and
creates corner cases that _should_ be impossible, but theoretically exist
in KVM, e.g. advertising RDTSCP to userspace without actually being able to
virtualize RDTSCP if probing MSR_TSC_AUX fails.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow userspace to enable RDPID for a guest without also enabling RDTSCP.
Aside from checking for RDPID support in the obvious flows, VMX also needs
to set ENABLE_RDTSCP=1 when RDPID is exposed.
For the record, there is no known scenario where enabling RDPID without
RDTSCP is desirable. But, both AMD and Intel architectures allow for the
condition, i.e. this is purely to make KVM more architecturally accurate.
Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Reported-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable preemption when probing a user return MSR via RDSMR/WRMSR. If
the MSR holds a different value per logical CPU, the WRMSR could corrupt
the host's value if KVM is preempted between the RDMSR and WRMSR, and
then rescheduled on a different CPU.
Opportunistically land the helper in common x86, SVM will use the helper
in a future commit.
Fixes: 4be5341026 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation")
Cc: stable@vger.kernel.org
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-6-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated intercept enum for RDPID instead of piggybacking RDTSCP.
Unlike VMX's ENABLE_RDTSCP, RDPID is not bound to SVM's RDTSCP intercept.
Fixes: fb6d4d340e ("KVM: x86: emulate RDPID")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-5-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear KVM's RDPID capability if the ENABLE_RDTSCP secondary exec control is
unsupported. Despite being enumerated in a separate CPUID flag, RDPID is
bundled under the same VMCS control as RDTSCP and will #UD in VMX non-root
if ENABLE_RDTSCP is not enabled.
Fixes: 41cd02c6f7 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210504171734.1434054-2-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When enlightened VMCS is in use and nested state is migrated with
vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
page right away: evmcs gpa is not 'struct kvm_vmx_nested_state_hdr'
and we can't read it from VP assist page because userspace may decide
to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
(and QEMU, for example, does exactly that). To make sure eVMCS is
mapped /vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES
request.
Commit f2c7ef3ba9 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
when an immediate exit from L2 to L1 happens right after migration (caused
by a pending event, for example). Unfortunately, in the exact same
situation we still need to have eVMCS mapped so
nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.
As a band-aid, restore nested_get_evmcs_page() when clearing
KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
from being ideal as we can't easily propagate possible failures and even if
we could, this is most likely already too late to do so. The whole
'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
seems to be fragile as we diverge too much from the 'native' path when
vmptr loading happens on vmx_set_nested_state().
Fixes: f2c7ef3ba9 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210503150854.1144255-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers. Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210505002735.1684165-9-seanjc@google.com
Defer the call to account guest time until after servicing any IRQ(s)
that happened in the guest or immediately after VM-Exit. Tick-based
accounting of vCPU time relies on PF_VCPU being set when the tick IRQ
handler runs, and IRQs are blocked throughout the main sequence of
vcpu_enter_guest(), including the call into vendor code to actually
enter and exit the guest.
This fixes a bug where reported guest time remains '0', even when
running an infinite loop in the guest:
https://bugzilla.kernel.org/show_bug.cgi?id=209831
Fixes: 87fa7f3e98 ("x86/kvm: Move context tracking where it belongs")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210505002735.1684165-4-seanjc@google.com
In VMX, the host NMI handler needs to be invoked after NMI VM-Exit.
Before commit 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect
call instead of INTn"), this was done by INTn ("int $2"). But INTn
microcode is relatively expensive, so the commit reworked NMI VM-Exit
handling to invoke the kernel handler by function call.
But this missed a detail. The NMI entry point for direct invocation is
fetched from the IDT table and called on the kernel stack. But on 64-bit
the NMI entry installed in the IDT expects to be invoked on the IST stack.
It relies on the "NMI executing" variable on the IST stack to work
correctly, which is at a fixed position in the IST stack. When the entry
point is unexpectedly called on the kernel stack, the RSP-addressed "NMI
executing" variable is obviously also on the kernel stack and is
"uninitialized" and can cause the NMI entry code to run in the wrong way.
Provide a non-ist entry point for VMX which shares the C-function with
the regular NMI entry and invoke the new asm entry point instead.
On 32-bit this just maps to the regular NMI entry point as 32-bit has no
ISTs and is not affected.
[ tglx: Made it independent for backporting, massaged changelog ]
Fixes: 1a5488ef0d ("KVM: VMX: Invoke NMI handler via indirect call instead of INTn")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Lai Jiangshan <laijs@linux.alibaba.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87r1imi8i1.ffs@nanos.tec.linutronix.de
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmCGmYIACgkQEsHwGGHe
VUr45w/8CSXr7MXaFBj4To0hTWJXSZyF6YGqlZOSJXFcFh4cWTNwfVOoFaV47aDo
+HsCNTkGENcKhLrDUWDRiG/Uo46jxtOtl1vhq7U4pGemSYH871XWOKfb5k5XNMwn
/uhaHMI4aEfd6bUFnF518NeyRIsD0BdqFj4tB7RbAiyFwdETDX9Tkj/uBKnQ4zon
4tEDoXgThuK5YKK9zVQg5pa7aFp2zg1CAdX/WzBkS8BHVBPXSV0CF97AJYQOM/V+
lUHv+BN3wp97GYHPQMPsbkNr8IuFoe2mIvikwjxg8iOFpzEU1G1u09XV9R+PXByX
LclFTRqK/2uU5hJlcsBiKfUuidyErYMRYImbMAOREt2w0ogWVu2zQ7HkjVve25h1
sQPwPudbAt6STbqRxvpmB3yoV4TCYwnF91FcWgEy+rcEK2BDsHCnScA45TsK5I1C
kGR1K17pHXprgMZFPveH+LgxewB6smDv+HllxQdSG67LhMJXcs2Epz0TsN8VsXw8
dlD3lGReK+5qy9FTgO7mY0xhiXGz1IbEdAPU4eRBgih13puu03+jqgMaMabvBWKD
wax+BWJUrPtetwD5fBPhlS/XdJDnd8Mkv2xsf//+wT0s4p+g++l1APYxeB8QEehm
Pd7Mvxm4GvQkfE13QEVIPYQRIXCMH/e9qixtY5SHUZDBVkUyFM0=
=bO1i
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Borislav Petkov:
"Trivial cleanups and fixes all over the place"
* tag 'x86_cleanups_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove me from IDE/ATAPI section
x86/pat: Do not compile stubbed functions when X86_PAT is off
x86/asm: Ensure asm/proto.h can be included stand-alone
x86/platform/intel/quark: Fix incorrect kernel-doc comment syntax in files
x86/msr: Make locally used functions static
x86/cacheinfo: Remove unneeded dead-store initialization
x86/process/64: Move cpu_current_top_of_stack out of TSS
tools/turbostat: Unmark non-kernel-doc comment
x86/syscalls: Fix -Wmissing-prototypes warnings from COND_SYSCALL()
x86/fpu/math-emu: Fix function cast warning
x86/msr: Fix wr/rdmsr_safe_regs_on_cpu() prototypes
x86: Fix various typos in comments, take #2
x86: Remove unusual Unicode characters from comments
x86/kaslr: Return boolean values from a function returning bool
x86: Fix various typos in comments
x86/setup: Remove unused RESERVE_BRK_ARRAY()
stacktrace: Move documentation for arch_stack_walk_reliable() to header
x86: Remove duplicate TSC DEADLINE MSR definitions
Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants. I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use. Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bits 63:32 of the base and/or index GPRs when calculating the
effective address of a VMX instruction memory operand. Outside of 64-bit
mode, memory encodings are strictly limited to E*X and below.
Fixes: 064aea7747 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bits 63:32 of the VMCS field encoding when checking for a nested
VM-Exit on VMREAD/VMWRITE in !64-bit mode. VMREAD and VMWRITE always
use 32-bit operands outside of 64-bit mode.
The actual emulation of VMREAD/VMWRITE does the right thing, this bug is
purely limited to incorrectly causing a nested VM-Exit if a GPR happens
to have bits 63:32 set outside of 64-bit mode.
Fixes: a7cde481b6 ("KVM: nVMX: Do not forward VMREAD/VMWRITE VMExits to L1 if required so by vmcs12 vmread/vmwrite bitmaps")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop bits 63:32 when storing a DR/CR to a GPR when the vCPU is not in
64-bit mode. Per the SDM:
The operand size for these instructions is always 32 bits in non-64-bit
modes, regardless of the operand-size attribute.
CR8 technically isn't affected as CR8 isn't accessible outside of 64-bit
mode, but fix it up for consistency and to allow for future cleanup.
Fixes: 6aa8b732ca ("[PATCH] kvm: userspace interface")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable pass-through of the FS and GS base MSRs for 32-bit KVM. Intel's
SDM unequivocally states that the MSRs exist if and only if the CPU
supports x86-64. FS_BASE and GS_BASE are mostly a non-issue; a clever
guest could opportunistically use the MSRs without issue. KERNEL_GS_BASE
is a bigger problem, as a clever guest would subtly be broken if it were
migrated, as KVM disallows software access to the MSRs, and unlike the
direct variants, KERNEL_GS_BASE needs to be explicitly migrated as it's
not captured in the VMCS.
Fixes: 25c5f225be ("KVM: VMX: Enable MSR Bitmap feature")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422023831.3473491-1-seanjc@google.com>
[*NOT* for stable kernels. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invert the inline declarations of the MSR interception helpers between
the wrapper, vmx_set_intercept_for_msr(), and the core implementations,
vmx_{dis,en}able_intercept_for_msr(). Letting the compiler _not_
inline the implementation reduces KVM's code footprint by ~3k bytes.
Back when the helpers were added in commit 904e14fb7c ("KVM: VMX: make
MSR bitmaps per-VCPU"), both the wrapper and the implementations were
__always_inline because the end code distilled down to a few conditionals
and a bit operation. Today, the implementations involve a variety of
checks and bit ops in order to support userspace MSR filtering.
Furthermore, the vast majority of calls to manipulate MSR interception
are not performance sensitive, e.g. vCPU creation and x2APIC toggling.
On the other hand, the one path that is performance sensitive, dynamic
LBR passthrough, uses the wrappers, i.e. is largely untouched by
inverting the inlining.
In short, forcing the low level MSR interception code to be inlined no
longer makes sense.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210423221912.3857243-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use symbolic value, EPT_VIOLATION_GVA_TRANSLATED, instead of 0x100
in handle_ept_violation().
Signed-off-by: Yao Yuan <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <724e8271ea301aece3eb2afe286a9e2e92a70b18.1619136576.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enable SGX virtualization now that KVM has the VM-Exit handlers needed
to trap-and-execute ENCLS to ensure correctness and/or enforce the CPU
model exposed to the guest. Add a KVM module param, "sgx", to allow an
admin to disable SGX virtualization independent of the kernel.
When supported in hardware and the kernel, advertise SGX1, SGX2 and SGX
LC to userspace via CPUID and wire up the ENCLS_EXITING bitmap based on
the guest's SGX capabilities, i.e. to allow ENCLS to be executed in an
SGX-enabled guest. With the exception of the provision key, all SGX
attribute bits may be exposed to the guest. Guest access to the
provision key, which is controlled via securityfs, will be added in a
future patch.
Note, KVM does not yet support exposing ENCLS_C leafs or ENCLV leafs.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <a99e9c23310c79f2f4175c1af4c4cbcef913c3e5.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a VM-Exit handler to trap-and-execute EINIT when SGX LC is enabled
in the host. When SGX LC is enabled, the host kernel may rewrite the
hardware values at will, e.g. to launch enclaves with different signers,
thus KVM needs to intercept EINIT to ensure it is executed with the
correct LE hash (even if the guest sees a hardwired hash).
Switching the LE hash MSRs on VM-Enter/VM-Exit is not a viable option as
writing the MSRs is prohibitively expensive, e.g. on SKL hardware each
WRMSR is ~400 cycles. And because EINIT takes tens of thousands of
cycles to execute, the ~1500 cycle overhead to trap-and-execute EINIT is
unlikely to be noticed by the guest, let alone impact its overall SGX
performance.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <57c92fa4d2083eb3be9e6355e3882fc90cffea87.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulate the four Launch Enclave public key hash MSRs (LE hash MSRs) that
exist on CPUs that support SGX Launch Control (LC). SGX LC modifies the
behavior of ENCLS[EINIT] to use the LE hash MSRs when verifying the key
used to sign an enclave. On CPUs without LC support, the LE hash is
hardwired into the CPU to an Intel controlled key (the Intel key is also
the reset value of the LE hash MSRs). Track the guest's desired hash so
that a future patch can stuff the hash into the hardware MSRs when
executing EINIT on behalf of the guest, when those MSRs are writable in
host.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <c58ef601ddf88f3a113add837969533099b1364a.1618196135.git.kai.huang@intel.com>
[Add a comment regarding the MSRs being available until SGX is locked.
- Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an ECREATE handler that will be used to intercept ECREATE for the
purpose of enforcing and enclave's MISCSELECT, ATTRIBUTES and XFRM, i.e.
to allow userspace to restrict SGX features via CPUID. ECREATE will be
intercepted when any of the aforementioned masks diverges from hardware
in order to enforce the desired CPUID model, i.e. inject #GP if the
guest attempts to set a bit that hasn't been enumerated as allowed-1 in
CPUID.
Note, access to the PROVISIONKEY is not yet supported.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <c3a97684f1b71b4f4626a1fc3879472a95651725.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce sgx.c and sgx.h, along with the framework for handling ENCLS
VM-Exits. Add a bool, enable_sgx, that will eventually be wired up to a
module param to control whether or not SGX virtualization is enabled at
runtime.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <1c782269608b2f5e1034be450f375a8432fb705d.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for handling VM-Exits that originate from a guest SGX
enclave. In SGX, an "enclave" is a new CPL3-only execution environment,
wherein the CPU and memory state is protected by hardware to make the
state inaccesible to code running outside of the enclave. When exiting
an enclave due to an asynchronous event (from the perspective of the
enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
is automatically saved and scrubbed (the CPU loads synthetic state), and
then reloaded when re-entering the enclave. E.g. after an instruction
based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
of the enclave instruction that trigered VM-Exit, but will instead point
to a RIP in the enclave's untrusted runtime (the guest userspace code
that coordinates entry/exit to/from the enclave).
To help a VMM recognize and handle exits from enclaves, SGX adds bits to
existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR. Define the
new architectural bits, and add a boolean to struct vcpu_vmx to cache
VMX_EXIT_REASON_FROM_ENCLAVE. Clear the bit in exit_reason so that
checks against exit_reason do not need to account for SGX, e.g.
"if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
KVM is a largely a passive observer of the new bits, e.g. KVM needs to
account for the bits when propagating information to a nested VMM, but
otherwise doesn't need to act differently for the majority of VM-Exits
from enclaves.
The one scenario that is directly impacted is emulation, which is for
all intents and purposes impossible[1] since KVM does not have access to
the RIP or instruction stream that triggered the VM-Exit. The inability
to emulate is a non-issue for KVM, as most instructions that might
trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
check. For the few instruction that conditionally #UD, KVM either never
sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
if the feature is not exposed to the guest in order to inject a #UD,
e.g. RDRAND_EXITING.
But, because it is still possible for a guest to trigger emulation,
e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
from an enclave. This is architecturally accurate for instruction
VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
to killing the VM. In practice, only broken or particularly stupid
guests should ever encounter this behavior.
Add a WARN in skip_emulated_instruction to detect any attempt to
modify the guest's RIP during an SGX enclave VM-Exit as all such flows
should either be unreachable or must handle exits from enclaves before
getting to skip_emulated_instruction.
[1] Impossible for all practical purposes. Not truly impossible
since KVM could implement some form of para-virtualization scheme.
[2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
CPL3, so we also don't need to worry about that interaction.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add compile-time assertions in vmcs_check32() to disallow accesses to
64-bit and 64-bit high fields via vmcs_{read,write}32(). Upper level KVM
code should never do partial accesses to VMCS fields. KVM handles the
split accesses automatically in vmcs_{read,write}64() when running as a
32-bit kernel.
Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210409022456.23528-1-lihaiwei.kernel@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Injected interrupts/nmi should not block a pending exception,
but rather be either lost if nested hypervisor doesn't
intercept the pending exception (as in stock x86), or be delivered
in exitintinfo/IDT_VECTORING_INFO field, as a part of a VMexit
that corresponds to the pending exception.
The only reason for an exception to be blocked is when nested run
is pending (and that can't really happen currently
but still worth checking for).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When dumping the current VMCS state, include the MSRs that are being
automatically loaded/stored during VM entry/exit.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If EFER is not being loaded from the VMCS, show the effective value by
reference to the MSR autoload list or calculation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT
fields of the VMCS, examine only the VM entry load controls, as saving
on VM exit has no effect on whether VM entry succeeds or fails.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Show EFER and PAT based on their individual entry/exit controls.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the VM entry/exit controls for loading/saving MSR_EFER are either
not available (an older processor or explicitly disabled) or not
used (host and guest values are the same), reading GUEST_IA32_EFER
from the VMCS returns an inaccurate value.
Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
whether to print the PDPTRs - always do so if the fields exist.
Fixes: 4eb64dce8d ("KVM: x86: dump VMCS on invalid entry")
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are
clearly associated with a single task/VM.
Note, there are a several SEV allocations that aren't accounted, but
those can (hopefully) be fixed by using the local stack for memory.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331023025.2485960-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
__vmx_handle_exit() uses vcpu->run->internal.ndata as an index for
an array access. Since vcpu->run is (can be) mapped to a user address
space with a writer permission, the 'ndata' could be updated by the
user process at anytime (the user process can set it to outside the
bounds of the array).
So, it is not safe that __vmx_handle_exit() uses the 'ndata' that way.
Fixes: 1aa561b1a4 ("kvm: x86: Add "last CPU" to some KVM_EXIT information")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210413154739.490299-1-reijiw@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix another ~42 single-word typos in arch/x86/ code comments,
missed a few in the first pass, in particular in .S files.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
Fix ~144 single-word typos in arch/x86/ code comments.
Doing this in a single commit should reduce the churn.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
Track the address of the top-level EPT struct, a.k.a. the root HPA,
instead of the EPTP itself for Hyper-V's paravirt TLB flush. The
paravirt API takes only the address, not the full EPTP, and in theory
tracking the EPTP could lead to false negatives, e.g. if the HPA matched
but the attributes in the EPTP do not. In practice, such a mismatch is
extremely unlikely, if not flat out impossible, given how KVM generates
the EPTP.
Opportunsitically rename the related fields to use the 'root'
nomenclature, and to prefix them with 'hv_' to connect them to Hyper-V's
paravirt TLB flushing.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip additional EPTP flushes if one fails when processing EPTPs for
Hyper-V's paravirt TLB flushing. If _any_ flush fails, KVM falls back
to a full global flush, i.e. additional flushes are unnecessary (and
will likely fail anyways).
Continue processing the loop unless a mismatch was already detected,
e.g. to handle the case where the first flush fails and there is a
yet-to-be-detected mismatch.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ifdef away the Hyper-V specific fields in structs kvm_vmx and vcpu_vmx
as each field has only a single reference outside of the struct itself
that isn't already wrapped in ifdeffery (and both are initialization).
vcpu_vmx.ept_pointer in particular should be wrapped as it is valid if
and only if Hyper-v is active, i.e. non-Hyper-V code cannot rely on it
to actually track the current EPTP (without additional code changes).
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check that kvm_x86_ops.tlb_remote_flush() points at Hyper-V's
implementation for PV flushing instead of assuming that a non-NULL
implementation means running on Hyper-V. Wrap the related logic in
ifdeffery as hv_remote_flush_tlb() is defined iff CONFIG_HYPERV!=n.
Short term, the explicit check makes it more obvious why a non-NULL
tlb_remote_flush() triggers EPTP shenanigans. Long term, this will
allow TDX to define its own implementation of tlb_remote_flush() without
running afoul of Hyper-V.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't invalidate the common EPTP, and thus trigger rechecking of EPTPs
across all vCPUs, if the new EPTP matches the old/common EPTP. In all
likelihood this is a meaningless optimization, but there are (uncommon)
scenarios where KVM can reload the same EPTP.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the dedicated 'ept_pointers_match' field in favor of stuffing
'hv_tlb_eptp' with INVALID_PAGE to mark it as invalid, i.e. to denote
that there is at least one EPTP mismatch. Use a local variable to
track whether or not a mismatch is detected so that hv_tlb_eptp can be
used to skip redundant flushes.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Combine the for-loops for Hyper-V TLB EPTP checking and flushing, and in
doing so skip flushes for vCPUs whose EPTP matches the target EPTP.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold check_ept_pointer_match() into hv_remote_flush_tlb_with_range() in
preparation for combining the kvm_for_each_vcpu loops of the ==CHECK and
!=MATCH statements.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Capture kvm_vmx in a local variable instead of polluting
hv_remote_flush_tlb_with_range() with to_kvm_vmx(kvm).
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly track the EPTP that is common to all vCPUs instead of
grabbing vCPU0's EPTP when invoking Hyper-V's paravirt TLB flush.
Tracking the EPTP will allow optimizing the checks when loading a new
EPTP and will also allow dropping ept_pointer_match, e.g. by marking
the common EPTP as invalid.
This also technically fixes a bug where KVM could theoretically flush an
invalid GPA if all vCPUs have an invalid root. In practice, it's likely
impossible to trigger a remote TLB flush in such a scenario. In any
case, the superfluous flush is completely benign.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Retrieve the active PCID only when writing a guest CR3 value, i.e. don't
get the PCID when using EPT or NPT. The PCID is especially problematic
for EPT as the bits have different meaning, and so the PCID and must be
manually stripped, which is annoying and unnecessary. And on VMX,
getting the active PCID also involves reading the guest's CR3 and
CR4.PCIDE, i.e. may add pointless VMREADs.
Opportunistically rename the pgd/pgd_level params to root_hpa and
root_level to better reflect their new roles. Keep the function names,
as "load the guest PGD" is still accurate/correct.
Last, and probably least, pass root_hpa as a hpa_t/u64 instead of an
unsigned long. The EPTP holds a 64-bit value, even in 32-bit mode, so
in theory EPT could support HIGHMEM for 32-bit KVM. Never mind that
doing so would require changing the MMU page allocators and reworking
the MMU to use kmap().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let the MMU deal with the SPTE masks to avoid splitting the logic and
knowledge across the MMU and VMX.
The SPTE masks that are used for EPT are very, very tightly coupled to
the MMU implementation. The use of available bits, the existence of A/D
types, the fact that shadow_x_mask even exists, and so on and so forth
are all baked into the MMU implementation. Cross referencing the params
to the masks is also a nightmare, as pretty much every param is a u64.
A future patch will make the location of the MMU_WRITABLE and
HOST_WRITABLE bits MMU specific, to free up bit 11 for a MMU_PRESENT bit.
Doing that change with the current kvm_mmu_set_mask_ptes() would be an
absolute mess.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish all the code for (re)setting the various SPTE masks into one
location. With the split code, it's not at all clear that the masks are
set once during module initialization. This will allow a future patch to
clean up initialization of the masks without shuffling code all over
tarnation.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Stop tagging MMIO SPTEs with specific available bits and instead detect
MMIO SPTEs by checking for their unique SPTE value. The value is
guaranteed to be unique on shadow paging and NPT as setting reserved
physical address bits on any other type of SPTE would consistute a KVM
bug. Ditto for EPT, as creating a WX non-MMIO would also be a bug.
Note, this approach is also future-compatibile with TDX, which will need
to reflect MMIO EPT violations as #VEs into the guest. To create an EPT
violation instead of a misconfig, TDX EPTs will need to have RWX=0, But,
MMIO SPTEs will also be the only case where KVM clears SUPPRESS_VE, so
MMIO SPTEs will still be guaranteed to have a unique value within a given
MMU context.
The main motivation is to make it easier to reason about which types of
SPTEs use which available bits. As a happy side effect, this frees up
two more bits for storing the MMIO generation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX. SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler. The nrips goofiness will be addressed in a future patch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code. Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.
Opportunistically use pr_warn_once() where appropriate.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1. Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_intercepted by L1. E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.
Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defer reloading the MMU after a EPTP successful EPTP switch. The VMFUNC
instruction itself is executed in the previous EPTP context, any side
effects, e.g. updating RIP, should occur in the old context. Practically
speaking, this bug is benign as VMX doesn't touch the MMU when skipping
an emulated instruction, nor does queuing a single-step #DB. No other
post-switch side effects exist.
Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new per-cpu stat 'nested_run' is introduced in order to track if L1 VM
is running or used to run L2 VM.
An example of the usage of 'nested_run' is to help the host administrator
to easily track if any L1 VM is used to run L2 VM. Suppose there is issue
that may happen with nested virtualization, the administrator will be able
to easily narrow down and confirm if the issue is due to nested
virtualization via 'nested_run'. For example, whether the fix like
commit 88dddc11a8 ("KVM: nVMX: do not use dangling shadow VMCS after
guest reset") is required.
Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210305225747.7682-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize x86_pmu.guest_get_msrs to return 0/NULL to handle the "nop"
case. Patching in perf_guest_get_msrs_nop() during setup does not work
if there is no PMU, as setup bails before updating the static calls,
leaving x86_pmu.guest_get_msrs NULL and thus a complete nop. Ultimately,
this causes VMX abort on VM-Exit due to KVM putting random garbage from
the stack into the MSR load list.
Add a comment in KVM to note that nr_msrs is valid if and only if the
return value is non-NULL.
Fixes: abd562df94 ("x86/perf: Use static_call for x86_pmu.guest_get_msrs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: syzbot+cce9ef2dd25246f815ee@syzkaller.appspotmail.com
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210309171019.1125243-1-seanjc@google.com
If lbr_desc->event is successfully created, the intel_pmu_create_
guest_lbr_event() will return 0, otherwise it will return -ENOENT,
and then jump to LBR msrs dummy handling.
Fixes: 1b5ac3226a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210223013958.1280444-1-like.xu@linux.intel.com>
[Add "< 0" and PTR_ERR to make the code clearer. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if enable_pml=1 PML remains enabled for the entire lifetime
of the VM irrespective of whether dirty logging is enable or disabled.
When dirty logging is disabled, all the pages of the VM are manually
marked dirty, so that PML is effectively non-operational. Setting
the dirty bits is an expensive operation which can cause severe MMU
lock contention in a performance sensitive path when dirty logging is
disabled after a failed or canceled live migration.
Manually setting dirty bits also fails to prevent PML activity if some
code path clears dirty bits, which can incur unnecessary VM-Exits.
In order to avoid this extra overhead, dynamically enable/disable PML
when dirty logging gets turned on/off for the first/last memslot.
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the facade of KVM's PML logic being vendor specific and move the
bits that aren't truly VMX specific into common x86 code. The MMU logic
for dealing with PML is tightly coupled to the feature and to VMX's
implementation, bouncing through kvm_x86_ops obfuscates the code without
providing any meaningful separation of concerns or encapsulation.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store the vendor-specific dirty log size in a variable, there's no need
to wrap it in a function since the value is constant after
hardware_setup() runs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally disable PML in vmcs02, KVM emulates PML purely in the
MMU, e.g. vmx_flush_pml_buffer() doesn't even try to copy the L2 GPAs
from vmcs02's buffer to vmcs12. At best, enabling PML is a nop. At
worst, it will cause vmx_flush_pml_buffer() to record bogus GFNs in the
dirty logs.
Initialize vmcs02.GUEST_PML_INDEX such that PML writes would trigger
VM-Exit if PML was somehow enabled, skip flushing the buffer for guest
mode since the index is bogus, and freak out if a PML full exit occurs
when L2 is active.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is not needed because the tweak was done on the guest_mmu, while
nested_ept_uninit_mmu_context has just changed vcpu->arch.walk_mmu
back to the root_mmu.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
trace_kvm_exit prints this value (using vmx_get_exit_info)
so it makes sense to read it before the trace point.
Fixes: dcf068da7e ("KVM: VMX: Introduce generic fastpath handler")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210217145718.1217358-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the restriction that prevents VMX from exposing INVPCID to the
guest without PCID also being exposed to the guest. The justification of
the restriction is that INVPCID will #UD if it's disabled in the VMCS.
While that is a true statement, it's also true that RDTSCP will #UD if
it's disabled in the VMCS. Neither of those things has any dependency
whatsoever on the guest being able to set CR4.PCIDE=1, which is what is
effectively allowed by exposing PCID to the guest.
Removing the bogus restriction aligns VMX with SVM, and also allows for
an interesting configuration. INVPCID is that fastest way to do a global
TLB flush, e.g. see native_flush_tlb_global(). Allowing INVPCID without
PCID would let a guest use the expedited flush while also limiting the
number of ASIDs consumed by the guest.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210212003411.1102677-4-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Advertise INVPCID by default (if supported by the host kernel) instead
of having both SVM and VMX opt in. INVPCID was opt in when it was a
VMX only feature so that KVM wouldn't prematurely advertise support
if/when it showed up in the kernel on AMD hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210212003411.1102677-3-seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always
available. As a preparation to allocating it dynamically, check that it is
not NULL at call sites which can normally proceed without it i.e. the
behavior is identical to the situation when Hyper-V emulation is not being
used by the guest.
When Hyper-V context for a particular vCPU is not allocated, we may still
need to get 'vp_index' from there. E.g. in a hypothetical situation when
Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V
style send-IPI hypercall may still be used. Luckily, vp_index is always
initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V
context is present. Introduce kvm_hv_get_vpindex() helper for
simplification.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a preparation to allocating Hyper-V context dynamically, make it clear
who's the user of the said context.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Spelling '&kvm->arch.hyperv' correctly is hard. Also, this makes the code
more consistent with vmx/svm where to_kvm_vmx()/to_kvm_svm() are already
being used.
Opportunistically change kvm_hv_msr_{get,set}_crash_{data,ctl}() and
kvm_hv_msr_set_crash_data() to take 'kvm' instead of 'vcpu' as these
MSRs are partition wide.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210126134816.1880136-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Push the injection of #GP up to the callers, so that they can just use
kvm_complete_insn_gp. __kvm_set_dr is pretty much what the callers can use
together with kvm_complete_insn_gp, so rename it to kvm_set_dr and drop
the old kvm_set_dr wrapper.
This also allows nested VMX code, which really wanted to use __kvm_set_dr,
to use the right function.
While at it, remove the kvm_require_dr() check from the SVM interception.
The APM states:
All normal exception checks take precedence over the SVM intercepts.
which includes the CR4.DE=1 #UD.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_get_dr and emulator_get_dr except an in-range value for the register
number so they cannot fail. Change the return type to void.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace a variety of open coded GPA checks with the recently introduced
common helpers.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that KVM is using static calls, calling vmx_vcpu_run and
vmx_sync_pir_to_irr does not incur anymore the cost of a
retpoline.
Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST
in vendor code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff(). Modify the
latter to return -EIO on fault so that KVM can invoke
kvm_spurious_fault() when appropriate. In addition to the obvious code
reuse, dropping kvm_cpu_vmxoff() also eliminates VMX's last usage of the
__ex()/__kvm_handle_fault_on_reboot() macros, thus helping pave the way
toward dropping them entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the Intel PT tracking outside of the VMXON/VMXOFF helpers so that
a future patch can drop KVM's kvm_cpu_vmxoff() in favor of the kernel's
cpu_vmxoff() without an associated PT functional change, and without
losing symmetry between the VMXON and VMXOFF flows.
Barring undocumented behavior, this should have no meaningful effects
as Intel PT behavior does not interact with CR4.VMXE.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace inline assembly in nested_vmx_check_vmentry_hw
with a call to __vmx_vcpu_run. The function is not
performance critical, so (double) GPR save/restore
in __vmx_vcpu_run can be tolerated, as far as performance
effects are concerned.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: dropped versioning info from changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A subsequent patch introduces macros in preparation for simplifying the
definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform
expands the coverage of the macros. Add vmx/svm prefix to the following
functions: update_exception_bitmap(), enable_nmi_window(),
enable_irq_window(), update_cr8_intercept and enable_smi_window().
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
to 0 when the condition (e.g. RTM) happens. The value can be used to
initialize dr6 and also be the XOR mask between the #DB exit
qualification (or payload) and DR6.
Concerning that DR6_INIT is used as initial value only once, rename it
to DR6_ACTIVE_LOW and apply it in other places, which would make the
incoming changes for bus lock debug exception more simple.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
[Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace could enable guest LBR feature when the exactly supported
LBR format value is initialized to the MSR_IA32_PERF_CAPABILITIES
and the LBR is also compatible with vPMU version and host cpu model.
The LBR could be enabled on the guest if host perf supports LBR
(checked via x86_perf_get_lbr()) and the vcpu model is compatible
with the host one.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-11-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to
indicate whether a guest LBR event is still needed by the vcpu. If the
vcpu no longer accesses LBR related registers within a scheduling time
slice, and the enable bit of LBR has been unset, vPMU will treat the
guest LBR event as a bland event of a vPMC counter and release it
as usual. Also, the pass-through state of LBR records msrs is cancelled.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-10-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current vPMU only supports Architecture Version 2. According to
Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if
IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual
PMI and the KVM would emulate to clear the LBR bit (bit 0) in
IA32_DEBUGCTL. Also, guest needs to re-enable IA32_DEBUGCTL.LBR
to resume recording branches.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-9-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the LBR records msrs has already been pass-through, there is no
need to call vmx_update_intercept_for_lbr_msrs() again and again, and
vice versa.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-8-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In addition to DEBUGCTLMSR_LBR, any KVM trap caused by LBR msrs access
will result in a creation of guest LBR event per-vcpu.
If the guest LBR event is scheduled on with the corresponding vcpu context,
KVM will pass-through all LBR records msrs to the guest. The LBR callstack
mechanism implemented in the host could help save/restore the guest LBR
records during the event context switches, which reduces a lot of overhead
if we save/restore tens of LBR msrs (e.g. 32 LBR records entries) in the
much more frequent VMX transitions.
To avoid reclaiming LBR resources from any higher priority event on host,
KVM would always check the exist of guest LBR event and its state before
vm-entry as late as possible. A negative result would cancel the
pass-through state, and it also prevents real registers accesses and
potential data leakage. If host reclaims the LBR between two checks, the
interception state and LBR records can be safely preserved due to native
save/restore support from guest LBR event.
The KVM emits a pr_warn() when the LBR hardware is unavailable to the
guest LBR event. The administer is supposed to reminder users that the
guest result may be inaccurate if someone is using LBR to record
hypervisor on the host side.
Suggested-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-7-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When vcpu sets DEBUGCTLMSR_LBR in the MSR_IA32_DEBUGCTLMSR, the KVM handler
would create a guest LBR event which enables the callstack mode and none of
hardware counter is assigned. The host perf would schedule and enable this
event as usual but in an exclusive way.
The guest LBR event will be released when the vPMU is reset but soon,
the lazy release mechanism would be applied to this event like a vPMC.
Suggested-by: Andi Kleen <ak@linux.intel.com>
Co-developed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-6-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES
MSR which tells about the record format stored in the LBR records.
The LBR will be enabled on the guest if host perf supports LBR
(checked via x86_perf_get_lbr()) and the vcpu model is compatible
with the host one.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES
MSR which tells about the record format stored in the LBR records.
The LBR will be enabled on the guest if host perf supports LBR
(checked via x86_perf_get_lbr()) and the vcpu model is compatible
with the host one.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Once MSR_IA32_PERF_CAPABILITIES is changed via vmx_set_msr(), the
value should not be changed by cpuid(). To ensure that the new value
is kept, the default initialization path is moved to intel_pmu_init().
The effective value of the MSR will be 0 if PDCM is clear, however.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make code responsibilities clear, we may resue and invoke the
vmx_set_intercept_for_msr() in other vmx-specific files (e.g. pmu_intel.c),
so expose it to passthrough LBR msrs later.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210201051039.255478-2-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the
svm_get/set_msr, so the x86 common part can be safely moved to VMX.
This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL.
Add vmx_supported_debugctl() to refactor the throwing logic of #GP.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com>
[Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception
to guest". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use x2apic_mode instead of x2apic_enabled() when adjusting the
destination ID during Posted Interrupt updates. This avoids the costly
RDMSR that is hidden behind x2apic_enabled().
Reported-by: luferry <luferry@163.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210115220354.434807-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Virtual Machine can exploit bus locks to degrade the performance of
system. Bus lock can be caused by split locked access to writeback(WB)
memory or by using locks on uncacheable(UC) memory. The bus lock is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores (which must wait for
the bus lock to be released before their memory operations can
complete).
To address the threat, bus lock VM exit is introduced to notify the VMM
when a bus lock was acquired, allowing it to enforce throttling or other
policy based mitigations.
A VMM can enable VM exit due to bus locks by setting a new "Bus Lock
Detection" VM-execution control(bit 30 of Secondary Processor-based VM
execution controls). If delivery of this VM exit was preempted by a
higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
exit reason in vmcs field is set to 1.
In current implementation, the KVM exposes this capability through
KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap
(i.e. off and exit) and enable it explicitly (disabled by default). If
bus locks in guest are detected by KVM, exit to user space even when
current exit reason is handled by KVM internally. Set a new field
KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there
is a bus lock detected in guest.
Document for Bus Lock VM exit is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".
Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32). The
full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in
bits 15:0, and single-bit modifiers in bits 31:16.
Historically, KVM has only had to worry about handling the "failed
VM-Entry" modifier, which could only be set in very specific flows and
required dedicated handling. I.e. manually stripping the FAILED_VMENTRY
bit was a somewhat viable approach. But even with only a single bit to
worry about, KVM has had several bugs related to comparing a basic exit
reason against the full exit reason store in vcpu_vmx.
Upcoming Intel features, e.g. SGX, will add new modifier bits that can
be set on more or less any VM-Exit, as opposed to the significantly more
restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off
flows isn't scalable. Tracking exit reason in a union forces code to
explicitly choose between consuming the full exit reason and the basic
exit, and is a convenient way to document and access the modifiers.
No functional change intended.
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
When this happens and the host has tsx=on, it is possible to end up with
virtual machines that have HLE and RTM disabled, but TSX_CTRL available.
If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
use the tsx=off hosts as migration destinations, even though the guests
do not have TSX enabled.
To allow this migration, allow guests to write to their TSX_CTRL MSR,
while keeping the host MSR unchanged for the entire life of the guests.
This ensures that TSX remains disabled and also saves MSR reads and
writes, and it's okay to do because with tsx=off we know that guests will
not have the HLE and RTM features in their CPUID. (If userspace sets
bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).
Cc: stable@vger.kernel.org
Fixes: cbbaa2727a ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode. Therefore we cannot
WARN in that case.
However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.
Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12. This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.
However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.
Fixes: 7952d769c2 ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On VMX, if we exit and then re-enter immediately without leaving
the vmx_vcpu_run() function, the kvm_entry event is not logged.
That means we will see one (or more) kvm_exit, without its (their)
corresponding kvm_entry, as shown here:
CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE
CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE
It also seems possible for a kvm_entry event to be logged, but then
we leave vmx_vcpu_run() right away (if vmx->emulation_required is
true). In this case, we will have a spurious kvm_entry event in the
trace.
Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
(where trace_kvm_exit() already is).
A trace obtained with this patch applied looks like this:
CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE
CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT
Of course, not calling trace_kvm_entry() in common x86 code any
longer means that we need to adjust the SVM side of things too.
Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as
0x0300 in the intel_perfmon_event_map[]. Correct its usage.
Fixes: 62079d8a43 ("KVM: PMU: add proper support for fixed counter 2")
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since we know vPMU will not work properly when (1) the guest bit_width(s)
of the [gp|fixed] counters are greater than the host ones, or (2) guest
requested architectural events exceeds the range supported by the host, so
we can setup a smaller left shift value and refresh the guest cpuid entry,
thus fixing the following UBSAN shift-out-of-bounds warning:
shift exponent 197 is too large for 64-bit type 'long long unsigned int'
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348
kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177
kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308
kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709
kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
vfs_ioctl fs/ioctl.c:48 [inline]
__do_sys_ioctl fs/ioctl.c:753 [inline]
__se_sys_ioctl fs/ioctl.c:739 [inline]
__x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
where the guest vCPU register state is updated and then the vCPU is VMRUN
to begin execution of the AP. For an SEV-ES guest, this won't work because
the guest register state is encrypted.
Following the GHCB specification, the hypervisor must not alter the guest
register state, so KVM must track an AP/vCPU boot. Should the guest want
to park the AP, it must use the AP Reset Hold exit event in place of, for
example, a HLT loop.
First AP boot (first INIT-SIPI-SIPI sequence):
Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
support. It is up to the guest to transfer control of the AP to the
proper location.
Subsequent AP boot:
KVM will expect to receive an AP Reset Hold exit event indicating that
the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
awaken it. When the AP Reset Hold exit event is received, KVM will place
the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
sequence, KVM will make the vCPU runnable. It is again up to the guest
to then transfer control of the AP to the proper location.
To differentiate between an actual HLT and an AP Reset Hold, a new MP
state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
placed in upon receiving the AP Reset Hold exit event. Additionally, to
communicate the AP Reset Hold exit event up to userspace (if needed), a
new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
a new function that, for non SEV-ES guests, invokes the original SIPI
delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
implements the logic above.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is possible to exit the nested guest mode, entered by
svm_set_nested_state prior to first vm entry to it (e.g due to pending event)
if the nested run was not pending during the migration.
In this case we must not switch to the nested msr permission bitmap.
Also add a warning to catch similar cases in the future.
Fixes: a7d5c7ce41 ("KVM: nSVM: delay MSR permission processing to first nested VM run")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
kvm_x86_ops function to take a struct kvm parameter so that the capability
can be reported at a VM level.
Since this op is also called during KVM initialization and before a struct
kvm instance is available, comments will be added to each implementation
of has_emulated_msr() to indicate the kvm parameter can be null.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will be used by SEV-ES to inject MSR failure via the GHCB.
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_machine_check to x86.h to avoid two exact copies
of the same function in kvm.c and svm.c.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029135600.122392-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until commit e7c587da12 ("x86/speculation: Use synthetic bits for
IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
Testing only Intel bits on VMX processors, or only AMD bits on SVM
processors, fails if the guests are created with the "opposite" vendor
as the host.
While at it, also tweak the host CPU check to use the vendor-agnostic
feature bit X86_FEATURE_IBPB, since we only care about the availability
of the MSR on the host here and not about specific CPUID bits.
Fixes: e7c587da12 ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
Cc: stable@vger.kernel.org
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Saves one byte in __vmx_vcpu_run for the same functionality.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20201029140457.126965-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On emulated VM-entry and VM-exit, update the CPUID bits that reflect
CR4.OSXSAVE and CR4.PKE.
This fixes a bug where the CPUID bits could continue to reflect L2 CR4
values after emulated VM-exit to L1. It also fixes a related bug where
the CPUID bits could continue to reflect L1 CR4 values after emulated
VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
CPUID is not a required intercept. However, it could also be relevant
to VMX, because the code to conditionally update these CPUID bits
assumes that the guest CPUID and the guest CR4 are always in sync.
Fixes: 8eb3f87d90 ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
Fixes: 2acf923e38 ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
Fixes: b9baba8614 ("KVM, pkeys: expose CPUID/CR4 to guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Dexuan Cui <dexuan.cui@intel.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Message-Id: <20201029170648.483210-1-jmattson@google.com>
This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.
A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial. In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.
The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.
This patch enables dirty ring for X86 only. However it should be
easily extended to other archs as well.
[1] https://patchwork.kernel.org/patch/10471409/
Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Originally, we have three code paths that can dirty a page without
vcpu context for X86:
- init_rmode_identity_map
- init_rmode_tss
- kvmgt_rw_gpa
init_rmode_identity_map and init_rmode_tss will be setup on
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.
To do this, allow __x86_set_memory_region() to return the userspace
address that just allocated to the caller. Then in both of the
functions we directly write to the userspace address instead of
calling kvm_write_*() APIs.
Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map() because no
matter what we'll write to the whole page with 4M huge page entries.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-4-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl but its output is now
independent from vCPU and in some cases VMMs may want to use it as a system
ioctl instead. In particular, QEMU doesn CPU feature expansion before any
vCPU gets created so KVM_GET_SUPPORTED_HV_CPUID can't be used.
Convert KVM_GET_SUPPORTED_HV_CPUID to 'dual' system/vCPU ioctl with the
same meaning.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Background: We have a lightweight HV, it needs INIT-VMExit and
SIPI-VMExit to wake-up APs for guests since it do not monitor
the Local APIC. But currently virtual wait-for-SIPI(WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit which cause
the L2 guest cannot wake up the APs.
According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.
In this patch:
1. introduce SIPI exit reason,
2. introduce wait-for-SIPI state for nVMX,
3. advertise wait-for-SIPI support to guest.
When L1 hypervisor is not monitoring Local APIC, L0 need to emulate
INIT-VMExit and SIPI-VMExit to L1 to emulate INIT-SIPI-SIPI for
L2. L2 LAPIC write would be traped by L0 Hypervisor(KVM), L0 should
emulate the INIT/SIPI vmexit to L1 hypervisor to set proper state
for L2's vcpu state.
Handle procdure:
Source vCPU:
L2 write LAPIC.ICR(INIT).
L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
vCPU.
Target vCPU:
L0 emulate an INIT VMExit to L1 if is guest mode.
L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
== WAIT_SIPI).
Source vCPU:
L2 write LAPIC.ICR(SIPI).
L0 trap LAPIC.ICR write(INIT): inject a latched SIPI event to traget
vCPU.
Target vCPU:
L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
L0 resume to L2.
L2 start-up.
Signed-off-by: Yadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_apic_init_signal_blocked is buggy in that it returns true
even in VMX non-root mode. In non-root mode, however, INITs
are not latched, they just cause a vmexit. Previously,
KVM was waiting for them to be processed when kvm_apic_accept_events
and in the meanwhile it ate the SIPIs that the processor received.
However, in order to implement the wait-for-SIPI activity state,
KVM will have to process KVM_APIC_SIPI in vmx_check_nested_events,
and it will not be possible anymore to disregard SIPIs in non-root
mode as the code is currently doing.
By calling kvm_x86_ops.nested_ops->check_events, we can force a vmexit
(with the side-effect of latching INITs) before incorrectly injecting
an INIT or SIPI in a guest, and therefore vmx_apic_init_signal_blocked
can do the right thing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework the common CR4 and SREGS checks to return a bool instead of an
int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
clarify the polarity of the return value (which is effectively inverted
by this change).
No functional changed intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4(). This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.
Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.
Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics. This will be remedied in a future patch.
Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop vmx_set_cr4()'s explicit check on the 'nested' module param now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, via kvm_cpu_caps. X86_FEATURE_VMX is set in kvm_cpu_caps
(by vmx_set_cpu_caps()), if and only if 'nested' is true.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop vmx_set_cr4()'s somewhat hidden guest_cpuid_has() check on VMXE now
that common x86 handles the check by incorporating VMXE into the CR4
reserved bits, i.e. in cr4_guest_rsvd_bits. This fixes a bug where KVM
incorrectly rejects KVM_SET_SREGS with CR4.VMXE=1 if it's executed
before KVM_SET_CPUID{,2}.
Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Reported-by: Stas Sergeev <stsp@users.sourceforge.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It was noticed that evmcs_sanitize_exec_ctrls() is not being executed
nowadays despite the code checking 'enable_evmcs' static key looking
correct. Turns out, static key magic doesn't work in '__init' section
(and it is unclear when things changed) but setup_vmcs_config() is called
only once per CPU so we don't really need it to. Switch to checking
'enlightened_vmcs' instead, it is supposed to be in sync with
'enable_evmcs'.
Opportunistically make evmcs_sanitize_exec_ctrls '__init' and drop unneeded
extra newline from it.
Reported-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201014143346.2430936-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
a host hang.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+T6RoUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMx2gf+PjoeMjLKtstdKDdiLFV46X7YdYKz
sUoDhpSbiLpEus5BF6OauUWwKgB7GcsoDUnLgjN5jqkAQzoFm0YOcI2GlXS999SL
5QIg6Vw5WF8X/7EVt6gxzC6KcWjbQvv38R/Ktd/0sMqRBPiZG7kVcWeXlopb9DaQ
Rdgg0hNVpgDiTNrBNl5RnM7Wz/SrOZmwaotW1LcII+BkCnj9Av77v77TxN9YuvG4
o+GMMQseFAzDjQ+jHZkHuBmPRy5dQB9ywzEIrUCubqhT04sWbQ6DhGfx45a0IgsY
33iT28omYdMVlRd/i3KcHQ86JJSo5g7pOqLwGd1L9HjNTS5VmQ8HXNJWBA==
=ECL9
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Two fixes for this merge window, and an unrelated bugfix for a host
hang"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: ioapic: break infinite recursion on lazy EOI
KVM: vmx: rename pi_init to avoid conflict with paride
KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build
allyesconfig results in:
ld: drivers/block/paride/paride.o: in function `pi_init':
(.text+0x1340): multiple definition of `pi_init'; arch/x86/kvm/vmx/posted_intr.o:posted_intr.c:(.init.text+0x0): first defined here
make: *** [Makefile:1164: vmlinux] Error 1
because commit:
commit 8888cdd099
Author: Xiaoyao Li <xiaoyao.li@intel.com>
Date: Wed Sep 23 11:31:11 2020 -0700
KVM: VMX: Extract posted interrupt support to separate files
added another pi_init(), though one already existed in the paride code.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes
For x86, also included in this pull request is a new alternative and
(in the future) more scalable implementation of extended page tables
that does not need a reverse map from guest physical addresses to
host physical addresses. For now it is disabled by default because
it is still lacking a few of the existing MMU's bells and whistles.
However it is a very solid piece of work and it is already available
for people to hammer on it.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+S8dsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM40Af+M46NJmuS5rcwFfybvK/c42KT6svX
Co1NrZDwzSQ2mMy3WQzH9qeLvb+nbY4sT3n5BPNPNsT+aIDPOTDt//qJ2/Ip9UUs
tRNea0MAR96JWLE7MSeeRxnTaQIrw/AAZC0RXFzZvxcgytXwdqBExugw4im+b+dn
Dcz8QxX1EkwT+4lTm5HC0hKZAuo4apnK1QkqCq4SdD2QVJ1YE6+z7pgj4wX7xitr
STKD6q/Yt/0ndwqS0GSGbyg0jy6mE620SN6isFRkJYwqfwLJci6KnqvEK67EcNMu
qeE017K+d93yIVC46/6TfVHzLR/D1FpQ8LZ16Yl6S13OuGIfAWBkQZtPRg==
=AD6a
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"For x86, there is a new alternative and (in the future) more scalable
implementation of extended page tables that does not need a reverse
map from guest physical addresses to host physical addresses.
For now it is disabled by default because it is still lacking a few of
the existing MMU's bells and whistles. However it is a very solid
piece of work and it is already available for people to hammer on it.
Other updates:
ARM:
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
kvm: x86/mmu: NX largepage recovery for TDP MMU
kvm: x86/mmu: Don't clear write flooding count for direct roots
kvm: x86/mmu: Support MMIO in the TDP MMU
kvm: x86/mmu: Support write protection for nesting in tdp MMU
kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
kvm: x86/mmu: Support dirty logging for the TDP MMU
kvm: x86/mmu: Support changed pte notifier in tdp MMU
kvm: x86/mmu: Add access tracking for tdp_mmu
kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
kvm: x86/mmu: Add TDP MMU PF handler
kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
KVM: Cache as_id in kvm_memory_slot
kvm: x86/mmu: Add functions to handle changed TDP SPTEs
kvm: x86/mmu: Allocate and free TDP MMU roots
kvm: x86/mmu: Init / Uninit the TDP MMU
kvm: x86/mmu: Introduce tdp_iter
KVM: mmu: extract spte.h and spte.c
KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
...
Intercept CR4 bits that are guest reserved so that KVM correctly injects
a #GP fault if the guest attempts to set a reserved bit. If a feature
is supported by the CPU but is not exposed to the guest, and its
associated CR4 bit is not intercepted by KVM by default, then KVM will
fail to inject a #GP if the guest sets the CR4 bit without triggering
an exit, e.g. by toggling only the bit in question.
Note, KVM doesn't give the guest direct access to any CR4 bits that are
also dependent on guest CPUID. Yet.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200930041659.28181-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that vcpu_after_set_cpuid() and update_exception_bitmap() are called
back-to-back, subsume the exception bitmap update into the common CPUID
update. Drop the SVM invocation entirely as SVM's exception bitmap
doesn't vary with respect to guest CPUID.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200930041659.28181-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will be used to signal an error to the userspace, in case
the vendor code failed during handling of this msr. (e.g -ENOMEM)
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201001112954.6258-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework the resetting of the MSR bitmap for x2APIC MSRs to ignore userspace
filtering. Allowing userspace to intercept reads to x2APIC MSRs when
APICV is fully enabled for the guest simply can't work; the LAPIC and thus
virtual APIC is in-kernel and cannot be directly accessed by userspace.
To keep things simple we will in fact forbid intercepting x2APIC MSRs
altogether, independent of the default_allow setting.
Cc: Alexander Graf <graf@amazon.com>
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201005195532.8674-3-sean.j.christopherson@intel.com>
[Modified to operate even if APICv is disabled, adjust documentation. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix an inverted flag for intercepting x2APIC MSRs and intercept writes
by default, even when APICV is enabled.
Fixes: 3eb900173c ("KVM: x86: VMX: Prevent MSR passthrough when MSR access is denied")
Co-developed-by: Peter Xu <peterx@redhat.com>
[sean: added changelog]
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201005195532.8674-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Most of the changes are cleanups and reorganization to make the objtool code
more arch-agnostic. This is in preparation for non-x86 support.
Fixes:
- KASAN fixes.
- Handle unreachable trap after call to noreturn functions better.
- Ignore unreachable fake jumps.
- Misc smaller fixes & cleanups.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+FgwIRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1juGw/6A6goA5/HHapM965yG1eY/rTLp3eIbcma
1ZbkUsP0YfT6wVUzw/sOeZzKNOwOq1FuMfkjuH2KcnlxlcMekIaKvLk8uauW4igM
hbFGuuZfZ0An5ka9iQ1W6HGdsuD3vVlN1w/kxdWk0c3lJCVQSTxdCfzF8fuF3gxX
lF3Bc1D/ZFcHIHT/hu/jeIUCgCYpD3qZDjQJBScSwVthZC+Fw6weLLGp2rKDaCao
HhSQft6MUfDrUKfH3LBIUNPRPCOrHo5+AX6BXxLXJVxqlwO/YU3e0GMwSLedMtBy
TASWo7/9GAp+wNNZe8EliyTKrfC3sLxN1QImfjuojxbBVXx/YQ/ToTt9fVGpF4Y+
XhhRFv9520v1tS2wPHIgQGwbh7EWG6mdrmo10RAs/31ViONPrbEZ4WmcA08b/5FY
KEkOVb18yfmDVzVZPpSc+HpIFkppEBOf7wPg27Bj3RTZmzIl/y+rKSnxROpsJsWb
R6iov7SFVET14lHl1G7tPNXfqRaS7HaOQIj3rSUyAP0ZfX+yIupVJp32dc6Ofg8b
SddUCwdIHoFdUNz4Y9csUCrewtCVJbxhV4MIdv0GpWbrgSw96RFZgetaH+6mGRpj
0Kh6M1eC3irDbhBuarWUBAr2doPAq4iOUeQU36Q6YSAbCs83Ws2uKOWOHoFBVwCH
uSKT0wqqG+E=
=KX5o
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"Most of the changes are cleanups and reorganization to make the
objtool code more arch-agnostic. This is in preparation for non-x86
support.
Other changes:
- KASAN fixes
- Handle unreachable trap after call to noreturn functions better
- Ignore unreachable fake jumps
- Misc smaller fixes & cleanups"
* tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf build: Allow nested externs to enable BUILD_BUG() usage
objtool: Allow nested externs to enable BUILD_BUG()
objtool: Permit __kasan_check_{read,write} under UACCESS
objtool: Ignore unreachable trap after call to noreturn functions
objtool: Handle calling non-function symbols in other sections
objtool: Ignore unreachable fake jumps
objtool: Remove useless tests before save_reg()
objtool: Decode unwind hint register depending on architecture
objtool: Make unwind hint definitions available to other architectures
objtool: Only include valid definitions depending on source file type
objtool: Rename frame.h -> objtool.h
objtool: Refactor jump table code to support other architectures
objtool: Make relocation in alternative handling arch dependent
objtool: Abstract alternative special case handling
objtool: Move macros describing structures to arch-dependent code
objtool: Make sync-check consider the target architecture
objtool: Group headers to check in a single list
objtool: Define 'struct orc_entry' only when needed
objtool: Skip ORC entry creation for non-text sections
objtool: Move ORC logic out of check()
...
The PFEC_MASK and PFEC_MATCH fields in the VMCS reverse the meaning of
the #PF intercept bit in the exception bitmap when they do not match.
This means that, if PFEC_MASK and/or PFEC_MATCH are set, the
hypervisor can get a vmexit for #PF exceptions even when the
corresponding bit is clear in the exception bitmap.
This is unexpected and is promptly detected by a WARN_ON_ONCE.
To fix it, reset PFEC_MASK and PFEC_MATCH when the #PF intercept
is disabled (as is common with enable_ept && !allow_smaller_maxphyaddr).
Reported-by: Qian Cai <cai@redhat.com>>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will introduce the concept of MSRs that may not be handled in kernel
space soon. Some MSRs are directly passed through to the guest, effectively
making them handled by KVM from user space's point of view.
This patch introduces all logic required to ensure that MSRs that
user space wants trapped are not marked as direct access for guests.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-7-graf@amazon.com>
[Replace "_idx" with "_slot". - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare vmx and svm for a subsequent change that ensures the MSR permission
bitmap is set to allow an MSR that userspace is tracking to force a vmx_vmexit
in the guest.
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
[agraf: rebase, adapt SVM scheme to nested changes that came in between]
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-5-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "index" to "slot" in struct vmx_uret_msr to align with the
terminology used by common x86's kvm_user_return_msrs, and to avoid
conflating "MSR's ECX index" with "MSR's index into an array".
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-16-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "vmx_msr_index" to "vmx_uret_msrs_list" to associate it with the
uret MSRs array, and to avoid conflating "MSR's ECX index" with "MSR's
index into an array". Similarly, don't use "slot" in the name as that
terminology is claimed by the common x86 "user_return_msrs" mechanism.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-15-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" to vmx_set_guest_msr() to explicitly associate it with the
guest_uret_msrs array, and to differentiate it from vmx_set_msr() as
well as VMX's load/store MSRs.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-14-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "find_msr_entry" to scope it to VMX and to associate it with
guest_uret_msrs. Drop the "entry" so that the function name pairs with
the existing __vmx_find_uret_msr(), which intentionally uses a double
underscore prefix instead of appending "index" or "slot" as those names
are already claimed by other pieces of the user return MSR stack.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-13-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add vmx_setup_uret_msr() to wrap the lookup and manipulation of the uret
MSRs array during setup_msrs(). In addition to consolidating code, this
eliminates move_msr_up(), which while being a very literally description
of the function, isn't exacly helpful in understanding the net effect of
the code.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-12-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move checking for the existence of MSR_EFER in the uret MSR array into
update_transition_efer() so that the lookup and manipulation of the
array in setup_msrs() occur back-to-back. This paves the way toward
adding a helper to wrap the lookup and manipulation.
To avoid unnecessary overhead, defer the lookup until the uret array
would actually be modified in update_transition_efer(). EFER obviously
exists on CPUs that support the dedicated VMCS fields for switching
EFER, and EFER must exist for the guest and host EFER.NX value to
diverge, i.e. there is no danger of attempting to read/write EFER when
it doesn't exist.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-11-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check for RDTSCP support prior to checking if MSR_TSC_AUX is in the uret
MSRs array so that the array lookup and manipulation are back-to-back.
This paves the way toward adding a helper to wrap the lookup and
manipulation.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename "__find_msr_index" to scope it to VMX, associate it with
guest_uret_msrs, and to avoid conflating "MSR's ECX index" with "MSR's
array index". Similarly, don't use "slot" in the name so as to avoid
colliding the common x86's half of "user_return_msrs" (the slot in
kvm_user_return_msrs is not the same slot in guest_uret_msrs).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" to "guest_msrs_ready" to explicitly associate it with the
"guest_uret_msrs" array, and replace "ready" with "loaded" to more
precisely reflect what it tracks, e.g. "ready" could be interpreted as
meaning ready for processing (setup_msrs() has run), which is wrong.
"loaded" also aligns with the similar "guest_state_loaded" field.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "uret" into the name of "save_nmsrs" to explicitly associate it with
the guest_uret_msrs array, and replace "save" with "active" (for lack of
a better word) to better describe what is being tracked. While "save"
is more or less accurate when viewed as a literal description of the
field, e.g. it holds the number of MSRs that were saved into the array
the last time setup_msrs() was invoked, it can easily be misinterpreted
by the reader, e.g. as meaning the number of MSRs that were saved from
hardware at some point in the past, or as the number of MSRs that need
to be saved at some point in the future, both of which are wrong.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename vcpu_vmx.nsmrs to vcpu_vmx.nr_uret_msrs to explicitly associate
it with the guest_uret_msrs array.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename struct "shared_msr_entry" to "vmx_uret_msr" to align with x86's
rename of "shared_msrs" to "user_return_msrs", and to call out that the
struct is specific to VMX, i.e. not part of the generic "shared_msrs"
framework. Abbreviate "user_return" as "uret" to keep line lengths
marginally sane and code more or less readable.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "loadstore" to vmx_find_msr_index() to differentiate it from the so
called shared MSRs helpers (which will soon be renamed), and replace
"index" with "slot" to better convey that the helper returns slot in the
array, not the MSR index (the value that gets stuffed into ECX).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "MAX" to the LOADSTORE and so called SHARED MSR defines to make it
more clear that the define controls the array size, as opposed to the
actual number of valid entries that are in the array.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the "shared_msrs" mechanism, which is used to defer restoring
MSRs that are only consumed when running in userspace, to a more banal
but less likely to be confusing "user_return_msrs".
The "shared" nomenclature is confusing as it's not obvious who is
sharing what, e.g. reasonable interpretations are that the guest value
is shared by vCPUs in a VM, or that the MSR value is shared/common to
guest and host, both of which are wrong.
"shared" is also misleading as the MSR value (in hardware) is not
guaranteed to be shared/reused between VMs (if that's indeed the correct
interpretation of the name), as the ability to share values between VMs
is simply a side effect (albiet a very nice side effect) of deferring
restoration of the host value until returning from userspace.
"user_return" avoids the above confusion by describing the mechanism
itself instead of its effects.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Read vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO only if the
VM-Exit is being reflected to L1 now that they are no longer passed
directly to the kvm_nested_vmexit tracepoint.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the newly introduced TRACE_EVENT_KVM_EXIT to define the guts of
kvm_nested_vmexit so that it captures and prints the same information as
kvm_exit. This has the bonus side effect of fixing the interrupt info
and error code printing for the case where they're invalid, e.g. if the
exit was a failed VM-Entry. This also sets the stage for retrieving
EXIT_QUALIFICATION and VM_EXIT_INTR_INFO in nested_vmx_reflect_vmexit()
if and only if the VM-Exit is being routed to L1.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it add IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper, is_exception_with_error_code(), to provide the simple but
difficult to read code of checking for a valid exception with an error
code given a vmcs.VM_EXIT_INTR_INFO value. The helper will gain another
user, vmx_get_exit_info(), in a future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_rip_read() to read the guest's RIP for the nested VM-Exit
tracepoint instead of having the caller pass in an argument. Params
that are passed into a tracepoint are evaluated even if the tracepoint
is disabled, i.e. passing in RIP for VMX incurs a VMREAD and retpoline
to retrieve a value that may never be used, e.g. if the exit is due to a
hardware interrupt.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if KVM attempts to switch to the currently loaded VMCS. Now that
nested_vmx_free_vcpu() doesn't blindly call vmx_switch_vmcs(), all paths
that lead to vmx_switch_vmcs() are implicitly guarded by guest vs. host
mode, e.g. KVM should never emulate VMX instructions when guest mode is
active, and nested_vmx_vmexit() should never be called when host mode is
active.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the explicit switch to vmcs01 and the call to free_nested() in
nested_vmx_free_vcpu(). free_nested(), which is called unconditionally
by vmx_leave_nested(), ensures vmcs01 is loaded prior to freeing vmcs02
and friends.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a WARN in free_nested() to ensure vmcs01 is loaded prior to freeing
vmcs02 and friends, and explicitly switch to vmcs01 if it's not. KVM is
supposed to keep is_guest_mode() and loaded_vmcs==vmcs02 synchronized,
but bugs happen and freeing vmcs02 while it's in use will escalate a KVM
error to a use-after-free and potentially crash the kernel.
Do the WARN and switch even in the !vmxon case to help detect latent
bugs. free_nested() is not a hot path, and the check is cheap.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move free_nested() down below vmx_switch_vmcs() so that a future patch
can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not
the loaded VMCS when freeing nested resources.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Call guest_state_valid() directly instead of querying emulation_required
when checking if L1 is attempting VM-Enter with invalid guest state.
If emulate_invalid_guest_state is false, KVM will fixup segment regs to
avoid emulation and will never set emulation_required, i.e. KVM will
incorrectly miss the associated consistency checks because the nested
path stuffs segments directly into vmcs02.
Opportunsitically add Consistency Check tracing to make future debug
suck a little less.
Fixes: 2bb8cafea8 ("KVM: vVMX: signal failure for nested VMEntry if emulation_required")
Fixes: 3184a995f7 ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM
expects vmcs01 to be loaded when is_guest_mode() is false.
Fixes: 671ddc700f ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: stable@vger.kernel.org
Cc: Dan Cross <dcross@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly reset the segment cache after stuffing guest segment regs in
prepare_vmcs02_rare(). Although the cache is reset when switching to
vmcs02, there is nothing that prevents KVM from re-populating the cache
prior to writing vmcs02 with vmcs12's values. E.g. if the vCPU is
preempted after switching to vmcs02 but before prepare_vmcs02_rare(),
kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl()
and cache the stale vmcs02 value. While the current code base only
caches stale data in the preemption case, it's theoretically possible
future code could read a segment register during the nested flow itself,
i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(),
although it did introduce the bug.
This manifests as an unexpected nested VM-Enter failure when running
with unrestricted guest disabled if the above preemption case coincides
with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3
to to a L2 vCPU at CPL0. stack_segment_valid() will see the new SS_SEL
but the old SS_AR_BYTES and incorrectly mark the guest state as invalid
due to SS.dpl != SS.rpl.
Don't bother updating the cache even though prepare_vmcs02_rare() writes
every segment. With unrestricted guest, guest segments are almost never
read, let alone L2 guest segments. On the other hand, populating the
cache requires a large number of memory writes, i.e. it's unlikely to be
a net win. Updating the cache would be a win when unrestricted guest is
not supported, as guest_state_valid() will immediately cache all segment
registers. But, nested virtualization without unrestricted guest is
dirt slow, saving some VMREADs won't change that, and every CPU
manufactured in the last decade supports unrestricted guest. In other
words, the extra (minor) complexity isn't worth the trouble.
Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL
depending on when preemption occurs. This is "ok" in that the usage is
imperfect by nature, i.e. it's used heuristically to improve performance
but doesn't affect functionality. kvm_arch_vcpu_put() could be "fixed"
by also disabling preemption while loading segments, but that's
pointless and misleading as reading state from kvm_sched_{in,out}() is
guaranteed to see stale data in one form or another. E.g. even if all
the usage of regs_avail is fixed to call kvm_register_mark_available()
after the associated state is set, the individual state might still be
stale with respect to the overall vCPU state. I.e. making functional
decisions in an asynchronous hook is doomed from the get go. Thankfully
KVM doesn't do that.
Fixes: de63ad4cf4 ("KVM: X86: implement the logic for spinlock optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename ops.h to vmx_ops.h to allow adding a tdx_ops.h in the future
without causing massive confusion.
Trust Domain Extensions (TDX) is built on VMX, but KVM cannot directly
access the VMCS(es) for a TDX guest, thus TDX will need its own "ops"
implementation for wrapping the low level operations.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183112.3030-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the posted interrupt code so that it can be reused for Trust
Domain Extensions (TDX), which requires posted interrupts and can use
KVM VMX's implementation almost verbatim. TDX is different enough from
raw VMX that it is highly desirable to implement the guts of TDX in a
separate file, i.e. reusing posted interrupt code by shoving TDX support
into vmx.c would be a mess.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923183112.3030-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper function and several wrapping macros to consolidate the
copy-paste code in vmx_compute_secondary_exec_control() for adjusting
controls that are dependent on guest CPUID bits.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200925003011.21016-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
preparation for consolidating the logic for adjusting secondary exec
controls based on the guest CPUID model.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If PCID is not exposed to the guest, clear INVPCID in the guest's CPUID
even if the VMCS INVPCID enable is not supported. This will allow
consolidating the secondary execution control adjustment code without
having to special case INVPCID.
Technically, this fixes a bug where !CPUID.PCID && CPUID.INVCPID would
result in unexpected guest behavior (#UD instead of #GP/#PF), but KVM
doesn't support exposing INVPCID if it's not supported in the VMCS, i.e.
such a config is broken/bogus no matter what.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename helpers for a few controls to conform to the more prevelant style
of cpu_has_vmx_<feature>(). Consistent names will allow adding macros
to consolidate the boilerplate code for adjusting secondary execution
controls.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_vcpu_is_illegal_gpa() to check for a legal GPA when validating a
PT output base instead of open coding a clever, but difficult to read,
variant. Code readability is far more important than shaving a few uops
in a slow path.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it
to cpuid.h so that's it's colocated with cpuid_maxphyaddr(). The helper
is not MMU specific and will gain a user that is completely unrelated to
the MMU in a future patch.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the subtly not-a-constant MSR_IA32_RTIT_OUTPUT_BASE_MASK with a
proper helper function to check whether or not the specified base is
valid. Blindly referencing the local 'vcpu' is especially nasty.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use cpuid_maxphyaddr() instead of cpuid_query_maxphyaddr() for the
RTIT base MSR check. There is no reason to recompute MAXPHYADDR as the
precomputed version is synchronized with CPUID updates, and
MSR_IA32_RTIT_OUTPUT_BASE is not written between stuffing CPUID and
refreshing vcpu->arch.maxphyaddr.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200924194250.19137-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The INVD instruction is emulated as a NOP, just skip the instruction
instead.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <addd41be2fbf50f5f4059e990a2a0cff182d2136.1600972918.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add tracepoints for the early consistency checks in nested_vmx_run().
The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as
there is no architectural way to check VMCS.LAUNCH_STATE, and subtle
bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in
the L1 VMM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812180615.22372-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On successful nested VM-Enter, check for pending interrupts and convert
the highest priority interrupt to a pending posted interrupt if it
matches L2's notification vector. If the vCPU receives a notification
interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
VM-Enter), the pending interrupt (for L1) should be recognized and
processed as a posted interrupt when interrupts become unblocked after
VM-Enter to L2.
This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
trying to inject an interrupt into L2 by setting the appropriate bit in
L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
method of manually moving the vector from PIR->vIRR/RVI). KVM will
observe the IPI while the vCPU is in L1 context and so won't immediately
morph it to a posted interrupt for L2. The pending interrupt will be
seen by vmx_check_nested_events(), cause KVM to force an immediate exit
after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
After handling the VM-Exit, L1 will see that L2 has a pending interrupt
in PIR, send another IPI, and repeat until L2 is killed.
Note, posted interrupts require virtual interrupt deliveriy, and virtual
interrupt delivery requires exit-on-interrupt, ergo interrupts will be
unconditionally unmasked on VM-Enter if posted interrupts are enabled.
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812175129.12172-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rework NMI VM-Exit handling to invoke the kernel handler by function
call instead of INTn. INTn microcode is relatively expensive, and
aligning the IRQ and NMI handling will make it easier to update KVM
should some newfangled method for invoking the handlers come along.
Suggested-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915191505.10355-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the asm blob that invokes the appropriate IRQ handler after VM-Exit
into a proper subroutine. Unconditionally create a stack frame in the
subroutine so that, as objtool sees things, the function has standard
stack behavior. The dynamic stack adjustment makes using unwind hints
problematic.
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915191505.10355-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
more generic is_emulatable(), and unconditionally call the new function
in x86_emulate_instruction().
KVM will use the generic hook to support multiple security related
technologies that prevent emulation in one way or another. Similar to
the existing AMD #NPF case where emulation of the current instruction is
not possible due to lack of information, AMD's SEV-ES and Intel's SGX
and TDX will introduce scenarios where emulation is impossible due to
the guest's register state being inaccessible. And again similar to the
existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
i.e. outside of the control of vendor-specific code.
While the cause and architecturally visible behavior of the various
cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
resume or complete shutdown, and SEV-ES and TDX "return" an error, the
impact on the common emulation code is identical: KVM must stop
emulation immediately and resume the guest.
Query is_emulatable() in handle_ud() as well so that the
force_emulation_prefix code doesn't incorrectly modify RIP before
calling emulate_instruction() in the absurdly unlikely scenario that
KVM encounters forced emulation in conjunction with "do not emulate".
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, prepare_vmcs02_early() does not check if the "unrestricted guest"
VM-execution control in vmcs12 is turned off and leaves the corresponding
bit on in vmcs02. Due to this setting, vmentry checks which are supposed to
render the nested guest state as invalid when this VM-execution control is
not set, are passing in hardware.
This patch turns off the "unrestricted guest" VM-execution control in vmcs02
if vmcs12 has turned it off.
Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200921081027.23047-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
INVPCID instruction handling is mostly same across both VMX and
SVM. So, move the code to common x86.c.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All the checks in lapic_timer_int_injected(), __kvm_wait_lapic_expire(), and
these function calls waste cpu cycles when the timer mode is not tscdeadline.
We can observe ~1.3% world switch time overhead by kvm-unit-tests/vmexit.flat
vmcall testing on AMD server. This patch reduces the world switch latency
caused by timer_advance_ns feature when the timer mode is not tscdeadline by
simpling move the check against apic->lapic_timer.expired_tscdeadline much
earlier.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1599731444-3525-7-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The nested VMX controls MSRs can be initialized by the global capability
values stored in vmcs_config.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-6-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
system ioctl KVM_GET_MSR. Add them to the setup of nested VMX controls MSR.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
with a better API in 5.10 or 5.11, for now this is a fix
that works with existing userspace but keeps the current
ugly API.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9ufLMUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMe9AgAgU3YQ2SktkqEOXjHMLqCH5Y3PKFI
S2anYpoKlH36Q6kzoqtkCj0GVagvdh5+Envz3I/tMdhv3Y/JgZaX1wHAe4cUl9BT
VyoiDBTWkhYRmpUbLYA8AtmgxQw1Hp8srH86rnvVGmLG6zdAa/rgUAKiQgT688Ej
CQvF5H7Zi3viPo2rInNSkgTIgewduqSWkwJ6+h4AQMmNJpbRaeZs45yMYyyu/FIi
hUazy7Rwk2vkWcuTd/sqH9b9y3VCYpN9juRaehEiK8qxXT3ydTU4Tub25BHmvXdr
dx5pShG4P3nAGnfV1qKAemyQcY7sjfMieqN1F3QcsRcxqZgySUm11o2JRw==
=sHsX
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
"Five small fixes.
The nested migration bug will be fixed with a better API in 5.10 or
5.11, for now this is a fix that works with existing userspace but
keeps the current ugly API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Add a dedicated INVD intercept routine
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: x86: fix MSR_IA32_TSC read for nested migration
selftests: kvm: Fix assert failure in single-step test
KVM: x86: VMX: Make smaller physical guest address space support user-configurable
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
Since smaller physical address spaces are only supported on VMX, the
parameter is only exposed in the kvm_intel module.
For now disable support by default, and let the user decide if they want
to enable it.
Modifications to VMX page fault and EPT violation handling will depend
on whether that parameter is enabled.
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Message-Id: <20200903141122.72908-1-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced
(dirty logging, for example)
- Fix tracing output of 64bit values
x86:
- nSVM state restore fixes
- Async page fault fixes
- Lots of small fixes everywhere
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl9dM5kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM+Iwf+LbISO7ccpPMK1kKtOeug/jZv+xQA
sVaBGRzYo+k2e0XtV8E8IV4N30FBtYSwXsbBKkMAoy2FpmMebgDWDQ7xspb6RJMS
/y8t1iqPwdOaLIkUkgc7UihSTlZm05Es3f3q6uZ9+oaM4Fe+V7xWzTUX4Oy89JO7
KcQsTD7pMqS4bfZGADK781ITR/WPgCi0aYx5s6dcwcZAQXhb1K1UKEjB8OGKnjUh
jliReJtxRA16rjF+S5aJ7L07Ce/ksrfwkI4NXJ4GxW+lyOfVNdSBJUBaZt1m7G2M
1We5+i5EjKCjuxmgtUUUfVdazpj1yl+gBGT7KKkLte9T9WZdXyDnixAbvg==
=OFb3
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"A bit on the bigger side, mostly due to me being on vacation, then
busy, then on parental leave, but there's nothing worrisome.
ARM:
- Multiple stolen time fixes, with a new capability to match x86
- Fix for hugetlbfs mappings when PUD and PMD are the same level
- Fix for hugetlbfs mappings when PTE mappings are enforced (dirty
logging, for example)
- Fix tracing output of 64bit values
x86:
- nSVM state restore fixes
- Async page fault fixes
- Lots of small fixes everywhere"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: emulator: more strict rsm checks.
KVM: nSVM: more strict SMM checks when returning to nested guest
SVM: nSVM: setup nested msr permission bitmap on nested state load
SVM: nSVM: correctly restore GIF on vmexit from nesting after migration
x86/kvm: don't forget to ACK async PF IRQ
x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macro
KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit
KVM: SVM: avoid emulation with stale next_rip
KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN
KVM: SVM: Periodically schedule when unregistering regions on destroy
KVM: MIPS: Change the definition of kvm type
kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed
KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control
KVM: fix memory leak in kvm_io_bus_unregister_dev()
KVM: Check the allocation of pv cpu mask
KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
KVM: arm64: Update page shift if stage 2 block mapping not supported
KVM: arm64: Fix address truncation in traces
KVM: arm64: Do not try to map PUDs when they are folded into PMD
arm64/x86: KVM: Introduce steal-time cap
...
According to SDM 27.2.4, Event delivery causes an APIC-access VM exit.
Don't report internal error and freeze guest when event delivery causes
an APIC-access exit, it is handleable and the event will be re-injected
during the next vmentry.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
in exit_ctls_high.
Fixes: 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3. It loads them into
vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.
At the subsequent assumed reentry into L2, the mmu will call
vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
if the L2 CRn write intercept always resumes L2.
The resume path calls vmx_check_nested_events which checks for
exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending it will
reflect the corresponding exit into L1. Live migration at this point
would also cause a missed immediate reentry into L2.
After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
values stored at last L2 exit. A repro of this bug showed L2 entering
triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
When L2 is in PAE paging mode add a call to ept_load_pdptrs before
leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
vcpu->arch.walk_mmu->pdptrs[].
Tested:
kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
Verified that test fails without the fix.
Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
failures.
Signed-off-by: Peter Shier <pshier@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200820230545.2411347-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Header frame.h is getting more code annotations to help objtool analyze
object files.
Rename the file to objtool.h.
[ jpoimboe: add objtool.h to MAINTAINERS ]
Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
x86:
* Report last CPU for debugging
* Emulate smaller MAXPHYADDR in the guest than in the host
* .noinstr and tracing fixes from Thomas
* nested SVM page table switching optimization and fixes
Generic:
* Unify shadow MMU cache data structures across architectures
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl8pC+oUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcOwgAjomqtEqQNlp7DdZT7VyyklzbxX1/
ud7v+oOJ8K4sFlf64lSthjPo3N9rzZCcw+yOXmuyuITngXOGc3tzIwXpCzpLtuQ1
WO1Ql3B/2dCi3lP5OMmsO1UAZqy9pKLg1dfeYUPk48P5+p7d/NPmk+Em5kIYzKm5
JsaHfCp2EEXomwmljNJ8PQ1vTjIQSSzlgYUBZxmCkaaX7zbEUMtxAQCStHmt8B84
33LczwXBm3viSWrzsoBV37I70+tseugiSGsCfUyupXOvq55d6D9FCqtCb45Hn4Vh
Ik8ggKdalsk/reiGEwNw1/3nr6mRMkHSbl+Mhc4waOIFf9dn0urgQgOaDg==
=YVx0
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"s390:
- implement diag318
x86:
- Report last CPU for debugging
- Emulate smaller MAXPHYADDR in the guest than in the host
- .noinstr and tracing fixes from Thomas
- nested SVM page table switching optimization and fixes
Generic:
- Unify shadow MMU cache data structures across architectures"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
KVM: SVM: Fix sev_pin_memory() error handling
KVM: LAPIC: Set the TDCR settable bits
KVM: x86: Specify max TDP level via kvm_configure_mmu()
KVM: x86/mmu: Rename max_page_level to max_huge_page_level
KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
KVM: VMX: Make vmx_load_mmu_pgd() static
KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
KVM: VMX: Drop a duplicate declaration of construct_eptp()
KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
KVM: Using macros instead of magic values
MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
KVM: VMX: Add guest physical address check in EPT violation and misconfig
KVM: VMX: introduce vmx_need_pf_intercept
KVM: x86: update exception bitmap on CPUID changes
KVM: x86: rename update_bp_intercept to update_exception_bitmap
...
this has been brought into a shape which is maintainable and actually
works.
This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out there
ship rogue kernel modules enabling FSGSBASE behind the kernels back which
opens an instantanious unpriviledged root hole.
The FSGSBASE instructions provide a considerable speedup of the context
switch path and enable user space to write GSBASE without kernel
interaction. This enablement requires careful handling of the exception
entries which go through the paranoid entry path as they cannot longer rely
on the assumption that user GSBASE is positive (as enforced via prctl() on
non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and
exceptions) can still just utilize SWAPGS unconditionally when the entry
comes from user space. Converting these entries to use FSGSBASE has no
benefit as SWAPGS is only marginally slower than WRGSBASE and locating and
retrieving the kernel GSBASE value is not a free operation either. The real
benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes.
The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pGnoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTYJD/9873GkwvGcc/Vq/dJH1szGTgFftPyZ
c/Y9gzx7EGBPLo25BS820L+ZlynzXHDxExKfCEaD10TZfe5XIc1vYNR0J74M2NmK
IBgEDstJeW93ai+rHCFRXIevhpzU4GgGYJ1MeeOgbVMN3aGU1g6HfzMvtF0fPn8Y
n6fsLZa43wgnoTdjwjjikpDTrzoZbaL1mbODBzBVPAaTbim7IKKTge6r/iCKrOjz
Uixvm3g9lVzx52zidJ9kWa8esmbOM1j0EPe7/hy3qH9DFo87KxEzjHNH3T6gY5t6
NJhRAIfY+YyTHpPCUCshj6IkRudE6w/qjEAmKP9kWZxoJrvPCTWOhCzelwsFS9b9
gxEYfsnaKhsfNhB6fi0PtWlMzPINmEA7SuPza33u5WtQUK7s1iNlgHfvMbjstbwg
MSETn4SG2/ZyzUrSC06lVwV8kh0RgM3cENc/jpFfIHD0vKGI3qfka/1RY94kcOCG
AeJd0YRSU2RqL7lmxhHyG8tdb8eexns41IzbPCLXX2sF00eKNkVvMRYT2mKfKLFF
q8v1x7yuwmODdXfFR6NdCkGm9IU7wtL6wuQ8Nhu9UraFmcXo6X6FLJC18FqcvSb9
jvcRP4XY/8pNjjf44JB8yWfah0xGQsaMIKQGP4yLv4j6Xk1xAQKH1MqcC7l1D2HN
5Z24GibFqSK/vA==
=QaAN
-----END PGP SIGNATURE-----
Merge tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fsgsbase from Thomas Gleixner:
"Support for FSGSBASE. Almost 5 years after the first RFC to support
it, this has been brought into a shape which is maintainable and
actually works.
This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out
there ship rogue kernel modules enabling FSGSBASE behind the kernels
back which opens an instantanious unpriviledged root hole.
The FSGSBASE instructions provide a considerable speedup of the
context switch path and enable user space to write GSBASE without
kernel interaction. This enablement requires careful handling of the
exception entries which go through the paranoid entry path as they
can no longer rely on the assumption that user GSBASE is positive (as
enforced via prctl() on non FSGSBASE enabled systemn).
All other entries (syscalls, interrupts and exceptions) can still just
utilize SWAPGS unconditionally when the entry comes from user space.
Converting these entries to use FSGSBASE has no benefit as SWAPGS is
only marginally slower than WRGSBASE and locating and retrieving the
kernel GSBASE value is not a free operation either. The real benefit
of RD/WRGSBASE is the avoidance of the MSR reads and writes.
The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver"
* tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/fsgsbase: Fix Xen PV support
x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase
selftests/x86/fsgsbase: Add a missing memory constraint
selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test
selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE
selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE
selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
x86/entry/64: Introduce the FIND_PERCPU_BASE macro
x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
x86/process/64: Use FSGSBASE instructions on thread copy and ptrace
x86/process/64: Use FSBSBASE in switch_to() if available
x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
...
to the generic code. Pretty much a straight forward 1:1 conversion plus the
consolidation of the KVM handling of pending work before entering guest
mode.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pEFgTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocEwD/474Eb7LzZ8yahyUBirWJP3k3qzgs9j
dZUxqB6LNuDOstEyTGLPdx1dmQP2vHbFfjoM7YBOH37EGcHsqjGliLvn2Y05ZD7O
6kYwjz6qVnJcm3IMtfSUn/8LkfO5pGUdKd3U5ngDmPLpkeaQ4nPKqiO0uIb0wzwa
cO7l10tG4YjMCWQxPNIaOh8kncLieQBediJPFjkQjV+Fh33kSU3LWTl3fccz6b5+
mgSUFL0qjQpp+Nl7lCaDQQiAop9GTUETfDtximRydZauiM2NpCfz+QBmQzq50Xv1
G3DWZoBIZBjmWJUgfSmS/s4GOYkBTBnT/fUcZmIDcgdRwvtEvRzIhcP87/wn7P3N
UKpLdHqmvA0BFDXZbNZgS362++29pj5Lnb+u3QbWSKQ9UqHN0NUlSY4wzfTLXsGp
Mzpp4TW0u/8kyOlo7wK3lVDgNJaPG31aiNVuDPgLe4cEluO5cq7/7g2GcFBqF1Ly
SqNGD1IccteNQTNvDopczPy7qUl5Lal+Ia06szNSPR48gLrvhSWdyYr2i1sD7vx4
hAhR0Hsi9dacGv46TrRw1OdDzq9bOW68G8GIgLJgDXaayPXLnx6TQEUjzQtIkE/i
ydTPUarp5QOFByt+RBjI90ZcW4RuLgMTOEVONPXtSn8IoCP2Kdg9u3gD9AmUW3Q2
JFkKMiSiJPGxlw==
=84y7
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 conversion to generic entry code from Thomas Gleixner:
"The conversion of X86 syscall, interrupt and exception entry/exit
handling to the generic code.
Pretty much a straight-forward 1:1 conversion plus the consolidation
of the KVM handling of pending work before entering guest mode"
* tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()
x86/kvm: Use generic xfer to guest work function
x86/entry: Cleanup idtentry_enter/exit
x86/entry: Use generic interrupt entry/exit code
x86/entry: Cleanup idtentry_entry/exit_user
x86/entry: Use generic syscall exit functionality
x86/entry: Use generic syscall entry function
x86/ptrace: Provide pt_regs helper for entry/exit
x86/entry: Move user return notifier out of loop
x86/entry: Consolidate 32/64 bit syscall entry
x86/entry: Consolidate check_user_regs()
x86: Correct noinstr qualifiers
x86/idtentry: Remove stale comment
Capture the max TDP level during kvm_configure_mmu() instead of using a
kvm_x86_ops hook to do it at every vCPU creation.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calculate the desired TDP level on the fly using the max TDP level and
MAXPHYADDR instead of doing the same when CPUID is updated. This avoids
the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and
also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior
across x86.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the WARN in vmx_load_mmu_pgd() that was temporarily added to aid
bisection/debug in the event the current MMU's shadow root level didn't
match VMX's computed EPTP level.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the shadow_root_level from the current MMU as the root level for the
PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between
VMX and the MMU where both must independently calculate the same root
level for things to work correctly. Temporarily keep VMX's calculation
of the level and use it to WARN if the incoming level diverges.
Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell,
and rename a 'cr3' param in the load_mmu_pgd prototype that managed to
survive the cr3 purge.
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make vmx_load_mmu_pgd() static as it is no longer invoked directly by
nested VMX (or any code for that matter).
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove an extra declaration of construct_eptp() from vmx.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
hdr.vmx.flags is meant for future extensions to the ABI, rejecting
invalid flags is necessary to avoid broken half-loads of the
nVMX state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A missing VMCS12 was not causing -EINVAL (it was just read with
copy_from_user, so it is not a security issue, but it is still
wrong). Test for VMCS12 validity and reject the nested state
if a VMCS12 is required but not present.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the generic infrastructure to check for and handle pending work before
transitioning into guest mode.
This now handles TIF_NOTIFY_RESUME as well which was ignored so
far. Handling it is important as this covers task work and task work will
be used to offload the heavy lifting of POSIX CPU timers to thread context.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
allows userspace to query if the underlying architecture would
support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
(e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X)
The complications in this patch are due to unexpected (but documented)
behaviour we see with NPF vmexit handling in AMD processor. If
SVM is modified to add guest physical address checks in the NPF
and guest #PF paths, we see the followning error multiple times in
the 'access' test in kvm-unit-tests:
test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
Dump mapping: address: 0x123400000000
------L4: 24c3027
------L3: 24c4027
------L2: 24c5021
------L1: 1002000021
This is because the PTE's accessed bit is set by the CPU hardware before
the NPF vmexit. This is handled completely by hardware and cannot be fixed
in software.
Therefore, availability of the new capability depends on a boolean variable
allow_smaller_maxphyaddr which is set individually by VMX and SVM init
routines. On VMX it's always set to true, on SVM it's only set to true
when NPT is not enabled.
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: Babu Moger <babu.moger@amd.com>
Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ignore non-present page faults, since those cannot have reserved
bits set.
When running access.flat with "-cpu Haswell,phys-bits=36", the
number of trapped page faults goes down from 8872644 to 3978948.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-9-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check guest physical address against its maximum, which depends on the
guest MAXPHYADDR. If the guest's physical address exceeds the
maximum (i.e. has reserved bits set), inject a guest page fault with
PFERR_RSVD_MASK set.
This has to be done both in the EPT violation and page fault paths, as
there are complications in both cases with respect to the computation
of the correct error code.
For EPT violations, unfortunately the only possibility is to emulate,
because the access type in the exit qualification might refer to an
access to a paging structure, rather than to the access performed by
the program.
Trapping page faults instead is needed in order to correct the error code,
but the access type can be obtained from the original error code and
passed to gva_to_gpa. The corrections required in the error code are
subtle. For example, imagine that a PTE for a supervisor page has a reserved
bit set. On a supervisor-mode access, the EPT violation path would trigger.
However, on a user-mode access, the processor will not notice the reserved
bit and not include PFERR_RSVD_MASK in the error code.
Co-developed-by: Mohammed Gamal <mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-8-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-7-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We would like to introduce a callback to update the #PF intercept
when CPUID changes. Just reuse update_bp_intercept since VMX is
already using update_exception_bitmap instead of a bespoke function.
While at it, remove an unnecessary assignment in the SVM version,
which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug)
and has nothing to do with the exception bitmap.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
state_test/smm_test selftests are failing on AMD with:
"Unexpected result from KVM_GET_MSRS, r: 51 (failed MSR was 0x345)"
MSR_IA32_PERF_CAPABILITIES is an emulated MSR on Intel but it is not
known to AMD code, we can move the emulation to common x86 code. For
AMD, we basically just allow the host to read and write zero to the MSR.
Fixes: 27461da310 ("KVM: x86/pmu: Support full width counting")
Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200710152559.1645827-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration",
2020-06-01) accidentally broke nVMX live migration from older version
by changing the userspace ABI. Restore it and, while at it, ensure
that vmx->nested.has_preemption_timer_deadline is always initialized
according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag.
Cc: Makarand Sonare <makarandsonare@google.com>
Fixes: 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
read/write_cr2() go throuh the paravirt XXL indirection, but nested VMX in
a XEN_PV guest is not supported.
Use the native variants.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Message-Id: <20200708195322.344731916@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the functions which are inside the RCU off region into the
non-instrumentable text section.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200708195322.037311579@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Entering guest mode is more or less the same as returning to user
space. From an instrumentation point of view both leave kernel mode and the
transition to guest or user mode reenables interrupts on the host. In user
mode an interrupt is served directly and in guest mode it causes a VM exit
which then handles or reinjects the interrupt.
The transition from guest mode or user mode to kernel mode disables
interrupts, which needs to be recorded in instrumentation to set the
correct state again.
This is important for e.g. latency analysis because otherwise the execution
time in guest or user mode would be wrongly accounted as interrupt disabled
and could trigger false positives.
Add hardirq tracing to guest enter/exit functions in the same way as it
is done in the user mode enter/exit code, respecting the RCU requirements.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200708195321.822002354@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Context tracking for KVM happens way too early in the vcpu_run()
code. Anything after guest_enter_irqoff() and before guest_exit_irqoff()
cannot use RCU and should also be not instrumented.
The current way of doing this covers way too much code. Move it closer to
the actual vmenter/exit code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200708195321.724574345@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To avoid complex and in some cases incorrect logic in
kvm_spec_ctrl_test_value, just try the guest's given value on the host
processor instead, and if it doesn't #GP, allow the guest to set it.
One such case is when host CPU supports STIBP mitigation
but doesn't support IBRS (as is the case with some Zen2 AMD cpus),
and in this case we were giving guest #GP when it tried to use STIBP
The reason why can can do the host test is that IA32_SPEC_CTRL msr is
passed to the guest, after the guest sets it to a non zero value
for the first time (due to performance reasons),
and as as result of this, it is pointless to emulate #GP condition on
this first access, in a different way than what the host CPU does.
This is based on a patch from Sean Christopherson, who suggested this idea.
Fixes: 6441fa6178 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The name of callback cpuid_update() is misleading that it's not about
updating CPUID settings of vcpu but updating the configurations of vcpu
based on the CPUIDs. So rename it to vcpu_after_set_cpuid().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add nested_vmx_fail() to wrap VM-Fail paths that _may_ result in VM-Fail
Valid to make it clear at the call sites that the Valid flavor isn't
guaranteed.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200609015607.6994-1-sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since this field is now in kvm_vcpu_arch, clean things up a little by
setting it in vendor-agnostic code: vcpu_enter_guest. Note that it
must be set after the call to kvm_x86_ops.run(), since it can't be
updated before pre_sev_run().
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-7-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both the vcpu_vmx structure and the vcpu_svm structure have a
'last_cpu' field. Move the common field into the kvm_vcpu_arch
structure. For clarity, rename it to 'last_vmentry_cpu.'
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-6-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
More often than not, a failed VM-entry in an x86 production
environment is induced by a defective CPU. To help identify the bad
hardware, include the id of the last logical CPU to run a vCPU in the
information provided to userspace on a KVM exit for failed VM-entry or
for KVM internal errors not associated with emulation. The presence of
this additional information is indicated by a new capability,
KVM_CAP_LAST_CPU.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-5-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As we already do in svm, record the last logical processor on which a
vCPU has run, so that it can be communicated to userspace for
potential hardware errors.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Message-Id: <20200603235623.245638-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logically the ignore_msrs and report_ignored_msrs should also apply to feature
MSRs. Add them in.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200622220442.21998-3-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it
from the non-nested dirty log hooks. And because it's a nested-only
operation.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if vmx_write_pml_buffer() is called outside of guest mode instead
of silently ignoring the condition. The only caller is nested EPT's
ept_update_accessed_dirty_bits(), which should only be reachable when
L2 is active.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the
CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and
open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks
are effectively VMX specific even if they're not named as such. This
avoids duplicate code, better documents the guest owned CR0 bit, and
eliminates the need for a build-time assertion to keep VMX and x86
synchronized.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark CR4.TSD as being possibly owned by the guest as that is indeed the
case on VMX. Without TSD being tagged as possibly owned by the guest, a
targeted read of CR4 to get TSD could observe a stale value. This bug
is benign in the current code base as the sole consumer of TSD is the
emulator (for RDTSC) and the emulator always "reads" the entirety of CR4
when grabbing bits.
Add a build-time assertion in to ensure VMX doesn't hand over more CR4
bits without also updating x86.
Fixes: 52ce3c21ae ("x86,kvm,vmx: Don't trap writes to CR4.TSD")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703040422.31536-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was
moved to common x86 code.
No functional change intended.
Fixes: 37486135d3 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove support for context switching between the guest's and host's
desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't
required for correct functionality, e.g. KVM intercepts reads and writes
to the MSR, and the latency effects of the settings controlled by the
MSR are not architecturally visible.
As a general rule, KVM should not allow the guest to control power
management settings unless explicitly enabled by userspace, e.g. see
KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2
can improve the performance of SMT siblings. A devious guest could
disable C0.2 so as to improve the performance of their workloads at the
detriment to workloads running in the host or on other VMs.
Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
condition where updates from the host may cause KVM to enter the guest
with the incorrect value. Because updates are are propagated to all
CPUs via IPI (SMP function callback), the value in hardware may be
stale with respect to the cached value and KVM could enter the guest
with the wrong value in hardware. As above, the guest can't observe the
bad value, but it's a weird and confusing wart in the implementation.
Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
lists. Using the lists is only necessary for MSRs that are required for
correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
old hardware, or for MSRs that need to-the-uop precision, e.g. perf
related MSRs. For UMWAIT_CONTROL, the effects are only visible in the
kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
vcpu_vmx_run(). Using the atomic lists is undesirable as they are more
expensive than direct RDMSR/WRMSR.
Furthermore, even if giving the guest control of the MSR is legitimate,
e.g. in pass-through scenarios, it's not clear that the benefits would
outweigh the overhead. E.g. saving and restoring an MSR across a VMX
roundtrip costs ~250 cycles, and if the guest diverged from the host
that cost would be paid on every run of the guest. In other words, if
there is a legitimate use case then it should be enabled by a new
per-VM capability.
Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
UMWAIT and UMONITOR.
Fixes: 6e3ba4abce ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
Cc: stable@vger.kernel.org
Cc: Jingqi Liu <jingqi.liu@intel.com>
Cc: Tao Xu <tao3.xu@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
intents and purposes is vmx_write_pml_buffer(), instead of having the
latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit
update is the result of KVM emulation (rare for L2), then the GPA in the
VMCS may be stale and/or hold a completely unrelated GPA.
Fixes: c5f983f6e8 ("nVMX: Implement emulated Page Modification Logging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Guest crashes are observed on a Cascade Lake system when 'perf top' is
launched on the host, e.g.
BUG: unable to handle kernel paging request at fffffe0000073038
PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
...
Call Trace:
serial8250_console_write+0xfe/0x1f0
call_console_drivers.constprop.0+0x9d/0x120
console_unlock+0x1ea/0x460
Call traces are different but the crash is imminent. The problem was
blindly bisected to the commit 041bc42ce2 ("KVM: VMX: Micro-optimize
vmexit time when not exposing PMU"). It was also confirmed that the
issue goes away if PMU is exposed to the guest.
With some instrumentation of the guest we can see what is being switched
(when we do atomic_switch_perf_msrs()):
vmx_vcpu_run: switching 2 msrs
vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2
The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame.
Regardless of whether PMU is exposed to the guest or not, PEBS needs to
be disabled upon switch.
This reverts commit 041bc42ce2.
Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200619094046.654019-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
save_fsgs_for_kvm() is invoked via
vcpu_enter_guest()
kvm_x86_ops.prepare_guest_switch(vcpu)
vmx_prepare_switch_to_guest()
save_fsgs_for_kvm()
with preemption disabled, but interrupts enabled.
The upcoming FSGSBASE based GS safe needs interrupts to be disabled. This
could be done in the helper function, but that function is also called from
switch_to() which has interrupts disabled already.
Disable interrupts inside save_fsgs_for_kvm() and rename the function to
current_save_fsgs() so it can be invoked from other places.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org
Add is_intr_type() and is_intr_type_n() to consolidate the boilerplate
code for querying a specific type of interrupt given an encoded value
from VMCS.VM_{ENTER,EXIT}_INTR_INFO, with and without an associated
vector respectively.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200609014518.26756-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This all started about 6 month ago with the attempt to move the Posix CPU
timer heavy lifting out of the timer interrupt code and just have lockless
quick checks in that code path. Trivial 5 patches.
This unearthed an inconsistency in the KVM handling of task work and the
review requested to move all of this into generic code so other
architectures can share.
Valid request and solved with another 25 patches but those unearthed
inconsistencies vs. RCU and instrumentation.
Digging into this made it obvious that there are quite some inconsistencies
vs. instrumentation in general. The int3 text poke handling in particular
was completely unprotected and with the batched update of trace events even
more likely to expose to endless int3 recursion.
In parallel the RCU implications of instrumenting fragile entry code came
up in several discussions.
The conclusion of the X86 maintainer team was to go all the way and make
the protection against any form of instrumentation of fragile and dangerous
code pathes enforcable and verifiable by tooling.
A first batch of preparatory work hit mainline with commit d5f744f9a2.
The (almost) full solution introduced a new code section '.noinstr.text'
into which all code which needs to be protected from instrumentation of all
sorts goes into. Any call into instrumentable code out of this section has
to be annotated. objtool has support to validate this. Kprobes now excludes
this section fully which also prevents BPF from fiddling with it and all
'noinstr' annotated functions also keep ftrace off. The section, kprobes
and objtool changes are already merged.
The major changes coming with this are:
- Preparatory cleanups
- Annotating of relevant functions to move them into the noinstr.text
section or enforcing inlining by marking them __always_inline so the
compiler cannot misplace or instrument them.
- Splitting and simplifying the idtentry macro maze so that it is now
clearly separated into simple exception entries and the more
interesting ones which use interrupt stacks and have the paranoid
handling vs. CR3 and GS.
- Move quite some of the low level ASM functionality into C code:
- enter_from and exit to user space handling. The ASM code now calls
into C after doing the really necessary ASM handling and the return
path goes back out without bells and whistels in ASM.
- exception entry/exit got the equivivalent treatment
- move all IRQ tracepoints from ASM to C so they can be placed as
appropriate which is especially important for the int3 recursion
issue.
- Consolidate the declaration and definition of entry points between 32
and 64 bit. They share a common header and macros now.
- Remove the extra device interrupt entry maze and just use the regular
exception entry code.
- All ASM entry points except NMI are now generated from the shared header
file and the corresponding macros in the 32 and 64 bit entry ASM.
- The C code entry points are consolidated as well with the help of
DEFINE_IDTENTRY*() macros. This allows to ensure at one central point
that all corresponding entry points share the same semantics. The
actual function body for most entry points is in an instrumentable
and sane state.
There are special macros for the more sensitive entry points,
e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
They allow to put the whole entry instrumentation and RCU handling
into safe places instead of the previous pray that it is correct
approach.
- The INT3 text poke handling is now completely isolated and the
recursion issue banned. Aside of the entry rework this required other
isolation work, e.g. the ability to force inline bsearch.
- Prevent #DB on fragile entry code, entry relevant memory and disable
it on NMI, #MC entry, which allowed to get rid of the nested #DB IST
stack shifting hackery.
- A few other cleanups and enhancements which have been made possible
through this and already merged changes, e.g. consolidating and
further restricting the IDT code so the IDT table becomes RO after
init which removes yet another popular attack vector
- About 680 lines of ASM maze are gone.
There are a few open issues:
- An escape out of the noinstr section in the MCE handler which needs
some more thought but under the aspect that MCE is a complete
trainwreck by design and the propability to survive it is low, this was
not high on the priority list.
- Paravirtualization
When PV is enabled then objtool complains about a bunch of indirect
calls out of the noinstr section. There are a few straight forward
ways to fix this, but the other issues vs. general correctness were
more pressing than parawitz.
- KVM
KVM is inconsistent as well. Patches have been posted, but they have
not yet been commented on or picked up by the KVM folks.
- IDLE
Pretty much the same problems can be found in the low level idle code
especially the parts where RCU stopped watching. This was beyond the
scope of the more obvious and exposable problems and is on the todo
list.
The lesson learned from this brain melting exercise to morph the evolved
code base into something which can be validated and understood is that once
again the violation of the most important engineering principle
"correctness first" has caused quite a few people to spend valuable time on
problems which could have been avoided in the first place. The "features
first" tinkering mindset really has to stop.
With that I want to say thanks to everyone involved in contributing to this
effort. Special thanks go to the following people (alphabetical order):
Alexandre Chartre
Andy Lutomirski
Borislav Petkov
Brian Gerst
Frederic Weisbecker
Josh Poimboeuf
Juergen Gross
Lai Jiangshan
Macro Elver
Paolo Bonzini
Paul McKenney
Peter Zijlstra
Vitaly Kuznetsov
Will Deacon
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7j510THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoU2WD/4refvaNm08fG7aiVYem3JJzr0+Pq5O
/opwnI/1D973ApApj5W/Nd53sN5tVqOiXncSKgywRBWZxRCAGjVYypl9rjpvXu4l
HlMjhEKBmWkDryxxrM98Vr7hl3hnId5laR56oFfH+G4LUsItaV6Uak/HfXZ4Mq1k
iYVbEtl2CN+KJjvSgZ6Y1l853Ab5mmGvmeGNHHWCj8ZyjF3cOLoelDTQNnsb0wXM
crKXBcXJSsCWKYyJ5PTvB82crQCET7Su+LgwK06w/ZbW1//2hVIjSCiN5o/V+aRJ
06BZNMj8v9tfglkN8LEQvRIjTlnEQ2sq3GxbrVtA53zxkzbBCBJQ96w8yYzQX0ux
yhqQ/aIZJ1wTYEjJzSkftwLNMRHpaOUnKvJndXRKAYi+eGI7syF61qcZSYGKuAQ/
bK3b/CzU6QWr1235oTADxh4isEwxA0Pg5wtJCfDDOG0MJ9ALMSOGUkhoiz5EqpkU
mzFAwfG/Uj7hRjlkms7Yj2OjZfnU7iypj63GgpXghLjr5ksRFKEOMw8e1GXltVHs
zzwghUjqp2EPq0VOOQn3lp9lol5Prc3xfFHczKpO+CJW6Rpa4YVdqJmejBqJy/on
Hh/T/ST3wa2qBeAw89vZIeWiUJZZCsQ0f//+2hAbzJY45Y6DuR9vbTAPb9agRgOM
xg+YaCfpQqFc1A==
=llba
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Thomas Gleixner:
"The x86 entry, exception and interrupt code rework
This all started about 6 month ago with the attempt to move the Posix
CPU timer heavy lifting out of the timer interrupt code and just have
lockless quick checks in that code path. Trivial 5 patches.
This unearthed an inconsistency in the KVM handling of task work and
the review requested to move all of this into generic code so other
architectures can share.
Valid request and solved with another 25 patches but those unearthed
inconsistencies vs. RCU and instrumentation.
Digging into this made it obvious that there are quite some
inconsistencies vs. instrumentation in general. The int3 text poke
handling in particular was completely unprotected and with the batched
update of trace events even more likely to expose to endless int3
recursion.
In parallel the RCU implications of instrumenting fragile entry code
came up in several discussions.
The conclusion of the x86 maintainer team was to go all the way and
make the protection against any form of instrumentation of fragile and
dangerous code pathes enforcable and verifiable by tooling.
A first batch of preparatory work hit mainline with commit
d5f744f9a2 ("Pull x86 entry code updates from Thomas Gleixner")
That (almost) full solution introduced a new code section
'.noinstr.text' into which all code which needs to be protected from
instrumentation of all sorts goes into. Any call into instrumentable
code out of this section has to be annotated. objtool has support to
validate this.
Kprobes now excludes this section fully which also prevents BPF from
fiddling with it and all 'noinstr' annotated functions also keep
ftrace off. The section, kprobes and objtool changes are already
merged.
The major changes coming with this are:
- Preparatory cleanups
- Annotating of relevant functions to move them into the
noinstr.text section or enforcing inlining by marking them
__always_inline so the compiler cannot misplace or instrument
them.
- Splitting and simplifying the idtentry macro maze so that it is
now clearly separated into simple exception entries and the more
interesting ones which use interrupt stacks and have the paranoid
handling vs. CR3 and GS.
- Move quite some of the low level ASM functionality into C code:
- enter_from and exit to user space handling. The ASM code now
calls into C after doing the really necessary ASM handling and
the return path goes back out without bells and whistels in
ASM.
- exception entry/exit got the equivivalent treatment
- move all IRQ tracepoints from ASM to C so they can be placed as
appropriate which is especially important for the int3
recursion issue.
- Consolidate the declaration and definition of entry points between
32 and 64 bit. They share a common header and macros now.
- Remove the extra device interrupt entry maze and just use the
regular exception entry code.
- All ASM entry points except NMI are now generated from the shared
header file and the corresponding macros in the 32 and 64 bit
entry ASM.
- The C code entry points are consolidated as well with the help of
DEFINE_IDTENTRY*() macros. This allows to ensure at one central
point that all corresponding entry points share the same
semantics. The actual function body for most entry points is in an
instrumentable and sane state.
There are special macros for the more sensitive entry points, e.g.
INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
They allow to put the whole entry instrumentation and RCU handling
into safe places instead of the previous pray that it is correct
approach.
- The INT3 text poke handling is now completely isolated and the
recursion issue banned. Aside of the entry rework this required
other isolation work, e.g. the ability to force inline bsearch.
- Prevent #DB on fragile entry code, entry relevant memory and
disable it on NMI, #MC entry, which allowed to get rid of the
nested #DB IST stack shifting hackery.
- A few other cleanups and enhancements which have been made
possible through this and already merged changes, e.g.
consolidating and further restricting the IDT code so the IDT
table becomes RO after init which removes yet another popular
attack vector
- About 680 lines of ASM maze are gone.
There are a few open issues:
- An escape out of the noinstr section in the MCE handler which needs
some more thought but under the aspect that MCE is a complete
trainwreck by design and the propability to survive it is low, this
was not high on the priority list.
- Paravirtualization
When PV is enabled then objtool complains about a bunch of indirect
calls out of the noinstr section. There are a few straight forward
ways to fix this, but the other issues vs. general correctness were
more pressing than parawitz.
- KVM
KVM is inconsistent as well. Patches have been posted, but they
have not yet been commented on or picked up by the KVM folks.
- IDLE
Pretty much the same problems can be found in the low level idle
code especially the parts where RCU stopped watching. This was
beyond the scope of the more obvious and exposable problems and is
on the todo list.
The lesson learned from this brain melting exercise to morph the
evolved code base into something which can be validated and understood
is that once again the violation of the most important engineering
principle "correctness first" has caused quite a few people to spend
valuable time on problems which could have been avoided in the first
place. The "features first" tinkering mindset really has to stop.
With that I want to say thanks to everyone involved in contributing to
this effort. Special thanks go to the following people (alphabetical
order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra,
Vitaly Kuznetsov, and Will Deacon"
* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
x86/entry: Force rcu_irq_enter() when in idle task
x86/entry: Make NMI use IDTENTRY_RAW
x86/entry: Treat BUG/WARN as NMI-like entries
x86/entry: Unbreak __irqentry_text_start/end magic
x86/entry: __always_inline CR2 for noinstr
lockdep: __always_inline more for noinstr
x86/entry: Re-order #DB handler to avoid *SAN instrumentation
x86/entry: __always_inline arch_atomic_* for noinstr
x86/entry: __always_inline irqflags for noinstr
x86/entry: __always_inline debugreg for noinstr
x86/idt: Consolidate idt functionality
x86/idt: Cleanup trap_init()
x86/idt: Use proper constants for table size
x86/idt: Add comments about early #PF handling
x86/idt: Mark init only functions __init
x86/entry: Rename trace_hardirqs_off_prepare()
x86/entry: Clarify irq_{enter,exit}_rcu()
x86/entry: Remove DBn stacks
x86/entry: Remove debug IDT frobbing
x86/entry: Optimize local_db_save() for virt
...
Using a topic branch so that stable branches can simply cherry-pick the
patch.
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON,
when determining whether a nested VM-Exit should be reflected into L1 or
handled by KVM in L0.
For better or worse, the switch statement in nested_vmx_exit_reflected()
currently defaults to "true", i.e. reflects any nested VM-Exit without
dedicated logic. Because the case statements only contain the basic
exit reason, any VM-Exit with modifier bits set will be reflected to L1,
even if KVM intended to handle it in L0.
Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY,
i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to
L1, as "failed VM-Entry" is the only modifier that KVM can currently
encounter. The SMM modifiers will never be generated as KVM doesn't
support/employ a SMI Transfer Monitor. Ditto for "exit from enclave",
as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to
enter an enclave in a KVM guest (L1 or L2).
Fixes: 644d711aa0 ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit")
Cc: Jim Mattson <jmattson@google.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because DRn access is 'difficult' with virt; but the DR7 read is cheaper
than a cacheline miss on native, add a virt specific fast path to
local_db_save(), such that when breakpoints are not in use to avoid
touching DRn entirely.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org
Convert #MC to IDTENTRY_MCE:
- Implement the C entry points with DEFINE_IDTENTRY_MCE
- Emit the ASM stub with DECLARE_IDTENTRY_MCE
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the error code from *machine_check_vector() as
it is always 0 and not used by any of the functions
it can point to. Fixup all the functions as well.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.334980426@linutronix.de
Syzbot reports the following issue:
WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
Call Trace:
...
RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067
'exception' we're trying to inject with kvm_inject_emulated_page_fault()
comes from:
nested_vmx_get_vmptr()
kvm_read_guest_virt()
kvm_read_guest_virt_helper()
vcpu->arch.walk_mmu->gva_to_gpa()
but it is only set when GVA to GPA conversion fails. In case it doesn't but
we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
'exception'. This happen when the argument is MMIO.
Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
KVM code where kvm_read/write_guest_virt*() return result is mishandled.
VMX instructions along with INVPCID have the same issue. This was already
noticed before, e.g. see commit 541ab2aeb2 ("KVM: x86: work around
leak of uninitialized stack contents") but was never fully fixed.
KVM could've handled the request correctly by going to userspace and
performing I/O but there doesn't seem to be a good need for such requests
in the first place.
Introduce vmx_handle_memory_failure() as an interim solution.
Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
to have a good enum describing this tristate, just add "int *ret" to
nested_vmx_get_vmptr() interface to pass the information.
Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both Intel and AMD support (MPK) Memory Protection Key feature.
Move the feature detection from VMX to the common code. It should
work for both the platforms now.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <158932795627.44260.15144185478040178638.stgit@naples-babu.amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally return true when querying the validity of
MSR_IA32_PERF_CAPABILITIES so as to defer the validity check to
intel_pmu_{get,set}_msr(), which can properly give the MSR a pass when
the access is initiated from host userspace. The MSR is emulated so
there is no underlying hardware dependency to worry about.
Fixes: 27461da310 ("KVM: x86/pmu: Support full width counting")
Cc: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200603203303.28545-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a nested VM with a VMX-preemption timer is migrated, verify that the
nested VM and its parent VM observe the VMX-preemption timer exit close to
the original expiration deadline.
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200526215107.205814-3-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add new field to hold preemption timer expiration deadline
appended to struct kvm_vmx_nested_state_hdr. This is to prevent
the first VM-Enter after migration from incorrectly restarting the timer
with the full timer value instead of partially decayed timer value.
KVM_SET_NESTED_STATE restarts timer using migrated state regardless
of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER.
Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Message-Id: <20200526215107.205814-2-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0)
for GP counters that allows writing the full counter width. Enable this
range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]).
The guest would query CPUID to get the counter width, and sign extends
the counter values as needed. The traditional MSRs always limit to 32bit,
even though the counter internally is larger (48 or 57 bits).
When the new capability is set, use the alternative range which do not
have these restrictions. This lowers the overhead of perf stat slightly
because it has to do less interrupts to accumulate the counter value.
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated
field from the struct could be used by get_msr. This also makes this API
consistent with kvm_pmu_set_msr. No functional changes.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Message-Id: <20200529074347.124619-2-like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, APF mechanism relies on the #PF abuse where the token is being
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extent the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.
While on it, rename 'reason' to 'flags'. This doesn't change the semantics
as we only have reasons '1' and '2' and these can be treated as bit flags
but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
making 'reason' name misleading.
The newly introduced apf_put_user_ready() temporary puts both flags and
token information, this will be changed to put token only when we switch
to interrupt based notifications.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
This issue was found with the help of Coccinelle.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Message-Id: <20200507185618.GA14831@embeddedor>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to VMX, the state that is captured through the currently available
IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was
running at the moment when the process was interrupted to save its state.
In particular, the SVM-specific state for nested virtualization includes
the L1 saved state (including the interrupt flag), the cached L2 controls,
and the GIF.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as
an optimization, but this is only correct before the nested vmentry.
If userspace is modifying CR3 with KVM_SET_SREGS after the VM has
already been put in guest mode, the value of CR3 will not be updated.
Remove the optimization, which almost never triggers anyway.
Fixes: 04f11ef458 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case an interrupt arrives after nested.check_events but before the
call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt
window even if the interrupt is actually going to be a vmexit. This is
useless rather than harmful, but it really complicates reasoning about
SVM's handling of the VINTR intercept. We'd like to never bother with
the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in
that case there is no interrupt window and we can just exit the nested
guest whenever we want.
This patch moves the opening of the interrupt window inside
inject_pending_event. This consolidates the check for pending
interrupt/NMI/SMI in one place, and makes KVM's usage of immediate
exits more consistent, extending it beyond just nested virtualization.
There are two functional changes here. They only affect corner cases,
but overall they simplify the inject_pending_event.
- re-injection of still-pending events will also use req_immediate_exit
instead of using interrupt-window intercepts. This should have no impact
on performance on Intel since it simply replaces an interrupt-window
or NMI-window exit for a preemption-timer exit. On AMD, which has no
equivalent of the preemption time, it may incur some overhead but an
actual effect on performance should only be visible in pathological cases.
- kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true
if an interrupt, NMI or SMI is blocked by nested_run_pending. This
makes sense because entering the VM will allow it to make progress
and deliver the event.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The second "/* fall through */" in rmode_exception() makes code harder to
read. Replace it with "return" to indicate they are different cases, only
the #DB and #BP check vcpu->guest_debug, while others don't care. And this
also improves the readability.
Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Message-Id: <1582080348-20827-1-git-send-email-linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>