Commit Graph

566 Commits

Author SHA1 Message Date
Sean Christopherson
c7d855c2af KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed
per the associated nested VMX MSRs' fixed0/1 settings.  KVM cannot rely
on hardware to perform the checks, even for the few checks that have
higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in
hardware while running the guest, (b) there may be incompatible CR0/CR4 bits
that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may
have further restricted the allowed CR0/CR4 values by manipulating the
guest's nested VMX MSRs.

Note, despite a very strong desire to throw shade at Jim, commit
70f3aac964 ("kvm: nVMX: Remove superfluous VMX instruction fault checks")
is not to blame for the buggy behavior (though the comment...).  That
commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks
(though it did erroneously drop the CPL check, but that has already been
remedied).  KVM may force CR0.PE=1, but will do so only when also
forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033
Fixes: ec378aeef9 ("KVM: nVMX: Implement VMXON and VMXOFF")
Reported-by: Eric Li <ercli@ucdavis.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220607213604.3346000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28 13:22:26 -04:00
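
A minimal sketch of the fixed0/fixed1 rule the commit above enforces; the
helper name and standalone form are illustrative, not KVM's actual code,
which also has to account for CR0/CR4 bits that KVM itself forces while
running the guest:

  #include <stdbool.h>
  #include <stdint.h>

  /* A CR value is acceptable per a fixed0/fixed1 MSR pair iff every bit
   * set in fixed0 is 1 in the value and every bit clear in fixed1 is 0. */
  static bool cr_fixed_bits_valid(uint64_t val, uint64_t fixed0, uint64_t fixed1)
  {
          return (val & fixed0) == fixed0 && (val & ~fixed1) == 0;
  }
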
Paolo Bonzini
cca3f3381b Merge commit 'kvm-vmx-nested-tsc-fix' into kvm-master
Merge bugfix needed in both 5.19 (because it's bad) and 5.20 (because
it is a prerequisite to test new features).
2022-07-14 10:04:44 -04:00
Vitaly Kuznetsov
9948272645 KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
scaling for its guests, even when the feature is advertised, and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220712135009.952805-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14 07:44:05 -04:00
Vitaly Kuznetsov
156b9d76e8 KVM: nVMX: Always enable TSC scaling for L2 when it was enabled for L1
Windows 10/11 guests with Hyper-V role (WSL2) enabled are observed to
hang upon boot or shortly after when a non-default TSC frequency was
set for L1. The issue is observed on a host where TSC scaling is
supported. The problem appears to be that Windows doesn't use TSC
scaling for its guests, even when the feature is advertised, and KVM
filters SECONDARY_EXEC_TSC_SCALING out when creating L2 controls from
L1's VMCS. This leads to L2 running with the default frequency (matching
host's) while L1 is running with an altered one.

Keep SECONDARY_EXEC_TSC_SCALING in secondary exec controls for L2 when
it was set for L1. TSC_MULTIPLIER is already correctly computed and
written by prepare_vmcs02().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: d041b5ea93 ("KVM: nVMX: Enable nested TSC scaling")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20220712135009.952805-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-07-12 12:06:20 -07:00
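
A rough sketch of the idea, not the actual KVM merge logic (the real code
builds vmcs02's secondary controls from several sources; the merge below is
simplified): the point is only that SECONDARY_EXEC_TSC_SCALING is retained
whenever vmcs01 had it, instead of being dropped because vmcs12 didn't
request it:

  #include <stdint.h>

  #define SECONDARY_EXEC_TSC_SCALING (1u << 25)    /* "use TSC scaling" */

  static uint32_t vmcs02_secondary_ctls(uint32_t vmcs01_ctls, uint32_t vmcs12_ctls)
  {
          uint32_t ctls = vmcs01_ctls & vmcs12_ctls;   /* simplified merge */

          /* The fix: keep TSC scaling if L1 (vmcs01) was using it. */
          if (vmcs01_ctls & SECONDARY_EXEC_TSC_SCALING)
                  ctls |= SECONDARY_EXEC_TSC_SCALING;
          return ctls;
  }
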
Josh Poimboeuf
bb06650634 KVM: VMX: Convert launched argument to flags
Convert __vmx_vcpu_run()'s 'launched' argument to 'flags', in
preparation for doing SPEC_CTRL handling immediately after vmexit, which
will need another flag.

This is much easier than adding a fourth argument, because this code
supports both 32-bit and 64-bit, and the fourth argument on 32-bit would
have to be pushed on the stack.

Note that __vmx_vcpu_run_flags() is called outside of the noinstr
critical section because it will soon start calling potentially
traceable functions.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-27 10:34:00 +02:00
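
A simplified sketch of the conversion described above; the flag names match
the series, but the function here is a stand-in for the real call site,
which passes the flags on to the assembly routine:

  #include <stdbool.h>

  #define VMX_RUN_VMRESUME        (1 << 0)   /* replaces the 'launched' bool */
  #define VMX_RUN_SAVE_SPEC_CTRL  (1 << 1)   /* wanted by the follow-up patch */

  static unsigned int vmx_run_flags(bool launched, bool save_spec_ctrl)
  {
          unsigned int flags = 0;

          if (launched)
                  flags |= VMX_RUN_VMRESUME;
          if (save_spec_ctrl)
                  flags |= VMX_RUN_SAVE_SPEC_CTRL;
          return flags;
  }
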
Sean Christopherson
fe1911aa44 KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access page
Use kvm_vcpu_map() to get/pin the backing for vmcs12's APIC-access page;
there's no reason it has to be restricted to 'struct page' backing.  The
APIC-access page actually doesn't need to be backed by anything, which is
ironically why it got left behind by the series which introduced
kvm_vcpu_map()[1]; the plan was to shove a dummy pfn into vmcs02[2], but
that code never got merged.

Switching the APIC-access page to kvm_vcpu_map() doesn't preclude using a
magic pfn in the future, and will allow a future patch to drop
kvm_vcpu_gpa_to_page().

[1] https://lore.kernel.org/all/1547026933-31226-1-git-send-email-karahmed@amazon.de
[2] https://lore.kernel.org/lkml/1543845551-4403-1-git-send-email-karahmed@amazon.de

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220429010416.2788472-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:33 -04:00
Sean Christopherson
913d6c9b8f KVM: nVMX: Update vmcs12 on BNDCFGS write, not at vmcs02=>vmcs12 sync
Update vmcs12->guest_bndcfgs on intercepted writes to BNDCFGS from L2
instead of waiting until vmcs02 is synchronized to vmcs12.  KVM always
intercepts BNDCFGS accesses, so the only way the value in vmcs02 can
change is via KVM's explicit VMWRITE during emulation.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:21 -04:00
Sean Christopherson
308a4fffeb KVM: nVMX: Save BNDCFGS to vmcs12 iff relevant controls are exposed to L1
Save BNDCFGS to vmcs12 (from vmcs02) if and only if at least one of
the load-on-entry or clear-on-exit fields for BNDCFGS is enumerated as an
allowed-1 bit in vmcs12.  Skipping the field avoids an unnecessary VMREAD
when MPX is supported but not exposed to L1.

Per Intel's SDM:

  If the processor supports either the 1-setting of the "load IA32_BNDCFGS"
  VM-entry control or that of the "clear IA32_BNDCFGS" VM-exit control, the
  contents of the IA32_BNDCFGS MSR are saved into the corresponding field.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:20 -04:00
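
A sketch of the SDM rule quoted above as a standalone predicate; the control
bit positions are per the SDM, while the helper name and parameters are
illustrative rather than KVM's actual expression:

  #include <stdbool.h>
  #include <stdint.h>

  #define VM_ENTRY_LOAD_BNDCFGS   (1u << 16)   /* "load IA32_BNDCFGS" */
  #define VM_EXIT_CLEAR_BNDCFGS   (1u << 23)   /* "clear IA32_BNDCFGS" */

  /* Save BNDCFGS into vmcs12 only if at least one of the two controls is
   * an allowed-1 bit for L1, i.e. only if MPX controls are exposed to L1. */
  static bool should_save_bndcfgs(uint32_t entry_ctls_allowed1,
                                  uint32_t exit_ctls_allowed1)
  {
          return (entry_ctls_allowed1 & VM_ENTRY_LOAD_BNDCFGS) ||
                 (exit_ctls_allowed1 & VM_EXIT_CLEAR_BNDCFGS);
  }
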
Sean Christopherson
5d76b1f8c7 KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_*
Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter
values to reflect that they can hold L2's values when restoring nested
state, e.g. if userspace restores MSRs before nested state.  As crazy as
it seems, restoring MSRs before nested state actually works (because KVM
goes out of its way to make it work), even though the initial MSR writes
will hit vmcs01 despite holding L2 values.

Add a related comment to vmx_enter_smm() to call out that using the
common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and
broken.  The few MSRs that have snapshots _could_ be fixed by taking a
snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but
that's just the tip of the iceberg as the rather long list of MSRs that
aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this
way.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:20 -04:00
Sean Christopherson
764643a6be KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending case
If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL
irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in
vmcs12.  When restoring nested state, e.g. after migration, without a
nested run pending, prepare_vmcs02() will propagate
nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into
vmcs02.GUEST_IA32_DEBUGCTL.

If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading DEBUGCTL will also update vmcs02.  But if userspace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.

Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported.  Sadly, some
VMMs do exactly that and rely on KVM to make things work.

Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL
to vmcs02 across RSM may corrupt L2's DEBUGCTL.  But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.

Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
Fixes: 8fcc4b5923 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:19 -04:00
Sean Christopherson
fa578398a0 KVM: nVMX: Snapshot pre-VM-Enter BNDCFGS for !nested_run_pending case
If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective
of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12.  When restoring
nested state, e.g. after migration, without a nested run pending,
prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02,
i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS.

If userspace restores nested state before MSRs, then loading garbage is a
non-issue as loading BNDCFGS will also update vmcs02.  But if userspace
restores MSRs first, then KVM is responsible for propagating L2's value,
which is actually thrown into vmcs01, into vmcs02.

Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state
is all kinds of bizarre and ideally would not be supported.  Sadly, some
VMMs do exactly that and rely on KVM to make things work.

Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDCFGS
to vmcs02 across RSM may corrupt L2's BNDCFGS.  But KVM's entire VMX+SMM
emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the
"default treatment of SMIs", i.e. when not using an SMI Transfer Monitor.

Link: https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
Fixes: 62cf9bd811 ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS")
Cc: stable@vger.kernel.org
Cc: Lei Wang <lei4.wang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614215831.3762138-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20 06:21:18 -04:00
Tao Xu
2f4073e08f KVM: VMX: Enable Notify VM exit
There are cases where a malicious virtual machine can get the CPU stuck
(because event windows never open up), e.g., an infinite loop in microcode
when handling a nested #AC (CVE-2015-5307). No event window means no event
(NMI, SMI or IRQ) can be delivered, which leaves the CPU unavailable to the
host and to other VMs.

The VMM can enable a notify VM exit, which generates a VM exit if no event
window occurs in VMX non-root mode for a specified amount of time (the
notify window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
  enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
  the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
  can query and enable this feature in per-VM scope. The argument is a
  64-bit value: bits 63:32 are used for the notify window, and bits 31:0 are
  for flags. Current supported flags:
  - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
    window provided.
  - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
  threshold is added to vmcs.notify_window.

VM exit handling:
- Introduce a vcpu state notify_window_exits to record the count of
  notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
  Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
  bit is set.

Nested handling:
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as vmcs01, so that L1 can't escape the
  restriction of notify VM exits through launching L2 VM.

Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:56:24 -04:00
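
For illustration, enabling the capability from userspace might look like the
sketch below (untested; it assumes a kernel that exports
KVM_CAP_X86_NOTIFY_VMEXIT and the two flags named in the commit message):

  #include <linux/kvm.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  static int enable_notify_vmexit(int vm_fd, uint32_t notify_window)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_X86_NOTIFY_VMEXIT,
                  /* bits 63:32 = notify window, bits 31:0 = flags */
                  .args[0] = ((uint64_t)notify_window << 32) |
                             KVM_X86_NOTIFY_VMEXIT_ENABLED |
                             KVM_X86_NOTIFY_VMEXIT_USER,
          };

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);   /* per-VM scope */
  }
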
Sean Christopherson
938c8745bc KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings
Add kvm_caps to hold a variety of capabilities and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature.  The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 05:21:16 -04:00
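
The rough shape of the idea, with illustrative field names rather than the
full set the patch actually moves:

  #include <stdbool.h>
  #include <stdint.h>

  /* One __read_mostly struct, declared extern in x86.h and exported, instead
   * of a pile of individually exported variables. */
  struct kvm_caps {
          bool     has_tsc_control;           /* filled in by vendor init */
          uint64_t max_tsc_scaling_ratio;
          uint64_t default_tsc_scaling_ratio;
  };

  extern struct kvm_caps kvm_caps;            /* defined once in x86.c */
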
Paolo Bonzini
47e8eec832 KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
 
 - Guard pages for the EL2 stacks
 
 - Trap and emulate AArch32 ID registers to hide unsupported features
 
 - Ability to select and save/restore the set of hypercalls exposed
   to the guest
 
 - Support for PSCI-initiated suspend in collaboration with userspace
 
 - GICv3 register-based LPI invalidation support
 
 - Move host PMU event merging into the vcpu data structure
 
 - GICv3 ITS save/restore fixes
 
 - The usual set of small-scale cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmKGAGsPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDB/gQAMhyZ+wCG0OMEZhwFF6iDfxVEX2Kw8L41NtD
 a/e6LDWuIOGihItpRkYROc5myG74D7XckF2Bz3G7HJoU4vhwHOV/XulE26GFizoC
 O1GVRekeSUY81wgS1yfo0jojLupBkTjiq3SjTHoDP7GmCM0qDPBtA0QlMRzd2bMs
 Kx0+UUXZUHFSTXc7Lp4vqNH+tMp7se+yRx7hxm6PCM5zG+XYJjLxnsZ0qpchObgU
 7f6YFojsLUs1SexgiUqJ1RChVQ+FkgICh5HyzORvGtHNNzK6D2sIbsW6nqMGAMql
 Kr3A5O/VOkCztSYnLxaa76/HqD21mvUrXvr3grhabNc7rOmuzWV0dDgr6c6wHKHb
 uNCtH4d7Ra06gUrEOrfsgLOLn0Zqik89y6aIlMsnTudMg9gMNgFHy1jz4LM7vMkY
 FS5AVj059heg2uJcfgTvzzcqneyuBLBmF3dS4coowO6oaj8SycpaEmP5e89zkPMI
 1kk8d0e6RmXuCh/2AJ8GxxnKvBPgqp2mMKXOCJ8j4AmHEDX/CKpEBBqIWLKkplUU
 8DGiOWJUtRZJg398dUeIpiVLoXJthMODjAnkKkuhiFcQbXomlwgg7YSnNAz6TRED
 Z7KR2leC247kapHnnagf02q2wED8pBeyrxbQPNdrHtSJ9Usm4nTkY443HgVTJW3s
 aTwPZAQ7
 =mh7W
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for 5.19

- Add support for the ARMv8.6 WFxT extension

- Guard pages for the EL2 stacks

- Trap and emulate AArch32 ID registers to hide unsupported features

- Ability to select and save/restore the set of hypercalls exposed
  to the guest

- Support for PSCI-initiated suspend in collaboration with userspace

- GICv3 register-based LPI invalidation support

- Move host PMU event merging into the vcpu data structure

- GICv3 ITS save/restore fixes

- The usual set of small-scale cleanups and fixes

[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
 from 4 to 6. - Paolo]
2022-05-25 05:09:23 -04:00
Sean Christopherson
6819af7597 KVM: x86: Clean up and document nested #PF workaround
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
with an explicit, common workaround in kvm_inject_emulated_page_fault().
Aside from being a hack, the current approach is brittle and incomplete,
e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
and nVMX fails to apply the workaround when VMX is intercepting #PF due
to allow_smaller_maxphyaddr=1.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29 12:49:49 -04:00
Sean Christopherson
7c69661e22 KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever update
vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter with APICv inhibited, and APICv
becomes uninhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.

In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).

IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.

Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.

Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form.  But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs.  If userspace restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220420013732.3308816-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21 13:16:12 -04:00
Sean Christopherson
9bd1f0efa8 KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault
Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double
or triple fault.  Per the SDM, a VM-Exit isn't considered to occur during
event delivery if the exit is due to an intercepted double fault or a
triple fault.  Opportunistically move the default clearing (no event
"pending") into the helper so that it's more obvious that KVM does indeed
handle this case.

Note, the double fault case is worded rather weirdly in the SDM:

  The original event results in a double-fault exception that causes the
  VM exit directly.

Temporarily ignoring injected events, double faults can _only_ occur if
an exception occurs while attempting to deliver a different exception,
i.e. there's _always_ an original event.  And for an injected double fault,
while there's no original event, injected events are never subject to
interception.

Presumably the SDM is calling out that the vectoring info will be valid
if a different exit occurs after a double fault, e.g. if a #PF occurs and
is intercepted while vectoring #DF, then the vectoring info will show the
double fault.  In other words, the clause can simply be read as:

  The VM exit is caused by a double-fault exception.

Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Cc: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:46 -04:00
Sean Christopherson
c3634d25fb KVM: nVMX: Leave most VM-Exit info fields unmodified on failed VM-Entry
Don't modify vmcs12 exit fields except EXIT_REASON and EXIT_QUALIFICATION
when performing a nested VM-Exit due to failed VM-Entry.  Per the SDM,
only the two aforementioned fields are filled and "All other VM-exit
information fields are unmodified".

Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:46 -04:00
Sean Christopherson
45846661d1 KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2
Remove WARNs that sanity check that KVM never lets a triple fault for L2
escape and incorrectly end up in L1.  In normal operation, the sanity
check is perfectly valid, but it incorrectly assumes that it's impossible
for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
the triple fault).

The WARN can currently be triggered if userspace injects a machine check
while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
lost on migration, will make it trivially easy for userspace to trigger
the WARN.

Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
tempting, but wrong, especially if/when the request is saved/restored,
e.g. if userspace restores events (including a triple fault) and then
restores nested state (which may forcibly leave guest mode).  Ignoring
the fact that KVM doesn't currently provide the necessary APIs, it's
userspace's responsibility to manage pending events during save/restore.

  ------------[ cut here ]------------
  WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
  Call Trace:
   <TASK>
   vmx_leave_nested+0x30/0x40 [kvm_intel]
   vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
   kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
   kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---

Fixes: cb6a32c2b8 ("KVM: x86: Handle triple fault in L2 without killing L1")
Cc: stable@vger.kernel.org
Cc: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220407002315.78092-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13 13:37:45 -04:00
Paolo Bonzini
0564eeb71b Merge branch 'kvm-bugfixes' into HEAD
Merge bugfixes from 5.17 before merging more tricky work.
2022-03-04 18:39:29 -05:00
Paolo Bonzini
3cffc89d9d KVM: x86/mmu: load new PGD after the shadow MMU is initialized
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
shadow_root_level anymore, pull the PGD load after the initialization of
the shadow MMUs.

Besides being more intuitive, this enables future simplifications
and optimizations because it's not necessary anymore to compute the
role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
attempting to use a cached PGD to avoid having to figure out the new role.
With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
and avoid unloading all the cached roots.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 08:20:18 -05:00
Paolo Bonzini
0c1c92f15f KVM: x86/mmu: do not pass vcpu to root freeing functions
These functions only operate on a given MMU, of which there is more
than one in a vCPU (we care about two, because the third does not have
any roots and is only used to walk guest page tables).  They do need a
struct kvm in order to lock the mmu_lock, but they do not need anything
else in the struct kvm_vcpu.  So, pass the vcpu->kvm directly to them.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 08:20:17 -05:00
Paolo Bonzini
b9e5603c2a KVM: x86: use struct kvm_mmu_root_info for mmu->root
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
Use the struct to have more consistency between mmu->root and
mmu->prev_roots.

The patch is entirely search and replace except for cached_root_available,
which does not need a temporary struct kvm_mmu_root_info anymore.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 08:20:16 -05:00
Sean Christopherson
1a71581012 Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
out, as the kernel will switch CR3 during __text_poke(), e.g. in response
to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
the mm associated with KVM, KVM will do VM-Enter => VM-Exit with a stale
vmcs.HOST_CR3.

Add a comment to explain why KVM must wait until VM-Enter is imminent to
refresh vmcs.HOST_CR3.

The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
loaded for the KVM-associated mm while KVM has a "running" vCPU:

  static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
  {
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	...

	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
  }

  ------------[ cut here ]------------
  KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
  WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
  CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:load_new_mm_cr3+0x82/0xe0
  RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
  RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
  RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
  R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
  R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
  FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   switch_mm_irqs_off+0x1cb/0x460
   __text_poke+0x308/0x3e0
   text_poke_bp_batch+0x168/0x220
   text_poke_finish+0x1b/0x30
   arch_jump_label_transform_apply+0x18/0x30
   static_key_slow_inc_cpuslocked+0x7c/0x90
   static_key_slow_inc+0x16/0x20
   kvm_lapic_set_base+0x116/0x190
   kvm_set_apic_base+0xa5/0xe0
   kvm_set_msr_common+0x2f4/0xf60
   vmx_set_msr+0x355/0xe70 [kvm_intel]
   kvm_set_msr_ignored_check+0x91/0x230
   kvm_emulate_wrmsr+0x36/0x120
   vmx_handle_exit+0x609/0x6c0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
   kvm_vcpu_ioctl+0x279/0x690
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---

This reverts commit 15ad9762d6.

Fixes: 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <20220224191917.3508476-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 04:02:18 -05:00
Sean Christopherson
bca06b85fc Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
Undo a nested VMX fix as a step toward reverting the commit it fixed,
15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
as the underlying premise that "host CR3 in the vcpu thread can only be
changed when scheduling" is wrong.

This reverts commit a9f2705ec8.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220224191917.3508476-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 04:02:18 -05:00
Sean Christopherson
0bcd556e15 KVM: nVMX: Refactor PMU refresh to avoid referencing kvm_x86_ops.pmu_ops
Refactor the nested VMX PMU refresh helper to pass it a flag stating
whether or not the vCPU has PERF_GLOBAL_CTRL instead of having the nVMX
helper query the information by bouncing through kvm_x86_ops.pmu_ops.
This will allow a future patch to use static_call() for the PMU ops
without having to export any static call definitions from common x86, and
it is also a step toward unexported kvm_x86_ops.

Alternatively, nVMX could call kvm_pmu_is_valid_msr() to indirectly use
kvm_x86_ops.pmu_ops, but that would incur an extra layer of indirection
and would require exporting kvm_pmu_is_valid_msr().

Opportunistically rename the helper to keep line lengths somewhat
reasonable, and to better capture its high-level role.

No functional change intended.

Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220128005208.4008533-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:47:19 -05:00
Vitaly Kuznetsov
6cbbaab60f KVM: nVMX: Allow VMREAD when Enlightened VMCS is in use
Hyper-V TLFS explicitly forbids VMREAD and VMWRITE instructions when
Enlightened VMCS interface is in use:

"Any VMREAD or VMWRITE instructions while an enlightened VMCS is
active is unsupported and can result in unexpected behavior."

Windows 11 + WSL2 seems to ignore this; attempts to VMREAD VMCS field
0x4404 ("VM-exit interruption information") are observed. Failing
these attempts with nested_vmx_failInvalid() makes such guests
unbootable.

Microsoft confirms this is a Hyper-V bug and claims that it'll get fixed
eventually, but for the time being we need a workaround. Temporarily allow
VMREAD to get data from the currently loaded Enlightened VMCS.

Note: VMWRITE instructions remain forbidden; it is not clear how to
handle them properly, and hopefully that won't ever be needed.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-28 07:38:26 -05:00
Vitaly Kuznetsov
2423a4c0d1 KVM: nVMX: Rename vmcs_to_field_offset{,_table}
vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque
blob which is not supposed to be accessed directly. In fact,
vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure.

Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity.

No functional change intended.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-28 07:38:26 -05:00
Sean Christopherson
d6e656cd26 KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02
WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
a "real" shadow VMCS for L2.

The previous code WARNed but continued anyway with the allocation,
presumably in an attempt to avoid NULL pointer dereference.
However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and
indeed the sole caller does:

	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
		goto out_shadow_vmcs;

which makes it not a useful attempt.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220125220527.2093146-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-26 12:15:04 -05:00
Sean Christopherson
f7e570780e KVM: x86: Forcibly leave nested virt when SMM state is toggled
Forcibly leave nested virtualization operation if userspace toggles SMM
state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
vmxon=false and smm.vmxon=false, but all other nVMX state allocated.

Don't attempt to gracefully handle the transition as (a) most transitions
are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
sufficient information to handle all transitions, e.g. SVM wants access
to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
KVM_SET_NESTED_STATE during state restore as the latter disallows putting
the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
being post-VMXON in SMM if SMM is not active.

Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
in an architecturally impossible state.

  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Modules linked in:
  CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
  RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
  Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
  Call Trace:
   <TASK>
   kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
   kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
   kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
   kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
   kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
   kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
   kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
   kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
   __fput+0x286/0x9f0 fs/file_table.c:311
   task_work_run+0xdd/0x1a0 kernel/task_work.c:164
   exit_task_work include/linux/task_work.h:32 [inline]
   do_exit+0xb29/0x2a30 kernel/exit.c:806
   do_group_exit+0xd2/0x2f0 kernel/exit.c:935
   get_signal+0x4b0/0x28c0 kernel/signal.c:2862
   arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
   handle_signal_work kernel/entry/common.c:148 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
   exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
   __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
   syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
   do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

Cc: stable@vger.kernel.org
Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220125220358.2091737-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-26 12:15:03 -05:00
Eric Hankland
018d70ffcf KVM: x86: Update vPMCs when retiring branch instructions
When KVM retires a guest branch instruction through emulation,
increment any vPMCs that are configured to monitor "branch
instructions retired," and update the sample period of those counters
so that they will overflow at the right time.

Signed-off-by: Eric Hankland <ehankland@google.com>
[jmattson:
  - Split the code to increment "branch instructions retired" into a
    separate commit.
  - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
    emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
    that code.
]
Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211130074221.93635-7-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:43 -05:00
Lai Jiangshan
a9f2705ec8 KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()
The host CR3 in the vcpu thread can only be changed when scheduling,
so commit 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
changed vmx.c to only save it in vmx_prepare_switch_to_guest().

However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
vmx_set_host_fs_gs() is called in both places, so rename it to
vmx_set_vmcs_host_state() and make it update HOST_CR3.

Fixes: 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-07 10:44:39 -05:00
Vitaly Kuznetsov
502d2bf5f2 KVM: nVMX: Implement Enlightened MSR Bitmap feature
Updating the MSR bitmap for L2 is not cheap and rarely needed. The TLFS for Hyper-V
offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to
inform L0 when it changes MSR bitmap, this eliminates the need to examine
L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets
constructed.

Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature.

Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a
nested hypervisor on top of Hyper-V. The newly introduced feature is going
to be used by Hyper-V guests on KVM.

When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU
cycles from a nested vmexit cost (tight cpuid loop test).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:05 -05:00
Vitaly Kuznetsov
ed2a4800ae KVM: nVMX: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt
Introduce a flag to keep track of whether MSR bitmap for L2 needs to be
rebuilt due to changes in MSR bitmap for L1 or switching to a different
L2. This information will be used for Enlightened MSR Bitmap feature for
Hyper-V guests.

Note, setting msr_bitmap_changed to 'true' from set_current_vmptr() is
not really needed for Enlightened MSR Bitmap as the feature can only
be used in conjunction with Enlightened VMCS, but let's keep the tracking
information complete; it's cheap, and in the future a similar PV feature can
easily be implemented for KVM on KVM too.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211129094704.326635-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 09:06:04 -05:00
Lai Jiangshan
2df4a5eb6c KVM: X86: Remove mmu parameter from load_pdptrs()
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs,
and nested NPT treats them like all other non-leaf page table levels
instead of caching them.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:14 -05:00
Lai Jiangshan
cc022ae144 KVM: X86: Add parameter huge_page_level to kvm_init_shadow_ept_mmu()
The level of supported large page on nEPT affects the rsvds_bits_mask.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-8-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:12 -05:00
Lai Jiangshan
15ad9762d6 KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()
The host CR3 in the vcpu thread can only be changed when scheduling.
Moving the code into vmx_prepare_switch_to_guest() makes the code
simpler.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211118110814.2568-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:06 -05:00
Paolo Bonzini
41e68b6964 KVM: vmx, svm: clean up mass updates to regs_avail/regs_dirty bits
Document the meaning of the three combinations of regs_avail and
regs_dirty.  Update regs_dirty just after writeback instead of
doing it later after vmexit.  After vmexit, instead, we clear the
regs_avail bits corresponding to lazily-loaded registers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:03 -05:00
Lai Jiangshan
3883bc9d28 KVM: X86: Mark CR3 dirty when vcpu->arch.cr3 is changed
When vcpu->arch.cr3 is changed, it should be marked dirty unless it
is being updated to the value of the architecture guest CR3 (i.e.
VMX.GUEST_CR3 or vmcb->save.cr3 when tdp is enabled).

This patch has no functional change because
kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3) is a superset of
kvm_register_mark_available(vcpu, VCPU_EXREG_CR3) with an additional
change to vcpu->arch.regs_dirty, and no code uses regs_dirty for
VCPU_EXREG_CR3.  (vmx_load_mmu_pgd() uses vcpu->arch.regs_avail instead
to test whether VCPU_EXREG_CR3 is dirty, which means the current code
(ab)uses regs_avail for VCPU_EXREG_CR3 dirty information.)

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211108124407.12187-11-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:25:02 -05:00
Sean Christopherson
1460179dcd KVM: x86: Tweak halt emulation helper names to free up kvm_vcpu_halt()
Rename a variety of HLT-related helpers to free up the function name
"kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate
between "block" and "halt".

No functional change intended.

Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009021236.4122790-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:50 -05:00
Dan Carpenter
bfbb307c62 KVM: VMX: Set failure code in prepare_vmcs02()
The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not.  It leads to using an
uninitialized variable in the caller.

Fixes: 71f7347025 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:11 -05:00
Sean Christopherson
712494de96 KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12
Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02.  From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12.  Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.

For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:11:29 -05:00
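
A standalone model of the rule described above (stand-in types; the real
logic lives in KVM's nested VM-Enter TLB-flush handling):

  #include <stdbool.h>
  #include <stdint.h>

  struct nested_state { uint16_t last_vpid; };

  /* KVM tracks only one vpid12, so any change must be treated as if L1
   * switched to a never-before-seen VPID, i.e. as a guest TLB flush. */
  static bool vmenter_needs_guest_tlb_flush(struct nested_state *n, uint16_t vpid12)
  {
          if (vpid12 == n->last_vpid)
                  return false;
          n->last_vpid = vpid12;
          return true;
  }
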
Sean Christopherson
40e5f90804 KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2.  E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectively be lost.

Opportunistically add a helper to handle CURRENT and GUEST as a pair; the
logic for when they need to be serviced is identical, as both requests are
tied to L1 vs. L2, and the only difference is the scope of the flush.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:07:49 -05:00
Paolo Bonzini
8503fea676 KVM: VMX: do not use uninitialized gfn_to_hva_cache
An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
the accessors to croak very loudly.  While a BUG_ON is definitely
_too_ loud and a bug on its own, there is indeed an issue of using
the caches in such a way that they could not have been initialized,
because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
would not be called.

For the vmcs12_cache, the solution is simply to invoke
kvm_gfn_to_hva_cache_init unconditionally: we already know
that the cache does not match the current VMCS pointer.
For the shadow_vmcs12_cache, there is no similar condition
that checks the VMCS link pointer, so invalidate the cache
on VMXON.

Fixes: cee66664dc ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:28 -05:00
David Woodhouse
cee66664dc KVM: nVMX: Use a gfn_to_hva_cache for vmptrld
And thus another call to kvm_vcpu_map() can die.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-7-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 02:03:43 -05:00
David Woodhouse
7d0172b3ca KVM: nVMX: Use kvm_read_guest_offset_cached() for nested VMCS check
Kill another mostly gratuitous kvm_vcpu_map() which could just use the
userspace HVA for it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 02:03:43 -05:00
David Woodhouse
297d597a6d KVM: nVMX: Use kvm_{read,write}_guest_cached() for shadow_vmcs12
Using kvm_vcpu_map() for reading from the guest is entirely gratuitous,
when all we do is a single memcpy and unmap it again. Fix it up to use
kvm_read_guest()... but in fact I couldn't bring myself to do that
without also making it use a gfn_to_hva_cache for both that *and* the
copy in the other direction.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 02:03:42 -05:00
Maxim Levitsky
af957eebfc KVM: nVMX: don't use vcpu->arch.efer when checking host state on nested state load
When loading nested state, don't check vcpu->arch.efer to get the
L1 host's 64-bit vs. 32-bit state and don't check it for consistency
with respect to VM_EXIT_HOST_ADDR_SPACE_SIZE, as the register state in the
vCPU may be stale when KVM_SET_NESTED_STATE is called---and architecturally
does not exist.  When restoring L2 state in KVM, the CPU is placed in
non-root mode, where nested VMX code has no snapshot of L1 host state: VMX
(conditionally) loads host state fields loaded on VM-exit, but they need
not correspond to the state before entry.  A simple case occurs in KVM
itself, where the host RIP field points to vmx_vmexit rather than the
instruction following vmlaunch/vmresume.

However, for the particular case of L1 being in 32- or 64-bit mode
on entry, the exit controls can be treated instead as the source of
truth regarding the state of L1 on entry, and can be used to check
that vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE matches vmcs12.HOST_EFER if
vmcs12.VM_EXIT_LOAD_IA32_EFER is set.  The consistency check on CPU
EFER vs. vmcs12.VM_EXIT_HOST_ADDR_SPACE_SIZE, instead, happens only
on VM-Enter.  That's because, again, there's conceptually no "current"
L1 EFER to check on KVM_SET_NESTED_STATE.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211115131837.195527-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 02:03:42 -05:00
Vipin Sharma
329bd56ce5 KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT
handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2
field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that
holds the invalidation type. Add a helper to retrieve reg2 from VMX
instruction info to consolidate and document the shift+mask magic.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109174426.2350547-2-vipinsh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 10:56:24 -05:00
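
The shift+mask in question, per the SDM's VMX instruction-information layout
(the helper name below is illustrative):

  #include <stdint.h>

  /* "Reg 2" lives in bits 31:28 of the VMX instruction info field. */
  static inline int vmx_instr_info_reg2(uint32_t vmx_instr_info)
  {
          return (vmx_instr_info >> 28) & 0xf;
  }
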
Sean Christopherson
a5e0c25284 KVM: nVMX: Clean up x2APIC MSR handling for L2
Clean up the x2APIC MSR bitmap interception code for L2, which is the last
holdout of open coded bitmap manipulations.  Freshen up the SDM/PRM
comment, rename the function to make it abundantly clear the funky
behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored
(the previous comment was flat out wrong for x2APIC behavior).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109013047.2041518-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 10:56:23 -05:00
Sean Christopherson
67f4b9969c KVM: nVMX: Handle dynamic MSR intercept toggling
Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2,
and always update the relevant bits in vmcs02.  This fixes two distinct,
but intertwined bugs related to dynamic MSR bitmap modifications.

The first issue is that KVM fails to enable MSR interception in vmcs02
for the FS/GS base MSRs if L1 first runs L2 with interception disabled,
and later enables interception.

The second issue is that KVM fails to honor userspace MSR filtering when
preparing vmcs02.

Fix both issues simultaneously, as fixing only one of the issues (doesn't
matter which) would create a mess that no one should have to bisect.
Fixing only the first bug would exacerbate the MSR filtering issue as
userspace would see inconsistent behavior depending on the whims of L1.
Fixing only the second bug (MSR filtering) effectively requires fixing
the first, as the nVMX code only knows how to transition vmcs02's
bitmap from 1->0.

Move the various accessors/mutators that are currently buried in vmx.c
into vmx.h so that they can be shared by the nested code.

Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Fixes: d69129b4e4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109013047.2041518-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-11 10:56:23 -05:00
David Edmondson
0a62a0319a KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
Extend the get_exit_info static call to provide the reason for the VM
exit. Modify relevant trace points to use this rather than extracting
the reason in the caller.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 06:48:24 -04:00
Vitaly Kuznetsov
feb3162f9d KVM: nVMX: Reset vmxon_ptr upon VMXOFF emulation.
Currently, 'vmx->nested.vmxon_ptr' is not reset upon VMXOFF
emulation. This is not a problem per se as we never access
it when !vmx->nested.vmxon. But this should be done to avoid
any issue in the future.

Also, initialize the vmxon_ptr when vcpu is created.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-3-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:02 -04:00
Yu Zhang
64c785082c KVM: nVMX: Use INVALID_GPA for pointers used in nVMX.
Clean up nested.c and vmx.c by using INVALID_GPA instead of "-1ull",
to denote an invalid address in nested VMX. Affected addresses are
those of the VMXON region, current VMCS, VMCS link pointer, virtual-
APIC page, ENCLS-exiting bitmap, IO bitmap, etc.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-2-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:02 -04:00
Maxim Levitsky
dbab610a5b KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
If L1 had invalid state on VM entry (which can happen on SMM transitions
when we enter from real mode straight into a nested guest), then after we
load 'host' state from VMCS12 the state has to become valid again, but
since we load the segment registers with __vmx_set_segment we weren't
always updating emulation_required.

Update emulation_required explicitly at end of load_vmcs12_host_state.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Maxim Levitsky
c8607e4a08 KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
It is possible that when non-root mode is entered via a special entry
(!from_vmentry), that is from SMM or from loading the nested state,
the L2 state could be invalid with regard to non-unrestricted guest mode,
but later it can become valid.

(for example when RSM emulation restores segment registers from SMRAM)

Thus delay the check to VM entry, where we will check this and fail.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Chenyi Qiang
24a996ade3 KVM: nVMX: Fix nested bus lock VM exit
Nested bus lock VM exits are not supported yet. If L2 triggers a bus lock
VM exit, it will be directed to the L1 VMM, which would cause unexpected
behavior. Therefore, handle L2's bus lock VM exits in L0 directly.

Fixes: fe6b6bc802 ("KVM: VMX: Enable bus lock VM exit")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210914095041.29764-1-chenyi.qiang@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:12 -04:00
Yu Zhang
ed7023a11b KVM: nVMX: fix comments of handle_vmon()
"VMXON pointer" is saved in vmx->nested.vmxon_ptr since
commit 3573e22cfe ("KVM: nVMX: additional checks on
vmxon region"). Also, handle_vmptrld() & handle_vmclear()
now have logic to check the VMCS pointer against the VMXON
pointer.

So just remove the obsolete comments of handle_vmon().

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210908171731.18885-1-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:09 -04:00
Linus Torvalds
192ad3c27a ARM:
- Page ownership tracking between host EL1 and EL2
 
 - Rely on userspace page tables to create large stage-2 mappings
 
 - Fix incompatibility between pKVM and kmemleak
 
 - Fix the PMU reset state, and improve the performance of the virtual PMU
 
 - Move over to the generic KVM entry code
 
 - Address PSCI reset issues w.r.t. save/restore
 
 - Preliminary rework for the upcoming pKVM fixed feature
 
 - A bunch of MM cleanups
 
 - a vGIC fix for timer spurious interrupts
 
 - Various cleanups
 
 s390:
 
 - enable interpretation of specification exceptions
 
 - fix a vcpu_idx vs vcpu_id mixup
 
 x86:
 
 - fast (lockless) page fault support for the new MMU
 
 - new MMU now the default
 
 - increased maximum allowed VCPU count
 
 - allow inhibit IRQs on KVM_RUN while debugging guests
 
 - let Hyper-V-enabled guests run with virtualized LAPIC as long as they
   do not enable the Hyper-V "AutoEOI" feature
 
 - fixes and optimizations for the toggling of AMD AVIC (virtualized LAPIC)
 
 - tuning for the case when two-dimensional paging (EPT/NPT) is disabled
 
 - bugfixes and cleanups, especially with respect to 1) vCPU reset and
   2) choosing a paging mode based on CR0/CR4/EFER
 
 - support for 5-level page table on AMD processors
 
 Generic:
 
 - MMU notifier invalidation callbacks do not take mmu_lock unless necessary
 
 - improved caching of LRU kvm_memory_slot
 
 - support for histogram statistics
 
 - add statistics for halt polling and remote TLB flush requests
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmE2CIAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMyqwf+Ky2WoThuQ9Ra0r/m8pUTAx5+gsAf
 MmG24rNLE+26X0xuBT9Q5+etYYRLrRTWJvo5cgHooz7muAYW6scR+ho5xzvLTAxi
 DAuoijkXsSdGoFCp0OMUHiwG3cgY5N7feTEwLPAb2i6xr/l6SZyCP4zcwiiQbJ2s
 UUD0i3rEoNQ02/hOEveud/ENxzUli9cmmgHKXR3kNgsJClSf1fcuLnhg+7EGMhK9
 +c2V+hde5y0gmEairQWm22MLMRolNZ5NL4kjykiNh2M5q9YvbHe5+f/JmENlNZMT
 bsUQT6Ry1ukuJ0V59rZvUw71KknPFzZ3d6HgW4pwytMq6EJKiISHzRbVnQ==
 =FCAB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - Page ownership tracking between host EL1 and EL2
   - Rely on userspace page tables to create large stage-2 mappings
   - Fix incompatibility between pKVM and kmemleak
   - Fix the PMU reset state, and improve the performance of the virtual
     PMU
   - Move over to the generic KVM entry code
   - Address PSCI reset issues w.r.t. save/restore
   - Preliminary rework for the upcoming pKVM fixed feature
   - A bunch of MM cleanups
   - a vGIC fix for timer spurious interrupts
   - Various cleanups

  s390:
   - enable interpretation of specification exceptions
   - fix a vcpu_idx vs vcpu_id mixup

  x86:
   - fast (lockless) page fault support for the new MMU
   - new MMU now the default
   - increased maximum allowed VCPU count
   - allow inhibit IRQs on KVM_RUN while debugging guests
   - let Hyper-V-enabled guests run with virtualized LAPIC as long as
     they do not enable the Hyper-V "AutoEOI" feature
   - fixes and optimizations for the toggling of AMD AVIC (virtualized
     LAPIC)
   - tuning for the case when two-dimensional paging (EPT/NPT) is
     disabled
   - bugfixes and cleanups, especially with respect to vCPU reset and
     choosing a paging mode based on CR0/CR4/EFER
   - support for 5-level page table on AMD processors

  Generic:
   - MMU notifier invalidation callbacks do not take mmu_lock unless
     necessary
   - improved caching of LRU kvm_memory_slot
   - support for histogram statistics
   - add statistics for halt polling and remote TLB flush requests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
  KVM: Drop unused kvm_dirty_gfn_invalid()
  KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
  KVM: MMU: mark role_regs and role accessors as maybe unused
  KVM: MIPS: Remove a "set but not used" variable
  x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
  KVM: stats: Add VM stat for remote tlb flush requests
  KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
  KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
  KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
  Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
  KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
  kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
  kvm: x86: Increase MAX_VCPUS to 1024
  kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
  KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
  KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
  KVM: s390: index kvm->arch.idle_mask by vcpu_idx
  KVM: s390: Enable specification exception interpretation
  KVM: arm64: Trim guest debug exception handling
  KVM: SVM: Add 5-level page table support for SVM
  ...
2021-09-07 13:40:51 -07:00
Sean Christopherson
f7782bb8d8 KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled.  If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due to
the new L2 not having a valid nested.pi_desc.

Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag.  Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt.  Fixing the mess is a future problem.

Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled.  Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
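
The essence of the fix is to clear the flag unconditionally when
building vmcs02, roughly (sketch, not the exact diff):

  /* nested VM-Enter / prepare_vmcs02() path */
  vmx->nested.pi_pending = false;          /* always reset stale state */
  if (nested_cpu_has_posted_intr(vmcs12))
          vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;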

Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb896 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810144526.2662272-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:17 -04:00
Sean Christopherson
389ab25216 KVM: nVMX: Pull KVM L0's desired controls directly from vmcs01
When preparing controls for vmcs02, grab KVM's desired controls from
vmcs01's shadow state instead of recalculating the controls from scratch,
or, for the secondary execution controls, instead of using the dedicated
cache.  Calculating secondary exec controls is eye-poppingly expensive
due to the guest CPUID checks, hence the dedicated cache, but the other
calculations aren't exactly free either.

Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on
exit) as appropriate as they may be set in vmcs01, whereas the previous
implementation relied on dynamic bits being cleared in the calculator.

Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from
vmcs01 to vmcs02.  Whether or not PERF_GLOBAL_CTRL is loaded depends on
whether or not perf itself is active, so unless perf stops between the
exit from L1 and entry to L2, vmcs01 will hold the desired value.  This
is purely an optimization as atomic_switch_perf_msrs() will set/clear
the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in
the case where perf is active (versus starting with the bits clear in
vmcs02, which was the previous behavior).

Cc: Zeng Guang <guang.zeng@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:35:15 -04:00
Sean Christopherson
18712c1370 KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1.  At best, L1 will complain and inject the #PF back into L2.  At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.
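
A simplified sketch of the corrected reflection check for #PF:

  /* nested_vmx_l0_wants_exit(), EXIT_REASON_EXCEPTION_NMI case */
  if (is_page_fault(intr_info))
          /* async #PF and MAXPHYADDR emulation are L0's problem */
          return vcpu->arch.apf.host_apf_flags ||
                 vmx_need_pf_intercept(vcpu);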

Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347b ("KVM: VMX: introduce vmx_need_pf_intercept").

Fixes: 1dbf5d68af ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Junaid Shahid
85aa8889b8 kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13 03:20:58 -04:00
Sean Christopherson
284036c644 KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitions
Drop unnecessary MSR bitmap updates during nested transitions, as L1's
APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows,
and L2's MSR bitmap is managed separately.  In the unlikely event that L1
is pathological and loads APIC_BASE via the VM-Exit load list, KVM will
handle updating the bitmap in its normal WRMSR flows.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-39-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:58 -04:00
Sean Christopherson
816be9e9be KVM: nVMX: Don't evaluate "emulation required" on nested VM-Exit
Use the "internal" variants of setting segment registers when stuffing
state on nested VM-Exit in order to skip the "emulation required"
updates.  VM-Exit must always go to protected mode, and all segments are
mostly hardcoded (to valid values) on VM-Exit.  The bits of the segments
that aren't hardcoded are explicitly checked during VM-Enter, e.g. the
selector RPLs must all be zero.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:55 -04:00
Sean Christopherson
afc8de0118 KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit
Set L1's LDTR on VM-Exit per the Intel SDM:

  The host-state area does not contain a selector field for LDTR. LDTR is
  established as follows on all VM exits: the selector is cleared to
  0000H, the segment is marked unusable and is otherwise undefined
  (although the base address is always canonical).

This is likely a benign bug since the LDTR is unusable, as it means the
L1 VMM is conditioned to reload its LDTR in order to function properly on
bare metal.
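
A sketch of how the host-state load now stuffs LDTR (field names per
struct kvm_segment):

  /* load_vmcs12_host_state() */
  struct kvm_segment seg;

  memset(&seg, 0, sizeof(seg));   /* selector = 0000H, base/limit/attr = 0 */
  seg.unusable = 1;               /* LDTR is unusable after VM-Exit */
  __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR);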

Fixes: 4704d0befb ("KVM: nVMX: Exiting from L2 to L1")
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210713163324.627647-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02 11:01:48 -04:00
Sean Christopherson
616007c866 KVM: x86: Enhance comments for MMU roles and nested transition trickiness
Expand the comments for the MMU roles.  The interactions with gfn_track
PGD reuse in particular are hairy.

Regarding PGD reuse, add comments in the nested virtualization flows to
call out why kvm_init_mmu() is unconditionally called even when nested
TDP is used.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622175739.3610207-50-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 18:00:47 -04:00
Sean Christopherson
b33bb78a1f KVM: nVMX: Handle split-lock #AC exceptions that happen in L2
Mark #ACs that won't be reinjected to the guest as wanted by L0 so that
KVM handles split-lock #AC from L2 instead of forwarding the exception to
L1.  Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a
regular #AC and do the wrong thing, e.g. reinject it into L2.

Fixes: e6f8b6c12f ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest")
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210622172244.3561540-1-seanjc@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24 04:31:16 -04:00
Sean Christopherson
ba1f82456b KVM: nVMX: Dynamically compute max VMCS index for vmcs12
Calculate the max VMCS index for vmcs12 by walking the array to find the
actual max index.  Hardcoding the index is prone to bitrot, and the
calculation is only done on KVM bringup (albeit on every CPU, but there
aren't _that_ many null entries in the array).
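
Conceptually, bringup now derives the value by scanning the field table
instead of hardcoding it; a rough sketch with hypothetical helper names:

  unsigned int i, idx, max_idx = 0;

  for (i = 0; i < nr_vmcs12_fields; i++) {
          if (!vmcs12_field_offset(i))        /* hole in the sparse table */
                  continue;
          idx = vmcs_field_index(vmcs12_idx_to_encoding(i));
          if (idx > max_idx)
                  max_idx = idx;
  }
  msrs->vmcs_enum = (u64)max_idx << 1;    /* field index lives in bits 9:1 */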

Fixes: 3c0f99366e ("KVM: nVMX: Add a TSC multiplier field in VMCS12")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210618214658.2700765-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-21 12:58:55 -04:00
Sean Christopherson
c5ffd408cd KVM: nVMX: Drop redundant checks on vmcs12 in EPTP switching emulation
Drop the explicit check on EPTP switching being enabled.  The EPTP
switching check is handled in the generic VMFUNC function check, while
the underlying VMFUNC enablement check is done by hardware and redone
by generic VMFUNC emulation.

The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a
consistency check, keep it but add a WARN.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
546e8398bc KVM: nVMX: WARN if subtly-impossible VMFUNC conditions occur
WARN and inject #UD when emulating VMFUNC for L2 if the function is
out-of-bounds or if VMFUNC is not enabled in vmcs12.  Neither condition
should occur in practice, as the CPU is supposed to prioritize the #UD
over VM-Exit for out-of-bounds input and KVM is supposed to enable
VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of
those dependencies is obvious.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
c906066288 KVM: x86: Drop pointless @reset_roots from kvm_init_mmu()
Remove the @reset_roots param from kvm_init_mmu(); its one user,
kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
invalidated all roots.  This also happens to be why the reset_roots=true
path doesn't leak roots; they're already invalid.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:54 -04:00
Sean Christopherson
39353ab579 KVM: nVMX: Use fast PGD switch when emulating VMFUNC[EPTP_SWITCH]
Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate
VMFUNC[EPTP_SWITCH] instead of nuking all MMUs.  EPTP_SWITCH is the EPT
equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow,
the only hiccup being that A/D enabling is buried in the EPTP.  But, that
is easily handled by bouncing through kvm_init_shadow_ept_mmu().

Explicitly request a guest TLB flush if VPID is disabled.  Per Intel's
SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined
mappings associated with VPID 0000H (for all PCIDs and for all EP4TA
values, where EP4TA is the value of bits 51:12 of EPTP)".

Note, this technically is a very bizarre bug fix of sorts if L2 is using
PAE paging, as avoiding the full MMU reload also avoids incorrectly
reloading the PDPTEs, which the SDM explicitly states are not touched:

  If PAE paging is in use, an EPTP-switching VMFUNC does not load the
  four page-directory-pointer-table entries (PDPTEs) from the
  guest-physical address in CR3. The logical processor continues to use
  the four guest-physical addresses already present in the PDPTEs. The
  guest-physical address in CR3 is not translated through the new EPT
  paging structures (until some operation that would load the PDPTEs).

In addition to optimizing L2's MMU shenanigans, avoiding the full reload
also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both
root_mmu and guest_mmu.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
25b62c6274 KVM: nVMX: Free only guest_mode (L2) roots on INVVPID w/o EPT
When emulating INVVPID for L1, free only L2+ roots, using the guest_mode
tag in the MMU role to identify L2+ roots.  From L1's perspective, its
own TLB entries use VPID=0, and INVVPID is not required to invalidate such
entries.  Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0,
but it is not required to do so.

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
50a417962a KVM: nVMX: Consolidate VM-Enter/VM-Exit TLB flush and MMU sync logic
Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync
is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the
all-encompassing nested_vmx_transition_tlb_flush().

Opportunistically add a comment explaining why nested EPT never needs to
sync the MMU on VM-Enter.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:53 -04:00
Sean Christopherson
b512910039 KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers
Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
all call sites unconditionally skip both the sync and flush.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:52 -04:00
Sean Christopherson
272b0a998d KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switch
Drop bogus logic that incorrectly clobbers the accessed/dirty enabling
status of the nested MMU on an EPTP switch.  When nested EPT is enabled,
walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2.

This is likely a benign bug, as mmu->ept_ad is never consumed (since the
MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will
never propagate into future shadow pages since the nested MMU isn't used
to map anything, just to walk L2's page tables.

Note, KVM also does a full MMU reload, i.e. the guest_mmu will be
recreated using the new EPTP, and thus any change in A/D enabling will be
properly recognized in the relevant MMU.

Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Sean Christopherson
0e75225dfa KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap
Use BIT_ULL() instead of an open-coded shift to check whether or not a
function is enabled in L1's VMFUNC bitmap.  This is a benign bug as KVM
supports only bit 0, and will fail VM-Enter if any other bits are set,
i.e. bits 63:32 are guaranteed to be zero.

Note, "function" is bounded by hardware as VMFUNC will #UD before taking
a VM-Exit if the function is greater than 63.

Before:
  if ((vmcs12->vm_function_control & (1 << function)) == 0)
   0x000000000001a916 <+118>:	mov    $0x1,%eax
   0x000000000001a91b <+123>:	shl    %cl,%eax
   0x000000000001a91d <+125>:	cltq
   0x000000000001a91f <+127>:	and    0x128(%rbx),%rax

After:
  if (!(vmcs12->vm_function_control & BIT_ULL(function & 63)))
   0x000000000001a955 <+117>:	mov    0x128(%rbx),%rdx
   0x000000000001a95c <+124>:	bt     %rax,%rdx

Fixes: 27c42a1bb8 ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Sean Christopherson
07ffaf343e KVM: nVMX: Sync all PGDs on nested transition with shadow paging
Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
current PGD, which can theoretically leave stale, unsync'd entries in a
previous guest PGD, which could be consumed if L2 is allowed to load CR3
with PCID_NOFLUSH=1.

Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
be utilized for its obvious purpose of emulating a guest TLB flush.

Note, there is no change to the actual TLB flush executed by KVM, even
though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
disabled for L2, vpid02 is guaranteed to be '0', and thus
nested_get_vpid02() will return the VPID that is shared by L1 and L2.

Generate the request outside of kvm_mmu_new_pgd(), as getting the common
helper to correctly identify which request is needed is quite painful.
E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
a TLB flush from the L1 kernel's perspective does not invalidate EPT
mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
simplification by moving the logic into nested_vmx_transition_tlb_flush().
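
Conceptually, when L2 runs without VPID the transition behaves like this
(simplified; the exact placement of the request differs in the patch):

  /* nested VMX transition, sketch */
  if (!nested_cpu_has_vpid(vmcs12)) {
          /*
           * Without a VPID for L2, the transition is architecturally a
           * full TLB flush from the guest's perspective, so flush all
           * guest mappings rather than only the current PGD.
           */
          kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
  }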

Fixes: 41fab65e7c ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210609234235.1244004-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:51 -04:00
Vitaly Kuznetsov
8629b625e0 KVM: nVMX: Request to sync eVMCS from VMCS12 after migration
VMCS12 is used to keep the authoritative state during nested state
migration. In case 'need_vmcs12_to_shadow_sync' flag is set, we're
in between L2->L1 vmexit and L1 guest run when actual sync to
enlightened (or shadow) VMCS happens. Nested state, however, has
no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()->
set_current_vmptr() always sets it. Enlightened vmptrld path, however,
doesn't have the quirk so some VMCS12 changes may not get properly
reflected to eVMCS and L1 will see an incorrect state.

Note, during L2 execution or when need_vmcs12_to_shadow_sync is not
set the change is effectively a nop: in the former case all changes
will get reflected during the first L2->L1 vmexit and in the latter
case VMCS12 and eVMCS are already in sync (thanks to
copy_enlightened_to_vmcs12() in vmx_get_nested_state()).

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-11-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
dc31338552 KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02()
When nested state migration happens during L1's execution, it
is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment.
At least genuine Hyper-V seems to not be very happy when 'clean fields'
data changes underneath it.

'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12()
and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead.

While at it, update a comment stating why exactly we need to reset
'hv_clean_fields' data from L0.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-10-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
b7685cfd5e KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid()
'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened
VMCS sync when we exit to L1. The comment in nested_vmx_failValid()
validly states why shadow vmcs sync can be omitted but this doesn't
apply to enlightened VMCS as it 'shadows' all VMCS12 fields.
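
A simplified sketch of the fix:

  /* nested_vmx_failValid() */
  if (evmptr_is_valid(vmx->nested.hv_evmcs_vmptr))
          /*
           * Shadow VMCS sync can be skipped (VM_INSTRUCTION_ERROR is not
           * shadowed), but eVMCS 'shadows' all VMCS12 fields, so force a
           * sync before returning to L1.
           */
          vmx->nested.need_vmcs12_to_shadow_sync = true;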

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov
d6bf71a18c KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state()
'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1
hypervisor is not obliged to keep it up-to-date while it is mangling L2's
state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual
eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration
VMCS12 is used as a source of ultimate truth, we must make sure we pick all
the changes to eVMCS and thus 'clean fields' data must be ignored.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
3b19b81acf KVM: nVMX: Release enlightened VMCS on VMCLEAR
Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when
enlightened VMCS is in use. TLFS has the following brief description:
"The L1 hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state". Normally,
this change can be ignored as unmapping active eVMCS can be postponed
until the next VMLAUNCH instruction but in case nested state is migrated
with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped
may result in its synchronization with VMCS12 and this is incorrect:
L1 hypervisor is free to reuse inactive eVMCS memory for something else.

Inactive eVMCS after VMCLEAR can just be unmapped.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
278499686b KVM: nVMX: Introduce 'EVMPTR_MAP_PENDING' post-migration state
Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld()
can not be called directly from vmx_set_nested_state() as KVM may not have
all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be
restored yet). Enlightened VMCS is mapped later while getting nested state
pages. In the meantime, vmx->nested.hv_evmcs_vmptr remains 'EVMPTR_INVALID'
and it's indistinguishable from 'evmcs is not in use' case. This leads to
certain issues, in particular, if KVM_GET_NESTED_STATE is called right
after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the resulting
state will be unset (and such state will later fail to load).

Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after
restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12()
is no longer needed.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
25641cafab KVM: nVMX: Make copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() return 'void'
copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return any result,
make them return 'void'.

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov
0276171680 KVM: nVMX: Release eVMCS when enlightened VMENTRY was disabled
In theory, L1 can try to disable enlightened VMENTRY in VP assist page and
try to issue VMLAUNCH/VMRESUME. While nested_vmx_handle_enlightened_vmptrld()
properly handles this as 'EVMPTRLD_DISABLED', previously mapped eVMCS
remains mapped and thus all evmptr_is_valid() checks will still pass and
nested_vmx_run() will proceed when it shouldn't.

Release eVMCS immediately when we detect that enlightened vmentry was
disabled by L1.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov
6a789ca5d5 KVM: nVMX: Don't set 'dirty_vmcs12' flag on enlightened VMPTRLD
'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02()
and both checks look like:

 'vmx->nested.dirty_vmcs12 || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)'

so for eVMCS case the flag changes nothing. Drop the assignment to avoid
the confusion.

No functional change intended.

Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov
1e9dfbd748 KVM: nVMX: Use '-1' in 'hv_evmcs_vmptr' to indicate that eVMCS is not in use
Instead of checking 'vmx->nested.hv_evmcs' use '-1' in
'vmx->nested.hv_evmcs_vmptr' to indicate 'evmcs is not in use' state. This
matches how we check 'vmx->nested.current_vmptr'. Introduce EVMPTR_INVALID
and evmptr_is_valid() and use it instead of raw '-1' check as a preparation
to adding other 'special' values.
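
Roughly, the new scheme looks like this (definitions approximate; the
MAP_PENDING value comes from a follow-up patch in the same series):

  #define EVMPTR_INVALID		(-1ULL)
  #define EVMPTR_MAP_PENDING	(-2ULL)

  static inline bool evmptr_is_valid(u64 vmptr)
  {
          return vmptr != EVMPTR_INVALID && vmptr != EVMPTR_MAP_PENDING;
  }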

No functional change intended.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210526132026.270394-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Maxim Levitsky
158a48ecf7 KVM: x86: avoid loading PDPTRs after migration when possible
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
part of the migration state and are correctly
restored by those ioctls.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:48 -04:00
Maxim Levitsky
0f85722341 KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
Similar to the rest of guest page accesses after a migration,
this access should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210607090203.133058-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:47 -04:00
Sean Christopherson
bcb72d0627 KVM: nVMX: Drop obsolete (and pointless) pdptrs_changed() check
Remove the pdptrs_changed() check when loading L2's CR3.  The set of
available registers is always reset when switching VMCSes (see commit
e5d03de593, "KVM: nVMX: Reset register cache (available and dirty
masks) on VMCS switch"), thus the "are PDPTRs available" check will
always fail.  And even if it didn't fail, reading guest memory to check
the PDPTRs is just as expensive as reading guest memory to load 'em.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210607090203.133058-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:46 -04:00
Krish Sadhukhan
b93af02c67 KVM: nVMX: nSVM: 'nested_run' should count guest-entry attempts that make it to guest code
Currently, the 'nested_run' statistic counts all guest-entry attempts,
including those that fail during vmentry checks on Intel and during
consistency checks on AMD. Convert this statistic to count only those
guest-entries that make it past these state checks and make it to guest
code. This will tell us the number of guest-entries that actually executed
or tried to execute guest code.

Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
Message-Id: <20210609180340.104248-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:35 -04:00
Jim Mattson
966eefb896 KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable
Don't allow posted interrupts to modify a stale posted interrupt
descriptor (including the initial value of 0).

Empirical tests on real hardware reveal that a posted interrupt
descriptor referencing an unbacked address has PCI bus error semantics
(reads as all 1's; writes are ignored). However, kvm can't distinguish
unbacked addresses from device-backed (MMIO) addresses, so it should
really ask userspace for an MMIO completion. That's overly
complicated, so just punt with KVM_INTERNAL_ERROR.

Don't return the error until the posted interrupt descriptor is
actually accessed. We don't want to break the existing kvm-unit-tests
that assume they can launch an L2 VM with a posted interrupt
descriptor that references MMIO space in L1.

Fixes: 6beb7bd52e ("kvm: nVMX: Refactor nested_get_vmcs12_pages()")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-8-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
0fe998b295 KVM: nVMX: Fail on MMIO completion for nested posted interrupts
When the kernel has no mapping for the vmcs02 virtual APIC page,
userspace MMIO completion is necessary to process nested posted
interrupts. This is not a configuration that KVM supports. Rather than
silently ignoring the problem, try to exit to userspace with
KVM_INTERNAL_ERROR.

Note that the event that triggers this error is consumed as a
side-effect of a call to kvm_check_nested_events. On some paths
(notably through kvm_vcpu_check_block), the error is dropped. In any
case, this is an incremental improvement over always ignoring the
error.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210604172611.281819-7-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:31 -04:00
Jim Mattson
650293c3de KVM: nVMX: Add a return code to vmx_complete_nested_posted_interrupt
No functional change intended.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210604172611.281819-4-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:30 -04:00
Ilias Stamatis
d041b5ea93 KVM: nVMX: Enable nested TSC scaling
Calculate the TSC offset and multiplier on nested transitions and expose
the TSC scaling feature to L1.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210526184418.28881-11-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Ilias Stamatis
1ab9287add KVM: X86: Add vendor callbacks for writing the TSC multiplier
Currently vmx_vcpu_load_vmcs() writes the TSC_MULTIPLIER field of the
VMCS every time the VMCS is loaded. Instead of doing this, set this
field from common code on initialization and whenever the scaling ratio
changes.

Additionally remove vmx->current_tsc_ratio. This field is redundant as
vcpu->arch.tsc_scaling_ratio already tracks the current TSC scaling
ratio. The vmx->current_tsc_ratio field is only used for avoiding
unnecessary writes but it is no longer needed after removing the code
from the VMCS load path.
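
A sketch of the new VMX callback (names approximate):

  static void vmx_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
  {
          vmcs_write64(TSC_MULTIPLIER, multiplier);
  }

  /* common code invokes the hook only when the scaling ratio changes,
   * instead of rewriting TSC_MULTIPLIER on every VMCS load */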

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Message-Id: <20210607105438.16541-1-ilstam@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-17 13:09:29 -04:00
Vitaly Kuznetsov
f5c7e8425f KVM: nVMX: Always make an attempt to map eVMCS after migration
When enlightened VMCS is in use and nested state is migrated with
vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs
page right away: the eVMCS GPA is not part of 'struct kvm_vmx_nested_state_hdr'
and we can't read it from the VP assist page because userspace may decide
to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state
(and QEMU, for example, does exactly that). To make sure eVMCS is
mapped, vmx_set_nested_state() raises a KVM_REQ_GET_NESTED_STATE_PAGES
request.

Commit f2c7ef3ba9 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to
nested_vmx_vmexit() to make sure MSR permission bitmap is not switched
when an immediate exit from L2 to L1 happens right after migration (caused
by a pending event, for example). Unfortunately, in the exact same
situation we still need to have eVMCS mapped so
nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS.

As a band-aid, restore nested_get_evmcs_page() when clearing
KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far
from being ideal as we can't easily propagate possible failures and even if
we could, this is most likely already too late to do so. The whole
'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration
seems to be fragile as we diverge too much from the 'native' path when
vmptr loading happens on vmx_set_nested_state().

Fixes: f2c7ef3ba9 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210503150854.1144255-2-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07 06:06:11 -04:00
Linus Torvalds
152d32aa84 ARM:
- Stage-2 isolation for the host kernel when running in protected mode
 
 - Guest SVE support when running in nVHE mode
 
 - Force W^X hypervisor mappings in nVHE mode
 
 - ITS save/restore for guests using direct injection with GICv4.1
 
 - nVHE panics now produce readable backtraces
 
 - Guest support for PTP using the ptp_kvm driver
 
 - Performance improvements in the S2 fault handler
 
 x86:
 
 - Optimizations and cleanup of nested SVM code
 
 - AMD: Support for virtual SPEC_CTRL
 
 - Optimizations of the new MMU code: fast invalidation,
   zap under read lock, enable/disable dirty page logging under
   read lock
 
 - /dev/kvm API for AMD SEV live migration (guest API coming soon)
 
 - support SEV virtual machines sharing the same encryption context
 
 - support SGX in virtual machines
 
 - add a few more statistics
 
 - improved directed yield heuristics
 
 - Lots and lots of cleanups
 
 Generic:
 
 - Rework of MMU notifier interface, simplifying and optimizing
 the architecture-specific code
 
 - Some selftests improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
 y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
 c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
 Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
 +2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
 M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
 =AXUi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "This is a large update by KVM standards, including AMD PSP (Platform
  Security Processor, aka "AMD Secure Technology") and ARM CoreSight
  (debug and trace) changes.

  ARM:

   - CoreSight: Add support for ETE and TRBE

   - Stage-2 isolation for the host kernel when running in protected
     mode

   - Guest SVE support when running in nVHE mode

   - Force W^X hypervisor mappings in nVHE mode

   - ITS save/restore for guests using direct injection with GICv4.1

   - nVHE panics now produce readable backtraces

   - Guest support for PTP using the ptp_kvm driver

   - Performance improvements in the S2 fault handler

  x86:

   - AMD PSP driver changes

   - Optimizations and cleanup of nested SVM code

   - AMD: Support for virtual SPEC_CTRL

   - Optimizations of the new MMU code: fast invalidation, zap under
     read lock, enable/disable dirty page logging under read lock

   - /dev/kvm API for AMD SEV live migration (guest API coming soon)

   - support SEV virtual machines sharing the same encryption context

   - support SGX in virtual machines

   - add a few more statistics

   - improved directed yield heuristics

   - Lots and lots of cleanups

  Generic:

   - Rework of MMU notifier interface, simplifying and optimizing the
     architecture-specific code

   - a handful of "Get rid of oprofile leftovers" patches

   - Some selftests improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
  KVM: selftests: Speed up set_memory_region_test
  selftests: kvm: Fix the check of return value
  KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
  KVM: SVM: Skip SEV cache flush if no ASIDs have been used
  KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
  KVM: SVM: Drop redundant svm_sev_enabled() helper
  KVM: SVM: Move SEV VMCB tracking allocation to sev.c
  KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
  KVM: SVM: Unconditionally invoke sev_hardware_teardown()
  KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
  KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
  KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
  KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
  KVM: SVM: Move SEV module params/variables to sev.c
  KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
  KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
  KVM: SVM: Zero out the VMCB array used to track SEV ASID association
  x86/sev: Drop redundant and potentially misleading 'sev_enabled'
  KVM: x86: Move reverse CPUID helpers to separate header file
  KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
  ...
2021-05-01 10:14:08 -07:00
Sean Christopherson
27b4a9c454 KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
Append raw to the direct variants of kvm_register_read/write(), and
drop the "l" from the mode-aware variants.  I.e. make the mode-aware
variants the default, and make the direct variants scary sounding so as
to discourage use.  Accessing the full 64-bit values irrespective of
mode is rarely the desired behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:13 -04:00
Sean Christopherson
82277eeed6 KVM: nVMX: Truncate base/index GPR value on address calc in !64-bit
Drop bits 63:32 of the base and/or index GPRs when calculating the
effective address of a VMX instruction memory operand.  Outside of 64-bit
mode, memory encodings are strictly limited to E*X and below.
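
A sketch of the address calculation after the fix (variable names
approximate):

  /* get_vmx_mem_address() */
  if (base_is_valid)
          off += kvm_register_read(vcpu, base_reg);
  if (index_is_valid)
          off += kvm_register_read(vcpu, index_reg) << scaling;
  /*
   * kvm_register_read(), the mode-aware variant, truncates the value
   * to 32 bits outside of 64-bit mode, so bits 63:32 of the GPRs no
   * longer leak into the effective address.
   */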

Fixes: 064aea7747 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:12 -04:00
Sean Christopherson
ee050a5775 KVM: nVMX: Truncate bits 63:32 of VMCS field on nested check in !64-bit
Drop bits 63:32 of the VMCS field encoding when checking for a nested
VM-Exit on VMREAD/VMWRITE in !64-bit mode.  VMREAD and VMWRITE always
use 32-bit operands outside of 64-bit mode.

The actual emulation of VMREAD/VMWRITE does the right thing; this bug is
purely limited to incorrectly causing a nested VM-Exit if a GPR happens
to have bits 63:32 set outside of 64-bit mode.

Fixes: a7cde481b6 ("KVM: nVMX: Do not forward VMREAD/VMWRITE VMExits to L1 if required so by vmcs12 vmread/vmwrite bitmaps")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422022128.3464144-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:12 -04:00
Sean Christopherson
dbdd096a5a KVM: VMX: Intercept FS/GS_BASE MSR accesses for 32-bit KVM
Disable pass-through of the FS and GS base MSRs for 32-bit KVM.  Intel's
SDM unequivocally states that the MSRs exist if and only if the CPU
supports x86-64.  FS_BASE and GS_BASE are mostly a non-issue; a clever
guest could opportunistically use the MSRs without issue.  KERNEL_GS_BASE
is a bigger problem, as a clever guest would subtly be broken if it were
migrated, as KVM disallows software access to the MSRs, and unlike the
direct variants, KERNEL_GS_BASE needs to be explicitly migrated as it's
not captured in the VMCS.
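
The change boils down to compiling the pass-through entries only for
64-bit builds, roughly (array name approximate):

  /* MSRs that KVM may pass through to the guest */
  static u32 vmx_possible_passthrough_msrs[] = {
  #ifdef CONFIG_X86_64
          MSR_FS_BASE,
          MSR_GS_BASE,
          MSR_KERNEL_GS_BASE,
  #endif
          MSR_IA32_SPEC_CTRL,
          /* ... */
  };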

Fixes: 25c5f225be ("KVM: VMX: Enable MSR Bitmap feature")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210422023831.3473491-1-seanjc@google.com>
[*NOT* for stable kernels. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-26 05:27:10 -04:00
Sean Christopherson
72add915fb KVM: VMX: Enable SGX virtualization for SGX1, SGX2 and LC
Enable SGX virtualization now that KVM has the VM-Exit handlers needed
to trap-and-execute ENCLS to ensure correctness and/or enforce the CPU
model exposed to the guest.  Add a KVM module param, "sgx", to allow an
admin to disable SGX virtualization independent of the kernel.

When supported in hardware and the kernel, advertise SGX1, SGX2 and SGX
LC to userspace via CPUID and wire up the ENCLS_EXITING bitmap based on
the guest's SGX capabilities, i.e. to allow ENCLS to be executed in an
SGX-enabled guest.  With the exception of the provision key, all SGX
attribute bits may be exposed to the guest.  Guest access to the
provision key, which is controlled via securityfs, will be added in a
future patch.

Note, KVM does not yet support exposing ENCLS_C leafs or ENCLV leafs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <a99e9c23310c79f2f4175c1af4c4cbcef913c3e5.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:56 -04:00
Sean Christopherson
3c0c2ad1ae KVM: VMX: Add basic handling of VM-Exit from SGX enclave
Add support for handling VM-Exits that originate from a guest SGX
enclave.  In SGX, an "enclave" is a new CPL3-only execution environment,
wherein the CPU and memory state is protected by hardware to make the
state inaccessible to code running outside of the enclave.  When exiting
an enclave due to an asynchronous event (from the perspective of the
enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
is automatically saved and scrubbed (the CPU loads synthetic state), and
then reloaded when re-entering the enclave.  E.g. after an instruction
based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
of the enclave instruction that triggered VM-Exit, but will instead point
to a RIP in the enclave's untrusted runtime (the guest userspace code
that coordinates entry/exit to/from the enclave).

To help a VMM recognize and handle exits from enclaves, SGX adds bits to
existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR.  Define the
new architectural bits, and add a boolean to struct vcpu_vmx to cache
VMX_EXIT_REASON_FROM_ENCLAVE.  Clear the bit in exit_reason so that
checks against exit_reason do not need to account for SGX, e.g.
"if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.

KVM is largely a passive observer of the new bits, e.g. KVM needs to
account for the bits when propagating information to a nested VMM, but
otherwise doesn't need to act differently for the majority of VM-Exits
from enclaves.

The one scenario that is directly impacted is emulation, which is for
all intents and purposes impossible[1] since KVM does not have access to
the RIP or instruction stream that triggered the VM-Exit.  The inability
to emulate is a non-issue for KVM, as most instructions that might
trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
check.  For the few instruction that conditionally #UD, KVM either never
sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
if the feature is not exposed to the guest in order to inject a #UD,
e.g. RDRAND_EXITING.

But, because it is still possible for a guest to trigger emulation,
e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
from an enclave.  This is architecturally accurate for instruction
VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
to killing the VM.  In practice, only broken or particularly stupid
guests should ever encounter this behavior.

Add a WARN in skip_emulated_instruction to detect any attempt to
modify the guest's RIP during an SGX enclave VM-Exit as all such flows
should either be unreachable or must handle exits from enclaves before
getting to skip_emulated_instruction.

[1] Impossible for all practical purposes.  Not truly impossible
    since KVM could implement some form of para-virtualization scheme.

[2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
    CPL3, so we also don't need to worry about that interaction.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-20 04:18:54 -04:00
Maxim Levitsky
4020da3b9f KVM: x86: pending exceptions must not be blocked by an injected event
Injected interrupts/NMIs should not block a pending exception;
rather, the pending exception should either be lost if the nested
hypervisor doesn't intercept it (as in stock x86), or be delivered
in the exitintinfo/IDT_VECTORING_INFO field, as part of the VMexit
that corresponds to the pending exception.

The only reason for an exception to be blocked is when nested run
is pending (and that can't really happen currently
but still worth checking for).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-04-17 08:31:02 -04:00
Ingo Molnar
163b099146 x86: Fix various typos in comments, take #2
Fix another ~42 single-word typos in arch/x86/ code comments,
missed a few in the first pass, in particular in .S files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
2021-03-21 23:50:28 +01:00
Sean Christopherson
cb6a32c2b8 KVM: x86: Handle triple fault in L2 without killing L1
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1.  Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_ intercepted by L1.  E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.

Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:43:15 -04:00
Sean Christopherson
648fc8ae37 KVM: x86: Move nVMX's consistency check macro to common code
Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.

Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:36 -04:00
Sean Christopherson
c805f5d558 KVM: nVMX: Defer the MMU reload to the normal path on an EPTP switch
Defer reloading the MMU until after a successful EPTP switch.  The VMFUNC
instruction itself is executed in the previous EPTP context; any side
effects, e.g. updating RIP, should occur in the old context.  Practically
speaking, this bug is benign as VMX doesn't touch the MMU when skipping
an emulated instruction, nor does queuing a single-step #DB.  No other
post-switch side effects exist.

Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:42:21 -04:00
Dongli Zhang
43c11d91fb KVM: x86: to track if L1 is running L2 VM
The new per-cpu stat 'nested_run' is introduced to track whether an L1 VM
is running, or has been used to run, an L2 VM.

One use of 'nested_run' is to help the host administrator easily see
whether any L1 VM has been used to run an L2 VM. If an issue may be
related to nested virtualization, the administrator can quickly narrow
down and confirm whether nested virtualization is involved via
'nested_run', e.g. to decide whether a fix like commit 88dddc11a8
("KVM: nVMX: do not use dangling shadow VMCS after guest reset") is
required.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210305225747.7682-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15 04:28:02 -04:00
Makarand Sonare
a85863c2ec KVM: VMX: Dynamically enable/disable PML based on memslot dirty logging
Currently, if enable_pml=1 PML remains enabled for the entire lifetime
of the VM irrespective of whether dirty logging is enabled or disabled.
When dirty logging is disabled, all the pages of the VM are manually
marked dirty, so that PML is effectively non-operational.  Setting
the dirty bits is an expensive operation which can cause severe MMU
lock contention in a performance sensitive path when dirty logging is
disabled after a failed or canceled live migration.

Manually setting dirty bits also fails to prevent PML activity if some
code path clears dirty bits, which can incur unnecessary VM-Exits.

In order to avoid this extra overhead, dynamically enable/disable PML
when dirty logging gets turned on/off for the first/last memslot.

Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:34 -05:00
Sean Christopherson
c3bb9a2083 KVM: nVMX: Disable PML in hardware when running L2
Unconditionally disable PML in vmcs02, KVM emulates PML purely in the
MMU, e.g. vmx_flush_pml_buffer() doesn't even try to copy the L2 GPAs
from vmcs02's buffer to vmcs12.  At best, enabling PML is a nop.  At
worst, it will cause vmx_flush_pml_buffer() to record bogus GFNs in the
dirty logs.

Initialize vmcs02.GUEST_PML_INDEX such that PML writes would trigger
VM-Exit if PML was somehow enabled, skip flushing the buffer for guest
mode since the index is bogus, and freak out if a PML full exit occurs
when L2 is active.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210213005015.1651772-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-19 03:08:32 -05:00
Paolo Bonzini
78e550bad2 KVM: nVMX: no need to undo inject_page_fault change on nested vmexit
This is not needed because the tweak was done on the guest_mmu, while
nested_ept_uninit_mmu_context has just changed vcpu->arch.walk_mmu
back to the root_mmu.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-18 07:33:31 -05:00
Sean Christopherson
636e8b7334 KVM: VMX: Use GPA legality helpers to replace open coded equivalents
Replace a variety of open coded GPA checks with the recently introduced
common helpers.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 09:27:28 -05:00
Uros Bizjak
150f17bfab KVM/nVMX: Use __vmx_vcpu_run in nested_vmx_check_vmentry_hw
Replace inline assembly in nested_vmx_check_vmentry_hw
with a call to __vmx_vcpu_run.  The function is not
performance critical, so (double) GPR save/restore
in __vmx_vcpu_run can be tolerated, as far as performance
effects are concerned.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: dropped versioning info from changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20201231002702.2223707-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:32 -05:00
Jason Baron
b6a7cc3544 KVM: X86: prepend vmx/svm prefix to additional kvm_x86_ops functions
A subsequent patch introduces macros in preparation for simplifying the
definitions of vmx_x86_ops and svm_x86_ops. Making the naming more uniform
expands the coverage of the macros. Add a vmx/svm prefix to the following
functions: update_exception_bitmap(), enable_nmi_window(),
enable_irq_window(), update_cr8_intercept() and enable_smi_window().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:29 -05:00
Chenyi Qiang
9a3ecd5e2a KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared
to 0 when the condition (e.g. RTM) happens. The value can be used to
initialize dr6 and also be the XOR mask between the #DB exit
qualification (or payload) and DR6.

Since DR6_INIT is used as an initial value only once, rename it
to DR6_ACTIVE_LOW and apply it in other places as well, which makes the
incoming changes for the bus lock debug exception simpler.
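
For reference, a standalone illustration of the active-low convention; the bit
values reflect my reading of the SDM and the kernel headers and should be
double-checked rather than taken as authoritative:

    #include <inttypes.h>
    #include <stdio.h>

    #define DR6_ACTIVE_LOW  0xffff0ff0ULL   /* bits that read as 1 when inactive */
    #define DR6_VOLATILE    0x0001e00fULL   /* B0-B3, BD, BS, BT, RTM */
    #define DR6_FIXED_1     (DR6_ACTIVE_LOW & ~DR6_VOLATILE)

    int main(void)
    {
            /* A #DB payload uses "positive" polarity; XOR with DR6_ACTIVE_LOW
             * converts it to the architectural DR6 value (and back). */
            uint64_t payload = (1ULL << 0) | (1ULL << 16);  /* B0 hit, RTM bit cleared */
            uint64_t dr6 = DR6_ACTIVE_LOW ^ payload;

            printf("initial dr6 = %#" PRIx64 "\n", (uint64_t)DR6_ACTIVE_LOW);
            printf("dr6 for payload %#" PRIx64 " = %#" PRIx64 "\n", payload, dr6);
            return 0;
    }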

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com>
[Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:27 -05:00
Sean Christopherson
8e53324021 KVM: VMX: Convert vcpu_vmx.exit_reason to a union
Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32).  The
full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in
bits 15:0, and single-bit modifiers in bits 31:16.

Historically, KVM has only had to worry about handling the "failed
VM-Entry" modifier, which could only be set in very specific flows and
required dedicated handling.  I.e. manually stripping the FAILED_VMENTRY
bit was a somewhat viable approach.  But even with only a single bit to
worry about, KVM has had several bugs related to comparing a basic exit
reason against the full exit reason stored in vcpu_vmx.

Upcoming Intel features, e.g. SGX, will add new modifier bits that can
be set on more or less any VM-Exit, as opposed to the significantly more
restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off
flows isn't scalable.  Tracking exit reason in a union forces code to
explicitly choose between consuming the full exit reason and the basic
exit, and is a convenient way to document and access the modifiers.

No functional change intended.
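
A condensed, standalone sketch of the layout (the real union carries more
reserved and modifier bits; field names and positions here are approximations):

    #include <stdint.h>
    #include <stdio.h>

    union vmx_exit_reason {
            struct {
                    uint32_t basic             : 16;  /* bits 15:0  */
                    uint32_t reserved          : 11;  /* bits 26:16 */
                    uint32_t enclave_mode      : 1;   /* bit 27 */
                    uint32_t smi_pending_mtf   : 1;   /* bit 28 */
                    uint32_t smi_from_vmx_root : 1;   /* bit 29 */
                    uint32_t reserved30        : 1;   /* bit 30 */
                    uint32_t failed_vmentry    : 1;   /* bit 31 */
            };
            uint32_t full;
    };

    int main(void)
    {
            /* 0x21 = invalid guest state, with the failed VM-Entry modifier set. */
            union vmx_exit_reason exit_reason = { .full = 0x80000021u };

            printf("basic=%u failed_vmentry=%u\n",
                   exit_reason.basic, exit_reason.failed_vmentry);
            return 0;
    }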

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:21 -05:00
Paolo Bonzini
9a78e15802 KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
which may need to be loaded outside guest mode.  Therefore we cannot
WARN in that case.

However, that part of nested_get_vmcs12_pages is _not_ needed at
vmentry time.  Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
so that both vmentry and migration (and in the latter case, independent
of is_guest_mode) do the parts that are needed.

Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
Cc: <stable@vger.kernel.org> # 5.10.x
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25 18:54:09 -05:00
Maxim Levitsky
d51e1d3f6b KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration
Even when we are outside the nested guest, some vmcs02 fields
may not be in sync vs vmcs12.  This is intentional, even across
nested VM-exit, because the sync can be delayed until the nested
hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those
rarely accessed fields.

However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to
be able to restore it.  To fix that, call copy_vmcs02_to_vmcs12_rare()
before the vmcs12 contents are copied to userspace.

Fixes: 7952d769c2 ("KVM: nVMX: Sync rarely accessed guest fields only when needed")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25 18:52:09 -05:00
Maxim Levitsky
f2c7ef3ba9 KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit
It is possible to exit nested guest mode, entered by
svm_set_nested_state, prior to the first VM entry into it (e.g. due to a
pending event) if a nested run was not pending during the migration.

In this case we must not switch to the nested MSR permission bitmap.
Also add a warning to catch similar cases in the future.

Fixes: a7d5c7ce41 ("KVM: nSVM: delay MSR permission processing to first nested VM run")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210107093854.882483-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-07 18:11:35 -05:00
Yadong Qi
bf0cd88ce3 KVM: x86: emulate wait-for-SIPI and SIPI-VMExit
Background: We have a lightweight HV; it needs INIT-VMExit and
SIPI-VMExit to wake up APs for its guests since it does not monitor
the Local APIC. But currently the virtual wait-for-SIPI (WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit, which means
the L2 guest cannot wake up its APs.

According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.

In this patch:
    1. introduce SIPI exit reason,
    2. introduce wait-for-SIPI state for nVMX,
    3. advertise wait-for-SIPI support to guest.

When the L1 hypervisor is not monitoring the Local APIC, L0 needs to
emulate INIT-VMExit and SIPI-VMExit to L1 in order to emulate
INIT-SIPI-SIPI for L2. An L2 LAPIC write is trapped by the L0
hypervisor (KVM), which should emulate the INIT/SIPI VM-exit to the L1
hypervisor so that it can set the proper state for L2's vCPU.

Handling procedure:
Source vCPU:
    L2 write LAPIC.ICR(INIT).
    L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
       vCPU.
Target vCPU:
    L0 emulate an INIT VMExit to L1 if in guest mode.
    L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
    L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
       == WAIT_SIPI).

Source vCPU:
    L2 write LAPIC.ICR(SIPI).
    L0 trap LAPIC.ICR write(SIPI): inject a latched SIPI event to target
       vCPU.
Target vCPU:
    L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
    L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
    L0 resume to L2.
    L2 start-up.
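
A rough sketch of the L0-side handling described above; the constants match my
reading of the SDM (basic exit reason 4 is SIPI, guest activity state 3 is
wait-for-SIPI), but the function names and exact call sites are assumptions:

    #define EXIT_REASON_INIT_SIGNAL      3
    #define EXIT_REASON_SIPI_SIGNAL      4
    #define GUEST_ACTIVITY_WAIT_SIPI     3

    /* When a latched SIPI arrives for a vCPU that L1 left in wait-for-SIPI: */
    if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED && is_guest_mode(vcpu))
            nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0,
                              sipi_vector);   /* vector in the exit qualification */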

Signed-off-by: Yadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:09 -05:00
Sean Christopherson
c2fe3cd460 KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook
Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.

Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.

Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics.  This will be remedied in a future patch.

Fixes: 5e1746d620 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-11-15 09:49:07 -05:00
Linus Torvalds
f9a705ad1c ARM:
- New page table code for both hypervisor and guest stage-2
 - Introduction of a new EL2-private host context
 - Allow EL2 to have its own private per-CPU variables
 - Support of PMU event filtering
 - Complete rework of the Spectre mitigation
 
 PPC:
 - Fix for running nested guests with in-kernel IRQ chip
 - Fix race condition causing occasional host hard lockup
 - Minor cleanups and bugfixes
 
 x86:
 - allow trapping unknown MSRs to userspace
 - allow userspace to force #GP on specific MSRs
 - INVPCID support on AMD
 - nested AMD cleanup, on demand allocation of nested SVM state
 - hide PV MSRs and hypercalls for features not enabled in CPUID
 - new test for MSR_IA32_TSC writes from host and guest
 - cleanups: MMU, CPUID, shared MSRs
 - LAPIC latency optimizations and bugfixes
 
 For x86, also included in this pull request is a new alternative and
 (in the future) more scalable implementation of extended page tables
 that does not need a reverse map from guest physical addresses to
 host physical addresses.  For now it is disabled by default because
 it is still lacking a few of the existing MMU's bells and whistles.
 However it is a very solid piece of work and it is already available
 for people to hammer on it.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "For x86, there is a new alternative and (in the future) more scalable
  implementation of extended page tables that does not need a reverse
  map from guest physical addresses to host physical addresses.

  For now it is disabled by default because it is still lacking a few of
  the existing MMU's bells and whistles. However it is a very solid
  piece of work and it is already available for people to hammer on it.

  Other updates:

  ARM:
   - New page table code for both hypervisor and guest stage-2
   - Introduction of a new EL2-private host context
   - Allow EL2 to have its own private per-CPU variables
   - Support of PMU event filtering
   - Complete rework of the Spectre mitigation

  PPC:
   - Fix for running nested guests with in-kernel IRQ chip
   - Fix race condition causing occasional host hard lockup
   - Minor cleanups and bugfixes

  x86:
   - allow trapping unknown MSRs to userspace
   - allow userspace to force #GP on specific MSRs
   - INVPCID support on AMD
   - nested AMD cleanup, on demand allocation of nested SVM state
   - hide PV MSRs and hypercalls for features not enabled in CPUID
   - new test for MSR_IA32_TSC writes from host and guest
   - cleanups: MMU, CPUID, shared MSRs
   - LAPIC latency optimizations and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
  kvm: x86/mmu: NX largepage recovery for TDP MMU
  kvm: x86/mmu: Don't clear write flooding count for direct roots
  kvm: x86/mmu: Support MMIO in the TDP MMU
  kvm: x86/mmu: Support write protection for nesting in tdp MMU
  kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
  kvm: x86/mmu: Support dirty logging for the TDP MMU
  kvm: x86/mmu: Support changed pte notifier in tdp MMU
  kvm: x86/mmu: Add access tracking for tdp_mmu
  kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
  kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
  kvm: x86/mmu: Add TDP MMU PF handler
  kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
  kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
  KVM: Cache as_id in kvm_memory_slot
  kvm: x86/mmu: Add functions to handle changed TDP SPTEs
  kvm: x86/mmu: Allocate and free TDP MMU roots
  kvm: x86/mmu: Init / Uninit the TDP MMU
  kvm: x86/mmu: Introduce tdp_iter
  KVM: mmu: extract spte.h and spte.c
  KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
  ...
2020-10-23 11:17:56 -07:00
Linus Torvalds
6873139ed0 objtool changes for v5.10:
- Most of the changes are cleanups and reorganization to make the objtool code
    more arch-agnostic. This is in preparation for non-x86 support.
 
 Fixes:
 
  - KASAN fixes.
  - Handle unreachable trap after call to noreturn functions better.
  - Ignore unreachable fake jumps.
  - Misc smaller fixes & cleanups.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:
 "Most of the changes are cleanups and reorganization to make the
  objtool code more arch-agnostic. This is in preparation for non-x86
  support.

  Other changes:

   - KASAN fixes

   - Handle unreachable trap after call to noreturn functions better

   - Ignore unreachable fake jumps

   - Misc smaller fixes & cleanups"

* tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  perf build: Allow nested externs to enable BUILD_BUG() usage
  objtool: Allow nested externs to enable BUILD_BUG()
  objtool: Permit __kasan_check_{read,write} under UACCESS
  objtool: Ignore unreachable trap after call to noreturn functions
  objtool: Handle calling non-function symbols in other sections
  objtool: Ignore unreachable fake jumps
  objtool: Remove useless tests before save_reg()
  objtool: Decode unwind hint register depending on architecture
  objtool: Make unwind hint definitions available to other architectures
  objtool: Only include valid definitions depending on source file type
  objtool: Rename frame.h -> objtool.h
  objtool: Refactor jump table code to support other architectures
  objtool: Make relocation in alternative handling arch dependent
  objtool: Abstract alternative special case handling
  objtool: Move macros describing structures to arch-dependent code
  objtool: Make sync-check consider the target architecture
  objtool: Group headers to check in a single list
  objtool: Define 'struct orc_entry' only when needed
  objtool: Skip ORC entry creation for non-text sections
  objtool: Move ORC logic out of check()
  ...
2020-10-14 10:13:37 -07:00
Paolo Bonzini
729c15c20f KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES
We are going to use it for SVM too, so use a more generic name.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:49 -04:00
Aaron Lewis
476c9bd8e9 KVM: x86: Prepare MSR bitmaps for userspace tracked MSRs
Prepare vmx and svm for a subsequent change that ensures the MSR permission
bitmap is set to allow an MSR that userspace is tracking to force a vmx_vmexit
in the guest.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
[agraf: rebase, adapt SVM scheme to nested changes that came in between]
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-5-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:05 -04:00
Sean Christopherson
d85a8034c0 KVM: VMX: Rename "find_msr_entry" to "vmx_find_uret_msr"
Rename "find_msr_entry" to scope it to VMX and to associate it with
guest_uret_msrs.  Drop the "entry" so that the function name pairs with
the existing __vmx_find_uret_msr(), which intentionally uses a double
underscore prefix instead of appending "index" or "slot" as those names
are already claimed by other pieces of the user return MSR stack.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-13-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:58:00 -04:00
Sean Christopherson
eb3db1b137 KVM: VMX: Rename the "shared_msr_entry" struct to "vmx_uret_msr"
Rename struct "shared_msr_entry" to "vmx_uret_msr" to align with x86's
rename of "shared_msrs" to "user_return_msrs", and to call out that the
struct is specific to VMX, i.e. not part of the generic "shared_msrs"
framework.  Abbreviate "user_return" as "uret" to keep line lengths
marginally sane and code more or less readable.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:56 -04:00
Sean Christopherson
a128a934f2 KVM: VMX: Rename "vmx_find_msr_index" to "vmx_find_loadstore_msr_slot"
Add "loadstore" to vmx_find_msr_index() to differentiate it from the so
called shared MSRs helpers (which will soon be renamed), and replace
"index" with "slot" to better convey that the helper returns slot in the
array, not the MSR index (the value that gets stuffed into ECX).

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:55 -04:00
Sean Christopherson
ce833b2324 KVM: VMX: Prepend "MAX_" to MSR array size defines
Add "MAX" to the LOADSTORE and so called SHARED MSR defines to make it
more clear that the define controls the array size, as opposed to the
actual number of valid entries that are in the array.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923180409.32255-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:55 -04:00
Sean Christopherson
02f1965ff8 KVM: nVMX: Read EXIT_QUAL and INTR_INFO only when needed for nested exit
Read vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO only if the
VM-Exit is being reflected to L1 now that they are no longer passed
directly to the kvm_nested_vmexit tracepoint.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:53 -04:00
Sean Christopherson
cc167bd7ee KVM: x86: Use common definition for kvm_nested_vmexit tracepoint
Use the newly introduced TRACE_EVENT_KVM_EXIT to define the guts of
kvm_nested_vmexit so that it captures and prints the same information as
kvm_exit.  This has the bonus side effect of fixing the interrupt info
and error code printing for the case where they're invalid, e.g. if the
exit was a failed VM-Entry.  This also sets the stage for retrieving
EXIT_QUALIFICATION and VM_EXIT_INTR_INFO in nested_vmx_reflect_vmexit()
if and only if the VM-Exit is being routed to L1.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:52 -04:00
Sean Christopherson
f315f2b140 KVM: VMX: Add a helper to test for a valid error code given an intr info
Add a helper, is_exception_with_error_code(), to encapsulate the simple but
difficult-to-read code that checks for a valid exception with an error
code given a vmcs.VM_EXIT_INTR_INFO value.  The helper will gain another
user, vmx_get_exit_info(), in a future patch.
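
A standalone version of the check for illustration; the mask values follow the
VMX interruption-information layout (bit 31 = valid, bit 11 = error code
delivered) as I understand it:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define INTR_INFO_VALID_MASK            0x80000000u
    #define INTR_INFO_DELIVER_CODE_MASK     0x00000800u

    static bool is_exception_with_error_code(uint32_t intr_info)
    {
            const uint32_t mask = INTR_INFO_VALID_MASK | INTR_INFO_DELIVER_CODE_MASK;

            return (intr_info & mask) == mask;
    }

    int main(void)
    {
            printf("%d\n", is_exception_with_error_code(0x80000b0eu)); /* #PF, has error code */
            printf("%d\n", is_exception_with_error_code(0x80000306u)); /* #UD, no error code */
            return 0;
    }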

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:50 -04:00
Sean Christopherson
a9d7d76c66 KVM: x86: Read guest RIP from within the kvm_nested_vmexit tracepoint
Use kvm_rip_read() to read the guest's RIP for the nested VM-Exit
tracepoint instead of having the caller pass in an argument.  Params
that are passed into a tracepoint are evaluated even if the tracepoint
is disabled, i.e. passing in RIP for VMX incurs a VMREAD and retpoline
to retrieve a value that may never be used, e.g. if the exit is due to a
hardware interrupt.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923201349.16097-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:50 -04:00
Sean Christopherson
138534a810 KVM: nVMX: WARN on attempt to switch the currently loaded VMCS
WARN if KVM attempts to switch to the currently loaded VMCS.  Now that
nested_vmx_free_vcpu() doesn't blindly call vmx_switch_vmcs(), all paths
that lead to vmx_switch_vmcs() are implicitly guarded by guest vs. host
mode, e.g. KVM should never emulate VMX instructions when guest mode is
active, and nested_vmx_vmexit() should never be called when host mode is
active.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:49 -04:00
Sean Christopherson
ebec153a05 KVM: nVMX: Drop redundant VMCS switch and free_nested() call
Remove the explicit switch to vmcs01 and the call to free_nested() in
nested_vmx_free_vcpu().  free_nested(), which is called unconditionally
by vmx_leave_nested(), ensures vmcs01 is loaded prior to freeing vmcs02
and friends.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:46 -04:00
Sean Christopherson
df82a24b29 KVM: nVMX: Ensure vmcs01 is the loaded VMCS when freeing nested state
Add a WARN in free_nested() to ensure vmcs01 is loaded prior to freeing
vmcs02 and friends, and explicitly switch to vmcs01 if it's not.  KVM is
supposed to keep is_guest_mode() and loaded_vmcs==vmcs02 synchronized,
but bugs happen and freeing vmcs02 while it's in use will escalate a KVM
error to a use-after-free and potentially crash the kernel.

Do the WARN and switch even in the !vmxon case to help detect latent
bugs.  free_nested() is not a hot path, and the check is cheap.
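
The guard boils down to something like the following (a sketch; the
surrounding teardown is elided and helper names are taken from the VMX code as
I recall them):

    static void free_nested(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            /* Freeing vmcs02 while it is the loaded VMCS would turn a KVM bug
             * into a use-after-free; warn and switch back to vmcs01 first. */
            if (WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01))
                    vmx_switch_vmcs(vcpu, &vmx->vmcs01);

            /* ... existing teardown of vmcs02 and nested state ... */
    }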

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:46 -04:00
Sean Christopherson
c61ca2fcbc KVM: nVMX: Move free_nested() below vmx_switch_vmcs()
Move free_nested() down below vmx_switch_vmcs() so that a future patch
can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not
the loaded VMCS when freeing nested resources.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:45 -04:00
Sean Christopherson
2ba4493a8b KVM: nVMX: Explicitly check for valid guest state for !unrestricted guest
Call guest_state_valid() directly instead of querying emulation_required
when checking if L1 is attempting VM-Enter with invalid guest state.
If emulate_invalid_guest_state is false, KVM will fixup segment regs to
avoid emulation and will never set emulation_required, i.e. KVM will
incorrectly miss the associated consistency checks because the nested
path stuffs segments directly into vmcs02.

Opportunistically add Consistency Check tracing to make future debugging
suck a little less.

Fixes: 2bb8cafea8 ("KVM: vVMX: signal failure for nested VMEntry if emulation_required")
Fixes: 3184a995f7 ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:45 -04:00
Sean Christopherson
b89d5ad00e KVM: nVMX: Reload vmcs01 if getting vmcs12's pages fails
Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM
expects vmcs01 to be loaded when is_guest_mode() is false.

Fixes: 671ddc700f ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: stable@vger.kernel.org
Cc: Dan Cross <dcross@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:44 -04:00
Sean Christopherson
fc387d8daf KVM: nVMX: Reset the segment cache when stuffing guest segs
Explicitly reset the segment cache after stuffing guest segment regs in
prepare_vmcs02_rare().  Although the cache is reset when switching to
vmcs02, there is nothing that prevents KVM from re-populating the cache
prior to writing vmcs02 with vmcs12's values.  E.g. if the vCPU is
preempted after switching to vmcs02 but before prepare_vmcs02_rare(),
kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl()
and cache the stale vmcs02 value.  While the current code base only
caches stale data in the preemption case, it's theoretically possible
future code could read a segment register during the nested flow itself,
i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(),
although it did introduce the bug.

This manifests as an unexpected nested VM-Enter failure when running
with unrestricted guest disabled if the above preemption case coincides
with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3
to an L2 vCPU at CPL0.  stack_segment_valid() will see the new SS_SEL
but the old SS_AR_BYTES and incorrectly mark the guest state as invalid
due to SS.dpl != SS.rpl.

Don't bother updating the cache even though prepare_vmcs02_rare() writes
every segment.  With unrestricted guest, guest segments are almost never
read, let alone L2 guest segments.  On the other hand, populating the
cache requires a large number of memory writes, i.e. it's unlikely to be
a net win.  Updating the cache would be a win when unrestricted guest is
not supported, as guest_state_valid() will immediately cache all segment
registers.  But, nested virtualization without unrestricted guest is
dirt slow, saving some VMREADs won't change that, and every CPU
manufactured in the last decade supports unrestricted guest.  In other
words, the extra (minor) complexity isn't worth the trouble.

Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL
depending on when preemption occurs.  This is "ok" in that the usage is
imperfect by nature, i.e. it's used heuristically to improve performance
but doesn't affect functionality.  kvm_arch_vcpu_put() could be "fixed"
by also disabling preemption while loading segments, but that's
pointless and misleading as reading state from kvm_sched_{in,out}() is
guaranteed to see stale data in one form or another.  E.g. even if all
the usage of regs_avail is fixed to call kvm_register_mark_available()
after the associated state is set, the individual state might still be
stale with respect to the overall vCPU state.  I.e. making functional
decisions in an asynchronous hook is doomed from the get go.  Thankfully
KVM doesn't do that.

Fixes: de63ad4cf4 ("KVM: X86: implement the logic for spinlock optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:44 -04:00
Sean Christopherson
7f3603b631 KVM: VMX: Rename RDTSCP secondary exec control name to insert "ENABLE"
Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
preparation for consolidating the logic for adjusting secondary exec
controls based on the guest CPUID model.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200923165048.20486-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:30 -04:00
Sean Christopherson
fc595f3599 KVM: nVMX: Add VM-Enter failed tracepoints for super early checks
Add tracepoints for the early consistency checks in nested_vmx_run().
The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as
there is no architectural way to check VMCS.LAUNCH_STATE, and subtle
bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in
the L1 VMM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812180615.22372-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:23 -04:00
Sean Christopherson
25bb2cf971 KVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI
On successful nested VM-Enter, check for pending interrupts and convert
the highest priority interrupt to a pending posted interrupt if it
matches L2's notification vector.  If the vCPU receives a notification
interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
VM-Enter), the pending interrupt (for L1) should be recognized and
processed as a posted interrupt when interrupts become unblocked after
VM-Enter to L2.

This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
trying to inject an interrupt into L2 by setting the appropriate bit in
L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
method of manually moving the vector from PIR->vIRR/RVI).  KVM will
observe the IPI while the vCPU is in L1 context and so won't immediately
morph it to a posted interrupt for L2.  The pending interrupt will be
seen by vmx_check_nested_events(), cause KVM to force an immediate exit
after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
After handling the VM-Exit, L1 will see that L2 has a pending interrupt
in PIR, send another IPI, and repeat until L2 is killed.

Note, posted interrupts require virtual interrupt delivery, and virtual
interrupt delivery requires exit-on-interrupt, ergo interrupts will be
unconditionally unmasked on VM-Enter if posted interrupts are enabled.
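
Sketched in code, the post-VM-Enter fixup looks roughly like this (helper
names are my best recollection of the LAPIC API, not a verbatim excerpt):

    /* On successful nested VM-Enter: if the highest priority pending interrupt
     * is L2's notification vector, morph it into a pending posted interrupt. */
    if (kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
            vmx->nested.pi_pending = true;
            kvm_make_request(KVM_REQ_EVENT, vcpu);
            /* The notification IRQ was meant for L2; drop it from L1's IRR. */
            kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
    }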

Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200812175129.12172-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:22 -04:00
Krish Sadhukhan
bddd82d19e KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it
Currently, prepare_vmcs02_early() does not check if the "unrestricted guest"
VM-execution control in vmcs12 is turned off and leaves the corresponding
bit on in vmcs02. Due to this setting, vmentry checks which are supposed to
render the nested guest state as invalid when this VM-execution control is
not set, are passing in hardware.

This patch turns off the "unrestricted guest" VM-execution control in vmcs02
if vmcs12 has turned it off.
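
In code, the fix amounts to roughly the following in prepare_vmcs02_early()
(a sketch; constant names per the VMX headers):

    /* Do not leave "unrestricted guest" set in vmcs02 if L1 cleared it. */
    if (!(vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_UNRESTRICTED_GUEST))
            exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;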

Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20200921081027.23047-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:18 -04:00
Babu Moger
3f3393b3ce KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c
Handling of kvm_read/write_guest_virt*() errors can be moved to common
code. The same code can be used by both VMX and SVM.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:16 -04:00
Chenyi Qiang
efc831338b KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled
KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
system ioctl KVM_GET_MSR.  Add them to the setup of the nested VMX controls MSRs.

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 07:57:07 -04:00
Chenyi Qiang
c6b177a3be KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control
A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
in exit_ctls_high.

Fixes: 03a8871add ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
VM-{Entry,Exit} control")
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11 13:15:12 -04:00
Peter Shier
43fea4e425 KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
load_pdptrs to read the possibly updated PDPTEs from the guest
physical address referenced by CR3.  It loads them into
vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
vcpu->arch.regs_dirty.

At the subsequent assumed reentry into L2, the mmu will call
vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
if the L2 CRn write intercept always resumes L2.

The resume path calls vmx_check_nested_events which checks for
exceptions, MTF, and expired VMX preemption timers. If
vmx_check_nested_events finds any of these conditions pending it will
reflect the corresponding exit into L1. Live migration at this point
would also cause a missed immediate reentry into L2.

After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
values stored at last L2 exit. A repro of this bug showed L2 entering
triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.

When L2 is in PAE paging mode add a call to ept_load_pdptrs before
leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
vcpu->arch.walk_mmu->pdptrs[].
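
The fix is essentially a call like the one below on the nested VM-exit path
(a sketch; the exact location in the nested code is not reproduced here):

    /* Flush dirty PDPTEs into vmcs02 before leaving L2 so a later reentry
     * does not pick up stale vmcs12 values. */
    if (enable_ept && is_pae_paging(vcpu))
            ept_load_pdptrs(vcpu);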

Tested:
kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
Verified that test fails without the fix.

Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
failures.

Signed-off-by: Peter Shier <pshier@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200820230545.2411347-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11 13:15:09 -04:00
Julien Thierry
00089c048e objtool: Rename frame.h -> objtool.h
Header frame.h is getting more code annotations to help objtool analyze
object files.

Rename the file to objtool.h.

[ jpoimboe: add objtool.h to MAINTAINERS ]

Signed-off-by: Julien Thierry <jthierry@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
2020-09-10 10:43:13 -05:00
Linus Torvalds
921d2597ab s390: implement diag318
x86:
 * Report last CPU for debugging
 * Emulate smaller MAXPHYADDR in the guest than in the host
 * .noinstr and tracing fixes from Thomas
 * nested SVM page table switching optimization and fixes
 
 Generic:
 * Unify shadow MMU cache data structures across architectures

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "s390:
   - implement diag318

  x86:
   - Report last CPU for debugging
   - Emulate smaller MAXPHYADDR in the guest than in the host
   - .noinstr and tracing fixes from Thomas
   - nested SVM page table switching optimization and fixes

  Generic:
   - Unify shadow MMU cache data structures across architectures"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  KVM: SVM: Fix sev_pin_memory() error handling
  KVM: LAPIC: Set the TDCR settable bits
  KVM: x86: Specify max TDP level via kvm_configure_mmu()
  KVM: x86/mmu: Rename max_page_level to max_huge_page_level
  KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR
  KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch
  KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
  KVM: VMX: Make vmx_load_mmu_pgd() static
  KVM: x86/mmu: Add separate helper for shadow NPT root page role calc
  KVM: VMX: Drop a duplicate declaration of construct_eptp()
  KVM: nSVM: Correctly set the shadow NPT root level in its MMU role
  KVM: Using macros instead of magic values
  MIPS: KVM: Fix build error caused by 'kvm_run' cleanup
  KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF
  KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support
  KVM: VMX: optimize #PF injection when MAXPHYADDR does not match
  KVM: VMX: Add guest physical address check in EPT violation and misconfig
  KVM: VMX: introduce vmx_need_pf_intercept
  KVM: x86: update exception bitmap on CPUID changes
  KVM: x86: rename update_bp_intercept to update_exception_bitmap
  ...
2020-08-06 12:59:31 -07:00
Sean Christopherson
2a40b9001e KVM: x86: Pull the PGD's level from the MMU instead of recalculating it
Use the shadow_root_level from the current MMU as the root level for the
PGD, i.e. for VMX's EPTP.  This eliminates the weird dependency between
VMX and the MMU where both must independently calculate the same root
level for things to work correctly.  Temporarily keep VMX's calculation
of the level and use it to WARN if the incoming level diverges.

Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell,
and rename a 'cr3' param in the load_mmu_pgd prototype that managed to
survive the cr3 purge.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 18:16:47 -04:00
Paolo Bonzini
5e105c88ab KVM: nVMX: check for invalid hdr.vmx.flags
hdr.vmx.flags is meant for future extensions to the ABI; rejecting
invalid flags is necessary to avoid broken half-loads of the
nVMX state.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27 09:04:50 -04:00
Paolo Bonzini
0f02bd0ade KVM: nVMX: check for required but missing VMCS12 in KVM_SET_NESTED_STATE
A missing VMCS12 was not causing -EINVAL (it was just read with
copy_from_user, so it is not a security issue, but it is still
wrong).  Test for VMCS12 validity and reject the nested state
if a VMCS12 is required but not present.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-27 09:04:49 -04:00
Paolo Bonzini
a0c134347b KVM: VMX: introduce vmx_need_pf_intercept
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200710154811.418214-7-mgamal@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 17:01:52 -04:00
Paolo Bonzini
83d31e5271 KVM: nVMX: fixes for preemption timer migration
Commit 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration",
2020-06-01) accidentally broke nVMX live migration from older version
by changing the userspace ABI.  Restore it and, while at it, ensure
that vmx->nested.has_preemption_timer_deadline is always initialized
according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag.

Cc: Makarand Sonare <makarandsonare@google.com>
Fixes: 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 06:15:36 -04:00
Xiaoyao Li
7c1b761be0 KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid()
The name of the callback cpuid_update() is misleading in that it's not about
updating the vCPU's CPUID settings but rather about updating the vCPU's
configuration based on the CPUIDs. So rename it to vcpu_after_set_cpuid().

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-09 07:08:18 -04:00
Sean Christopherson
b2656e4d8b KVM: nVMX: Wrap VM-Fail valid path in generic VM-Fail helper
Add nested_vmx_fail() to wrap VM-Fail paths that _may_ result in VM-Fail
Valid to make it clear at the call sites that the Valid flavor isn't
guaranteed.
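
A sketch of the wrapper; nested_vmx_failInvalid()/nested_vmx_failValid() are
the existing helpers, while the "is there a current VMCS?" test is paraphrased
through a hypothetical predicate:

    static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
    {
            /* VM-Fail Valid writes the error number into the current VMCS,
             * which obviously can't be done without a current VMCS. */
            if (!nested_has_current_vmcs12(to_vmx(vcpu)))   /* hypothetical */
                    return nested_vmx_failInvalid(vcpu);

            return nested_vmx_failValid(vcpu, vm_instruction_error);
    }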

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200609015607.6994-1-sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08 16:21:47 -04:00
Sean Christopherson
02f5fb2e69 KVM: x86/mmu: Make .write_log_dirty a nested operation
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it
from the non-nested dirty log hooks.  And because it's a nested-only
operation.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08 16:21:38 -04:00
Sean Christopherson
fa71e9527f KVM: VMX: Use KVM_POSSIBLE_CR*_GUEST_BITS to initialize guest/host masks
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the
CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and
open coding the CR0 mask.  SVM doesn't utilize the masks, i.e. the masks
are effectively VMX specific even if they're not named as such.  This
avoids duplicate code, better documents the guest owned CR0 bit, and
eliminates the need for a build-time assertion to keep VMX and x86
synchronized.
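
A sketch of the resulting setup (the real code further trims the CR4 bits,
e.g. when EPT is disabled; this is illustrative only):

    vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
    vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);

    vcpu->arch.cr4_guest_owned_bits = KVM_POSSIBLE_CR4_GUEST_BITS;
    vmcs_writel(CR4_GUEST_HOST_MASK, ~vcpu->arch.cr4_guest_owned_bits);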

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-03 12:16:33 -04:00
Linus Torvalds
076f14be7f The X86 entry, exception and interrupt code rework
This all started about 6 month ago with the attempt to move the Posix CPU
 timer heavy lifting out of the timer interrupt code and just have lockless
 quick checks in that code path. Trivial 5 patches.
 
 This unearthed an inconsistency in the KVM handling of task work and the
 review requested to move all of this into generic code so other
 architectures can share.
 
 Valid request and solved with another 25 patches but those unearthed
 inconsistencies vs. RCU and instrumentation.
 
 Digging into this made it obvious that there are quite some inconsistencies
 vs. instrumentation in general. The int3 text poke handling in particular
 was completely unprotected and with the batched update of trace events even
 more likely to expose to endless int3 recursion.
 
 In parallel the RCU implications of instrumenting fragile entry code came
 up in several discussions.
 
 The conclusion of the X86 maintainer team was to go all the way and make
 the protection against any form of instrumentation of fragile and dangerous
 code paths enforceable and verifiable by tooling.
 
 A first batch of preparatory work hit mainline with commit d5f744f9a2.
 
 The (almost) full solution introduced a new code section '.noinstr.text'
 into which all code which needs to be protected from instrumentation of all
 sorts goes into. Any call into instrumentable code out of this section has
 to be annotated. objtool has support to validate this. Kprobes now excludes
 this section fully which also prevents BPF from fiddling with it and all
 'noinstr' annotated functions also keep ftrace off. The section, kprobes
 and objtool changes are already merged.
 
 The major changes coming with this are:
 
     - Preparatory cleanups
 
     - Annotating of relevant functions to move them into the noinstr.text
       section or enforcing inlining by marking them __always_inline so the
       compiler cannot misplace or instrument them.
 
     - Splitting and simplifying the idtentry macro maze so that it is now
       clearly separated into simple exception entries and the more
       interesting ones which use interrupt stacks and have the paranoid
       handling vs. CR3 and GS.
 
     - Move quite some of the low level ASM functionality into C code:
 
        - enter_from and exit to user space handling. The ASM code now calls
          into C after doing the really necessary ASM handling and the return
          path goes back out without bells and whistles in ASM.
 
        - exception entry/exit got the equivalent treatment
 
        - move all IRQ tracepoints from ASM to C so they can be placed as
          appropriate which is especially important for the int3 recursion
          issue.
 
     - Consolidate the declaration and definition of entry points between 32
       and 64 bit. They share a common header and macros now.
 
     - Remove the extra device interrupt entry maze and just use the regular
       exception entry code.
 
     - All ASM entry points except NMI are now generated from the shared header
       file and the corresponding macros in the 32 and 64 bit entry ASM.
 
     - The C code entry points are consolidated as well with the help of
       DEFINE_IDTENTRY*() macros. This allows to ensure at one central point
       that all corresponding entry points share the same semantics. The
       actual function body for most entry points is in an instrumentable
       and sane state.
 
       There are special macros for the more sensitive entry points,
       e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
       They allow to put the whole entry instrumentation and RCU handling
       into safe places instead of the previous pray that it is correct
       approach.
 
     - The INT3 text poke handling is now completely isolated and the
       recursion issue banned. Aside of the entry rework this required other
       isolation work, e.g. the ability to force inline bsearch.
 
     - Prevent #DB on fragile entry code, entry relevant memory and disable
       it on NMI, #MC entry, which allowed to get rid of the nested #DB IST
       stack shifting hackery.
 
     - A few other cleanups and enhancements which have been made possible
       through this and already merged changes, e.g. consolidating and
       further restricting the IDT code so the IDT table becomes RO after
       init which removes yet another popular attack vector
 
     - About 680 lines of ASM maze are gone.
 
 There are a few open issues:
 
    - An escape out of the noinstr section in the MCE handler which needs
      some more thought but under the aspect that MCE is a complete
      trainwreck by design and the probability of surviving it is low, this was
      not high on the priority list.
 
    - Paravirtualization
 
      When PV is enabled then objtool complains about a bunch of indirect
      calls out of the noinstr section. There are a few straight forward
      ways to fix this, but the other issues vs. general correctness were
      more pressing than parawitz.
 
    - KVM
 
      KVM is inconsistent as well. Patches have been posted, but they have
      not yet been commented on or picked up by the KVM folks.
 
    - IDLE
 
      Pretty much the same problems can be found in the low level idle code
      especially the parts where RCU stopped watching. This was beyond the
      scope of the more obvious and exposable problems and is on the todo
      list.
 
 The lesson learned from this brain melting exercise to morph the evolved
 code base into something which can be validated and understood is that once
 again the violation of the most important engineering principle
 "correctness first" has caused quite a few people to spend valuable time on
 problems which could have been avoided in the first place. The "features
 first" tinkering mindset really has to stop.
 
 With that I want to say thanks to everyone involved in contributing to this
 effort. Special thanks go to the following people (alphabetical order):
 
    Alexandre Chartre
    Andy Lutomirski
    Borislav Petkov
    Brian Gerst
    Frederic Weisbecker
    Josh Poimboeuf
    Juergen Gross
    Lai Jiangshan
    Marco Elver
    Paolo Bonzini
    Paul McKenney
    Peter Zijlstra
    Vitaly Kuznetsov
    Will Deacon

Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 entry updates from Thomas Gleixner:
 "The x86 entry, exception and interrupt code rework

  This all started about 6 month ago with the attempt to move the Posix
  CPU timer heavy lifting out of the timer interrupt code and just have
  lockless quick checks in that code path. Trivial 5 patches.

  This unearthed an inconsistency in the KVM handling of task work and
  the review requested to move all of this into generic code so other
  architectures can share.

  Valid request and solved with another 25 patches but those unearthed
  inconsistencies vs. RCU and instrumentation.

  Digging into this made it obvious that there are quite some
  inconsistencies vs. instrumentation in general. The int3 text poke
  handling in particular was completely unprotected and with the batched
  update of trace events even more likely to expose to endless int3
  recursion.

  In parallel the RCU implications of instrumenting fragile entry code
  came up in several discussions.

  The conclusion of the x86 maintainer team was to go all the way and
  make the protection against any form of instrumentation of fragile and
  dangerous code paths enforceable and verifiable by tooling.

  A first batch of preparatory work hit mainline with commit
  d5f744f9a2 ("Pull x86 entry code updates from Thomas Gleixner")

  That (almost) full solution introduced a new code section
  '.noinstr.text' into which all code which needs to be protected from
  instrumentation of all sorts goes into. Any call into instrumentable
  code out of this section has to be annotated. objtool has support to
  validate this.

  Kprobes now excludes this section fully which also prevents BPF from
  fiddling with it and all 'noinstr' annotated functions also keep
  ftrace off. The section, kprobes and objtool changes are already
  merged.

  The major changes coming with this are:

    - Preparatory cleanups

    - Annotating of relevant functions to move them into the
      noinstr.text section or enforcing inlining by marking them
      __always_inline so the compiler cannot misplace or instrument
      them.

    - Splitting and simplifying the idtentry macro maze so that it is
      now clearly separated into simple exception entries and the more
      interesting ones which use interrupt stacks and have the paranoid
      handling vs. CR3 and GS.

    - Move quite some of the low level ASM functionality into C code:

       - enter_from and exit to user space handling. The ASM code now
         calls into C after doing the really necessary ASM handling and
         the return path goes back out without bells and whistles in
         ASM.

       - exception entry/exit got the equivalent treatment

       - move all IRQ tracepoints from ASM to C so they can be placed as
         appropriate which is especially important for the int3
         recursion issue.

    - Consolidate the declaration and definition of entry points between
      32 and 64 bit. They share a common header and macros now.

    - Remove the extra device interrupt entry maze and just use the
      regular exception entry code.

    - All ASM entry points except NMI are now generated from the shared
      header file and the corresponding macros in the 32 and 64 bit
      entry ASM.

    - The C code entry points are consolidated as well with the help of
      DEFINE_IDTENTRY*() macros. This makes it possible to ensure at one
      central point that all corresponding entry points share the same
      semantics. The actual function body for most entry points is in an
      instrumentable and sane state.

      There are special macros for the more sensitive entry points, e.g.
      INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF.
      They allow putting the whole entry instrumentation and RCU handling
      into safe places instead of the previous pray-that-it-is-correct
      approach.

    - The INT3 text poke handling is now completely isolated and the
      recursion issue banned. Aside from the entry rework this required
      other isolation work, e.g. the ability to force inline bsearch.

    - Prevent #DB on fragile entry code and entry-relevant memory, and
      disable it on NMI and #MC entry, which made it possible to get rid
      of the nested #DB IST stack shifting hackery.

    - A few other cleanups and enhancements which have been made
      possible through this and already merged changes, e.g.
      consolidating and further restricting the IDT code so the IDT
      table becomes RO after init which removes yet another popular
      attack vector

    - About 680 lines of ASM maze are gone.

  There are a few open issues:

   - An escape out of the noinstr section in the MCE handler which needs
     some more thought, but given that MCE is a complete trainwreck by
     design and the probability of surviving it is low, this was not high
     on the priority list.

   - Paravirtualization

     When PV is enabled then objtool complains about a bunch of indirect
     calls out of the noinstr section. There are a few straightforward
     ways to fix this, but the other issues vs. general correctness were
     more pressing than paravirt.

   - KVM

     KVM is inconsistent as well. Patches have been posted, but they
     have not yet been commented on or picked up by the KVM folks.

   - IDLE

     Pretty much the same problems can be found in the low level idle
     code, especially the parts where RCU stopped watching. This was
     beyond the scope of the more obvious and exposable problems and is
     on the todo list.

  The lesson learned from this brain melting exercise to morph the
  evolved code base into something which can be validated and understood
  is that once again the violation of the most important engineering
  principle "correctness first" has caused quite a few people to spend
  valuable time on problems which could have been avoided in the first
  place. The "features first" tinkering mindset really has to stop.

  With that I want to say thanks to everyone involved in contributing to
  this effort. Special thanks go to the following people (alphabetical
  order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian
  Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai
  Jiangshan, Marco Elver, Paolo Bonzini, Paul McKenney, Peter Zijlstra,
  Vitaly Kuznetsov, and Will Deacon"

* tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits)
  x86/entry: Force rcu_irq_enter() when in idle task
  x86/entry: Make NMI use IDTENTRY_RAW
  x86/entry: Treat BUG/WARN as NMI-like entries
  x86/entry: Unbreak __irqentry_text_start/end magic
  x86/entry: __always_inline CR2 for noinstr
  lockdep: __always_inline more for noinstr
  x86/entry: Re-order #DB handler to avoid *SAN instrumentation
  x86/entry: __always_inline arch_atomic_* for noinstr
  x86/entry: __always_inline irqflags for noinstr
  x86/entry: __always_inline debugreg for noinstr
  x86/idt: Consolidate idt functionality
  x86/idt: Cleanup trap_init()
  x86/idt: Use proper constants for table size
  x86/idt: Add comments about early #PF handling
  x86/idt: Mark init only functions __init
  x86/entry: Rename trace_hardirqs_off_prepare()
  x86/entry: Clarify irq_{enter,exit}_rcu()
  x86/entry: Remove DBn stacks
  x86/entry: Remove debug IDT frobbing
  x86/entry: Optimize local_db_save() for virt
  ...
2020-06-13 10:05:47 -07:00
Paolo Bonzini
77f81f37fb Merge branch 'kvm-basic-exit-reason' into HEAD
Using a topic branch so that stable branches can simply cherry-pick the
patch.

Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11 12:35:14 -04:00
Sean Christopherson
2ebac8bb3c KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON,
when determining whether a nested VM-Exit should be reflected into L1 or
handled by KVM in L0.

For better or worse, the switch statement in nested_vmx_exit_reflected()
currently defaults to "true", i.e. reflects any nested VM-Exit without
dedicated logic.  Because the case statements only contain the basic
exit reason, any VM-Exit with modifier bits set will be reflected to L1,
even if KVM intended to handle it in L0.
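
As a rough illustration (a standalone sketch, not KVM's code; the
constants follow the architectural encoding of vmcs.EXIT_REASON), routing
on the basic exit reason means masking off the modifier bits before
comparing:

  #include <stdint.h>
  #include <stdio.h>

  #define EXIT_REASON_MCE_DURING_VMENTRY   41
  #define VMX_EXIT_REASONS_FAILED_VMENTRY  0x80000000u

  /* Route on bits 15:0 only, ignoring modifier bits such as bit 31. */
  static int is_mce_during_vmentry(uint32_t exit_reason)
  {
      return (uint16_t)exit_reason == EXIT_REASON_MCE_DURING_VMENTRY;
  }

  int main(void)
  {
      uint32_t full = EXIT_REASON_MCE_DURING_VMENTRY |
                      VMX_EXIT_REASONS_FAILED_VMENTRY;

      /* A switch on the full value would miss this exit and reflect it
       * to L1; the basic-reason check still recognizes it. */
      printf("%d\n", is_mce_during_vmentry(full));   /* prints 1 */
      return 0;
  }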

Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY,
i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to
L1, as "failed VM-Entry" is the only modifier that KVM can currently
encounter.  The SMM modifiers will never be generated as KVM doesn't
support/employ an SMI Transfer Monitor.  Ditto for "exit from enclave",
as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to
enter an enclave in a KVM guest (L1 or L2).

Fixes: 644d711aa0 ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit")
Cc: Jim Mattson <jmattson@google.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-11 11:28:11 -04:00
Peter Zijlstra
84b6a34915 x86/entry: Optimize local_db_save() for virt
Because DRn access is 'difficult' with virt, while a DR7 read is cheaper
than a cacheline miss on native, add a virt-specific fast path to
local_db_save() so that DRn is not touched at all when breakpoints are
not in use.
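
A minimal sketch of the idea (simplified, with hypothetical helper names;
get_debugreg()/set_debugreg() are the existing accessors):

  static unsigned long local_db_save(void)
  {
      unsigned long dr7;

      /* Hypothetical helpers: on a hypervisor, consult a cheap
       * "breakpoints in use" indicator first so DRn is never touched in
       * the common case; on bare metal the DR7 read itself is cheap. */
      if (running_on_hypervisor() && !breakpoints_in_use())
          return 0;

      get_debugreg(dr7, 7);
      set_debugreg(0UL, 7);
      return dr7;
  }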

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org
2020-06-11 15:15:22 +02:00
Vitaly Kuznetsov
7a35e515a7 KVM: VMX: Properly handle kvm_read/write_guest_virt*() result
Syzbot reports the following issue:

WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
 kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
Call Trace:
...
RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
...
 nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
 handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
 handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
 vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067

'exception' we're trying to inject with kvm_inject_emulated_page_fault()
comes from:

  nested_vmx_get_vmptr()
   kvm_read_guest_virt()
     kvm_read_guest_virt_helper()
       vcpu->arch.walk_mmu->gva_to_gpa()

but it is only set when GVA to GPA conversion fails. In case it doesn't but
we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
'exception'. This happens when the argument is MMIO.

Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
KVM code where kvm_read/write_guest_virt*() return result is mishandled.
VMX instructions along with INVPCID have the same issue. This was already
noticed before, e.g. see commit 541ab2aeb2 ("KVM: x86: work around
leak of uninitialized stack contents") but was never fully fixed.

KVM could've handled the request correctly by going to userspace and
performing I/O, but there doesn't seem to be a good reason for such requests
in the first place.

Introduce vmx_handle_memory_failure() as an interim solution.

Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
to have a good enum describing this tristate, so just add "int *ret" to
the nested_vmx_get_vmptr() interface to pass the information.
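
To illustrate the resulting calling convention (a sketch of the caller
pattern, not the exact kernel code):

  gpa_t vmptr;
  int r;

  /* Non-zero return means failure.  Either a #PF has already been
   * injected (resume the guest) or an internal error has been set up
   * (exit to userspace); the value stored in "r" tells the caller which
   * of the two it is and becomes the exit-handler return value. */
  if (nested_vmx_get_vmptr(vcpu, &vmptr, &r))
      return r;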

Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-08 07:59:08 -04:00
Makarand Sonare
8d7fbf01f9 KVM: selftests: VMX preemption timer migration test
When a nested VM with a VMX-preemption timer is migrated, verify that the
nested VM and its parent VM observe the VMX-preemption timer exit close to
the original expiration deadline.

Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20200526215107.205814-3-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01 04:26:10 -04:00
Peter Shier
850448f35a KVM: nVMX: Fix VMX preemption timer migration
Add new field to hold preemption timer expiration deadline
appended to struct kvm_vmx_nested_state_hdr. This is to prevent
the first VM-Enter after migration from incorrectly restarting the timer
with the full timer value instead of partially decayed timer value.
KVM_SET_NESTED_STATE restarts timer using migrated state regardless
of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER.
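
A sketch of the shape of the uapi change (field names here are
illustrative, not necessarily the exact ones):

  struct kvm_vmx_nested_state_hdr {
      __u64 vmxon_pa;
      __u64 vmcs12_pa;

      struct {
          __u16 flags;
      } smm;

      __u32 flags;
      /* Absolute deadline of the VMX-preemption timer, saved by
       * KVM_GET_NESTED_STATE and used by KVM_SET_NESTED_STATE to restart
       * the timer with the partially decayed value after migration. */
      __u64 preemption_timer_deadline;
  };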

Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")

Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Message-Id: <20200526215107.205814-2-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01 04:26:10 -04:00
Vitaly Kuznetsov
68fd66f100 KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Currently, the APF mechanism relies on the #PF abuse where the token is
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extend the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.

While at it, rename 'reason' to 'flags'. This doesn't change the semantics
as we only have reasons '1' and '2' and these can be treated as bit flags
but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
making 'reason' name misleading.

The newly introduced apf_put_user_ready() temporarily puts both flags and
token information; this will be changed to put only the token when we
switch to interrupt based notifications.
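
For illustration, the resulting shared structure would look roughly like
this (layout and padding shown are indicative only):

  struct kvm_vcpu_pv_apf_data {
      /* 'page not present' notifications, delivered via #PF */
      __u32 flags;                    /* was 'reason' */

      /* 'page ready' notifications, for interrupt based delivery */
      __u32 token;

      __u8 pad[56];
      __u32 enabled;
  };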

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01 04:26:06 -04:00
Paolo Bonzini
cc440cdad5 KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE
Similar to VMX, the state that is captured through the currently available
IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was
running at the moment when the process was interrupted to save its state.

In particular, the SVM-specific state for nested virtualization includes
the L1 saved state (including the interrupt flag), the cached L2 controls,
and the GIF.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01 04:26:05 -04:00
Jim Mattson
ada0098df6 KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absolute
Prepare for migration of this hrtimer, by changing it from relative to
absolute. (I couldn't get migration to work with a relative timer.)

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200508203643.85477-3-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15 12:26:25 -04:00
Jim Mattson
1739f3d56d KVM: nVMX: Really make emulated nested preemption timer pinned
The PINNED bit is ignored by hrtimer_init. It is only considered when
starting the timer.

When the hrtimer isn't pinned to the same logical processor as the
vCPU thread to be interrupted, the emulated VMX-preemption timer
often fails to adhere to the architectural specification.
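
A minimal sketch of the fix (the timer field and deadline variable stand
in for KVM's actual ones):

  /* The mode argument of hrtimer_init() does not pin the timer ... */
  hrtimer_init(&vmx->nested.preemption_timer, CLOCK_MONOTONIC,
               HRTIMER_MODE_ABS_PINNED);

  /* ... pinning only takes effect via the mode used when arming it,
   * so pass the PINNED variant to hrtimer_start() as well. */
  hrtimer_start(&vmx->nested.preemption_timer, deadline,
                HRTIMER_MODE_ABS_PINNED);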

Fixes: f15a75eedc ("KVM: nVMX: make emulated nested preemption timer pinned")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Message-Id: <20200508203643.85477-2-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15 12:26:24 -04:00
Sean Christopherson
6c1c6e5835 KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup()
Remove a 'struct kvm_x86_ops' param that got left behind when the nested
ops were moved to their own struct.

Fixes: 33b2217245 ("KVM: x86: move nested-related kvm_x86_ops to a separate struct")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15 12:26:24 -04:00
Sean Christopherson
9e826feb8f KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_*
Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit
as the vmcs12 fields are updated in vmx_set_msr(), and writes to the
corresponding MSRs are always intercepted by KVM when running L2.

Dropping the propagation was intended to be done in the same commit that
added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was
only shuffled around[2][3].

[1] https://patchwork.kernel.org/patch/10933215
[2] https://patchwork.kernel.org/patch/10933215/#22682289
[3] https://lore.kernel.org/patchwork/patch/1088643

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200428231025.12766-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15 12:26:17 -04:00
Sean Christopherson
68cda40d9f KVM: nVMX: Tweak handling of failure code for nested VM-Enter failure
Use an enum for passing around the failure code for a failed VM-Enter
that results in VM-Exit to provide a level of indirection from the final
resting place of the failure code, vmcs.EXIT_QUALIFICATION.  The exit
qualification field is an unsigned long, e.g. passing around
'u32 exit_qual' throws up red flags as it suggests KVM may be dropping
bits when reporting errors to L1.  This is a red herring because the
only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely
close to overflowing a u32.

Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated
by the MSR load list, which returns the (1-based) entry that failed, and
the number of MSRs to load is a 32-bit VMCS field.  At first blush, it
would appear that overflowing a u32 is possible, but the number of MSRs
that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC).

In other words, there are two completely disparate types of data that
eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is
an 'unsigned long' in nature.  This was presumably the reasoning for
switching to 'u32' when the related code was refactored in commit
ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()").

Using an enum for the failure code addresses the technically-possible-
but-will-never-happen scenario where Intel defines a failure code that
doesn't fit in a 32-bit integer.  The enum variables and values will
either be automatically sized (gcc 5.4 behavior) or be subjected to some
combination of truncation.  The former case will simply work, while the
latter will trigger a compile-time warning unless the compiler is being
particularly unhelpful.
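
A sketch of such an enum, using the failure codes architecturally defined
for vmcs.EXIT_QUALIFICATION on VM-Entry failure (names are illustrative):

  enum vm_entry_failure_code {
      ENTRY_FAIL_DEFAULT        = 0,
      ENTRY_FAIL_PDPTE          = 2,
      ENTRY_FAIL_NMI            = 3,
      ENTRY_FAIL_VMCS_LINK_PTR  = 4,
  };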

Separating the failure code from the failed MSR entry allows for
disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the
conundrum where KVM has to choose between 'u32 exit_qual' and tracking
values as 'unsigned long' that have no business being tracked as such.
To cement the split, set vmcs12->exit_qualification directly from the
entry error code or failed MSR index instead of bouncing through a local
variable.

Opportunistically rename the variables in load_vmcs12_host_state() and
vmx_set_nested_state() to call out that they're ignored, set exit_reason
on demand on nested VM-Enter failure, and add a comment in
nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap.

No functional change intended.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-15 12:07:31 -04:00
Sean Christopherson
0cc69204e7 KVM: nVMX: Unconditionally validate CR3 during nested transitions
Unconditionally check the validity of the incoming CR3 during nested
VM-Enter/VM-Exit to avoid invoking kvm_read_cr3() in the common case
where the guest isn't using PAE paging.  If vmcs.GUEST_CR3 hasn't yet
been cached (common case), kvm_read_cr3() will trigger a VMREAD.  The
VMREAD (~30 cycles) alone is likely slower than nested_cr3_valid()
(~5 cycles if vcpu->arch.maxphyaddr gets a cache hit), and the poor
exchange only gets worse when retpolines are enabled as the call to
kvm_x86_ops.cache_reg() will incur a retpoline (60+ cycles).
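
Conceptually (a sketch rather than the exact diff, with a hypothetical
helper for the PAE-only work), the cheap validity check is hoisted in
front of the potentially VMREAD-backed read:

  /* Reject an illegal CR3 before ever touching vmcs.GUEST_CR3. */
  if (!nested_cr3_valid(vcpu, cr3))
      return -EINVAL;

  /* Only a guest using PAE paging needs the kvm_read_cr3()/PDPTE work. */
  if (!nested_ept && is_pae_paging(vcpu) && cr3 != kvm_read_cr3(vcpu))
      load_pdptrs_for_new_cr3(vcpu, cr3);     /* hypothetical helper */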

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200502043234.12481-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:15:09 -04:00
Sean Christopherson
1af1bb0562 KVM: nVMX: Skip IBPB when temporarily switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when temporarily loading vmcs02 to synchronize it to vmcs12, i.e.
give copy_vmcs02_to_vmcs12_rare() the same treatment as
vmx_switch_vmcs().

Make vmx_vcpu_load() static now that it's only referenced within vmx.c.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200506235850.22600-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:15:03 -04:00
Sean Christopherson
5c911beff2 KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02
Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS
switch when running with spectre_v2_user=on/auto if the switch is
between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02.
The IBPB is intended to prevent one guest from attacking another, which
is unnecessary in the nested case as it's the same guest from KVM's
perspective.

This all but eliminates the overhead observed for nested VMX transitions
when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which
can be significant, e.g. roughly 3x on current systems.
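
Conceptually (a sketch with an illustrative flag name;
indirect_branch_prediction_barrier() is the existing primitive):

  /* vmcs01 <-> vmcs02 switches stay within the same guest, so the
   * barrier is only needed when the new VMCS may belong to a different
   * guest, i.e. on a real vCPU load/migration. */
  if (!switching_within_same_guest)
      indirect_branch_prediction_barrier();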

Reported-by: Alexander Graf <graf@amazon.com>
Cc: KarimAllah Raslan <karahmed@amazon.de>
Cc: stable@vger.kernel.org
Fixes: 15d4507152 ("KVM/x86: Add IBPB support")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200501163117.4655-1-sean.j.christopherson@intel.com>
[Invert direction of bool argument. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:15:02 -04:00
Sean Christopherson
1cd2f0b0dd KVM: nVMX: Prioritize SMI over nested IRQ/NMI
Check for an unblocked SMI in vmx_check_nested_events() so that pending
SMIs are correctly prioritized over IRQs and NMIs when the latter events
will trigger VM-Exit.  This also fixes an issue where an SMI that was
marked pending while processing a nested VM-Enter wouldn't trigger an
immediate exit, i.e. would be incorrectly delayed until L2 happened to
take a VM-Exit.

Fixes: 64d6067057 ("KVM: x86: stubs for SMM support")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:43 -04:00
Sean Christopherson
15ff0b450b KVM: nVMX: Preserve IRQ/NMI priority irrespective of exiting behavior
Short circuit vmx_check_nested_events() if an unblocked IRQ/NMI is
pending and needs to be injected into L2, priority between coincident
events is not dependent on exiting behavior.

Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:42 -04:00
Sean Christopherson
429ab576f3 KVM: nVMX: Report NMIs as allowed when in L2 and Exit-on-NMI is set
Report NMIs as allowed when the vCPU is in L2 and L2 is being run with
Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective
in this case.
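
Illustrative check (a sketch of the idea, using existing helpers):

  /* From L1's perspective an NMI is never blocked while L2 runs with
   * "NMI exiting" enabled: delivering it simply triggers a VM-Exit. */
  if (is_guest_mode(vcpu) && nested_exit_on_nmi(vcpu))
      return true;    /* NMI injection is allowed */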

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:32 -04:00
Sean Christopherson
d2060bd42e KVM: nVMX: Open a window for pending nested VMX preemption timer
Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and
use it to effectively open a window for servicing the expired timer.
Like pending SMIs on VMX, opening a window simply means requesting an
immediate exit.

This fixes a bug where an expired VMX preemption timer (for L2) will be
delayed and/or lost if a pending exception is injected into L2.  The
pending exception is rightly prioritized by vmx_check_nested_events()
and injected into L2, with the preemption timer left pending.  Because
no window opened, L2 is free to run uninterrupted.
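
Sketch of the idea (hook and flag names here are illustrative):

  /* In the generic event-injection path: if the nested "hypervisor
   * timer" already expired but could not be serviced, request an
   * immediate exit so vmx_check_nested_events() gets another shot. */
  if (is_guest_mode(vcpu) &&
      kvm_x86_ops.nested_ops->hv_timer_pending &&
      kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
      req_immediate_exit = true;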

Fixes: f4124500c2 ("KVM: nVMX: Fully emulate preemption timer")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com>
[Check it in kvm_vcpu_has_events too, to ensure that the preemption
 timer is serviced promptly even if the vCPU is halted and L1 is not
 intercepting HLT. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:27 -04:00
Sean Christopherson
6ce347af14 KVM: nVMX: Preserve exception priority irrespective of exiting behavior
Short circuit vmx_check_nested_events() if an exception is pending and
needs to be injected into L2, priority between coincident events is not
dependent on exiting behavior.  This fixes a bug where a single-step #DB
that is not intercepted by L1 is incorrectly dropped due to servicing a
VMX Preemption Timer VM-Exit.

Injected exceptions also need to be blocked if nested VM-Enter is
pending or an exception was already injected, otherwise injecting the
exception could overwrite an existing event injection from L1.
Technically, this scenario should be impossible, i.e. KVM shouldn't
inject its own exception during nested VM-Enter.  This will be addressed
in a future patch.

Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g.
SMI should take priority over VM-Exit on NMI/INTR, and NMI that is
injected into L2 should take priority over VM-Exit INTR.  This will also
be addressed in a future patch.

Fixes: b6b8a1451f ("KVM: nVMX: Rework interception of IRQs and NMIs")
Reported-by: Jim Mattson <jmattson@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-13 12:14:25 -04:00
Paolo Bonzini
4aef2ec902 Merge branch 'kvm-amd-fixes' into HEAD 2020-05-13 12:14:05 -04:00
Sean Christopherson
f9336e3281 KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warning
Use BUG() in the impossible-to-hit default case when switching on the
scope of INVEPT to squash a warning with clang 11 due to clang treating
the BUG_ON() as conditional.

  >> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free'
     is used uninitialized whenever 'if' condition is false
     [-Wsometimes-uninitialized]
                   BUG_ON(1);

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: ce8fe7b77b ("KVM: nVMX: Free only the affected contexts when emulating INVEPT")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-04 11:58:55 -04:00
Sean Christopherson
87796555d4 KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32
Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), the
EXIT_QUALIFICATION field is naturally sized, not a 32-bit field.

The bug is most easily observed by doing VMXON (or any VMX instruction)
in L2 with a negative displacement, in which case dropping the upper
bits on nested VM-Exit results in L1 calculating the wrong virtual
address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields:

  Unhandled cpu exception 14 #PF at ip 0000000000400553
  rbp=0000000000537000 cr2=0000000100536ff8

Fixes: fbdd502503 ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-24 12:51:21 -04:00
Sean Christopherson
9bd4af240f KVM: nVMX: Drop a redundant call to vmx_get_intr_info()
Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from
vmx_get_intr_info() that was inadvertently introduced along with the
caching mechanism.  EXIT_REASON_EXCEPTION_NMI, the only consumer of
intr_info, populates the variable before using it.

Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 18:24:28 -04:00
Paolo Bonzini
33b2217245 KVM: x86: move nested-related kvm_x86_ops to a separate struct
Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to
nested virtualization into a separate struct.
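
A rough sketch of the resulting grouping (member names are illustrative,
shortened from the ops discussed below):

  struct kvm_x86_nested_ops {
      int (*check_events)(struct kvm_vcpu *vcpu);
      int (*get_state)(struct kvm_vcpu *vcpu,
                       struct kvm_nested_state __user *user_state,
                       unsigned int size);
      int (*set_state)(struct kvm_vcpu *vcpu,
                       struct kvm_nested_state __user *user_state,
                       struct kvm_nested_state *kvm_state);
      int (*enable_evmcs)(struct kvm_vcpu *vcpu,
                          uint16_t *vmcs_version);
      uint16_t (*get_evmcs_version)(struct kvm_vcpu *vcpu);
  };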

As a result, these ops will always be non-NULL on VMX.  This is not a problem:

* check_nested_events is only called if is_guest_mode(vcpu) returns true

* get_nested_state treats VMXOFF state the same as nested being disabled

* set_nested_state fails if you attempt to set nested state while
  nesting is disabled

* nested_enable_evmcs could already be called on a CPU without VMX enabled
  in CPUID.

* nested_get_evmcs_version was fixed in the previous patch

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-23 09:04:57 -04:00
Sean Christopherson
b8d295f96b KVM: nVMX: Remove non-functional "support" for CR3 target values
Remove all references to cr3_target_value[0-3] and replace the fields
in vmcs12 with "dead_space" to preserve the vmcs12 layout.  KVM doesn't
support emulating CR3-target values, despite a variety of code that
implies otherwise, as KVM unconditionally reports '0' for the number of
supported CR3-target values.

This technically fixes a bug where KVM would incorrectly allow VMREAD
and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3].  Per
Intel's SDM, the number of supported CR3-target values reported in
VMX_MISC also enumerates the existence of the associated VMCS fields:

  If a future implementation supports more than 4 CR3-target values, they
  will be encoded consecutively following the 4 encodings given here.

Alternatively, the "bug" could be fixed by actually advertising support
for 4 CR3-target values, but that'd likely just enable kvm-unit-tests
given that no one has complained about lack of support for going on ten
years, e.g. KVM, Xen and HyperV don't use CR3-target values.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:09 -04:00
Sean Christopherson
8791585837 KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the
nomenclature in .get_exit_info()), and use it to cache VMX's
vmcs.EXIT_INTR_INFO.  Drop a comment in vmx_recover_nmi_blocking() that
is obsoleted by the generic caching mechanism.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:07 -04:00
Sean Christopherson
5addc23519 KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the
nomenclature in .get_exit_info()), and use it to cache VMX's
vmcs.EXIT_QUALIFICATION.
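
A sketch of the accessor pattern this enables (close to, but not
necessarily identical with, the actual helper):

  static unsigned long vmx_get_exit_qual(struct kvm_vcpu *vcpu)
  {
      struct vcpu_vmx *vmx = to_vmx(vcpu);

      /* Read vmcs.EXIT_QUALIFICATION at most once per VM-Exit. */
      if (!kvm_register_is_available(vcpu, VCPU_EXREG_EXIT_INFO_1)) {
          kvm_register_mark_available(vcpu, VCPU_EXREG_EXIT_INFO_1);
          vmx->exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
      }
      return vmx->exit_qualification;
  }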

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:07 -04:00
Sean Christopherson
ec0241f3bb KVM: nVMX: Drop manual clearing of segment cache on nested VMCS switch
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that
the entire register cache is reset when switching the active VMCS, e.g.
vmx_segment_cache_test_set() will reset the segment cache due to
VCPU_EXREG_SEGMENTS being unavailable.

Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked
by the nested code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:06 -04:00
Sean Christopherson
e5d03de593 KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switch
Reset the per-vCPU available and dirty register masks when switching
between vmcs01 and vmcs02, as the masks track state relative to the
current VMCS.  The stale masks don't cause problems in the current code
base because the registers are either unconditionally written on nested
transitions or, in the case of segment registers, have an additional
tracker that is manually reset.

Note, by dropping (previously implicitly, now explicitly) the dirty mask
when switching the active VMCS, KVM is technically losing writes to the
associated fields.  But, the only regs that can be dirtied (RIP, RSP and
PDPTRs) are unconditionally written on nested transitions, e.g. explicit
writeback is a waste of cycles, and a WARN_ON would be rather pointless.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:06 -04:00
Sean Christopherson
9932b49e5a KVM: nVMX: Invoke ept_save_pdptrs() if and only if PAE paging is enabled
Invoke ept_save_pdptrs() when restoring L1's host state on a "late"
VM-Fail if and only if PAE paging is enabled.  This saves a CALL in the
common case where L1 is a 64-bit host, and avoids incorrectly marking
the PDPTRs as dirty.

WARN if ept_save_pdptrs() is called with PAE disabled now that the
nested usage pre-checks is_pae_paging().  Barring a bug in KVM's MMU,
attempting to read the PDPTRs with PAE disabled is now impossible.
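
Sketch of the guards (placement is illustrative):

  /* On the "late" VM-Fail unwind path: */
  if (is_pae_paging(vcpu))
      ept_save_pdptrs(vcpu);

  /* And inside ept_save_pdptrs() itself: */
  if (WARN_ON_ONCE(!is_pae_paging(vcpu)))
      return;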

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415203454.8296-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:06 -04:00
Sean Christopherson
4dcefa312a KVM: nVMX: Rename exit_reason to vm_exit_reason for nested VM-Exit
Use "vm_exit_reason" for code related to injecting a nested VM-Exit to
VM-Exits to make it clear that nested_vmx_vmexit() expects the full exit
eason, not just the basic exit reason.  The basic exit reason (bits 15:0
of vmcs.VM_EXIT_REASON) is colloquially referred to as simply "exit
reason".

Note, other flows, e.g. vmx_handle_exit(), are intentionally left as is.
A future patch will convert vmx->exit_reason to a union + bit-field, and
the exempted flows will interact with the unionized form of "exit_reason".

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-10-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:05 -04:00
Sean Christopherson
2a7833899f KVM: nVMX: Cast exit_reason to u16 to check for nested EXTERNAL_INTERRUPT
Explicitly check only the basic exit reason when emulating an external
interrupt VM-Exit in nested_vmx_vmexit().  Checking the full exit reason
doesn't currently cause problems, but only because the only exit reason
modifier supported by KVM is FAILED_VMENTRY, which is mutually exclusive
with EXTERNAL_INTERRUPT.  Future modifiers, e.g. ENCLAVE_MODE, will
coexist with EXTERNAL_INTERRUPT.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-9-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:05 -04:00
Sean Christopherson
f47baaed4f KVM: nVMX: Pull exit_reason from vcpu_vmx in nested_vmx_reflect_vmexit()
Grab the exit reason from the vcpu struct in nested_vmx_reflect_vmexit()
instead of having the exit reason explicitly passed from the caller.
This fixes a discrepancy between VM-Fail and VM-Exit handling, as the
VM-Fail case is already handled by checking vcpu_vmx, e.g. the exit
reason previously passed on the stack is bogus if vmx->fail is set.

Not taking the exit reason on the stack also avoids having to document
that nested_vmx_reflect_vmexit() requires the full exit reason, as
opposed to just the basic exit reason, which is not at all obvious since
the only usages of the full exit reason are for tracing and way down in
prepare_vmcs12() where it's propagated to vmcs12.

No functional change intended.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-8-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:04 -04:00
Sean Christopherson
1d283062c9 KVM: nVMX: Drop a superfluous WARN on reflecting EXTERNAL_INTERRUPT
Drop the WARN in nested_vmx_reflect_vmexit() that fires if KVM attempts
to reflect an external interrupt.  The WARN is blatantly impossible to
hit now that nested_vmx_l0_wants_exit(), which is called from
nested_vmx_reflect_vmexit(), unconditionally returns true for
EXTERNAL_INTERRUPT.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20200415175519.14230-7-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-04-21 09:13:04 -04:00